[00:00:05] PROBLEM - SSH on mw1015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:02:24] RECOVERY - nutcracker port on mw1002 is OK: TCP OK - 0.000 second response time on port 11212
[00:02:35] RECOVERY - RAID on mw1012 is OK: OK: no RAID installed
[00:03:14] RECOVERY - Disk space on mw1002 is OK: DISK OK
[00:03:25] PROBLEM - puppet last run on mw1012 is CRITICAL: CRITICAL: puppet fail
[00:03:34] RECOVERY - DPKG on mw1002 is OK: All packages OK
[00:04:04] RECOVERY - dhclient process on mw1002 is OK: PROCS OK: 0 processes with command name dhclient
[00:04:24] RECOVERY - nutcracker process on mw1002 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[00:04:24] RECOVERY - SSH on mw1015 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0)
[00:07:34] RECOVERY - puppet last run on mw1012 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[00:08:35] PROBLEM - nutcracker port on mw1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:09:45] PROBLEM - DPKG on mw1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:10:25] PROBLEM - dhclient process on mw1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:10:35] PROBLEM - SSH on mw1015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:10:44] PROBLEM - nutcracker process on mw1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:11:35] PROBLEM - Disk space on mw1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:12:14] RECOVERY - RAID on mw1015 is OK: OK: no RAID installed
[00:12:34] RECOVERY - SSH on mw1015 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0)
[00:16:25] PROBLEM - nutcracker process on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:16:54] PROBLEM - puppet last run on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:17:06] PROBLEM - configured eth on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:17:16] PROBLEM - DPKG on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:17:25] PROBLEM - RAID on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:17:25] PROBLEM - SSH on mw1008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:21:15] PROBLEM - dhclient process on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:21:15] PROBLEM - Disk space on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:22:45] PROBLEM - nutcracker port on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:22:54] PROBLEM - salt-minion processes on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:24:44] RECOVERY - nutcracker process on mw1008 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[00:24:44] RECOVERY - nutcracker port on mw1008 is OK: TCP OK - 0.000 second response time on port 11212
[00:24:54] RECOVERY - salt-minion processes on mw1008 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[00:27:34] RECOVERY - Disk space on mw1008 is OK: DISK OK
[00:33:55] PROBLEM - Disk space on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:34:34] RECOVERY - Disk space on mw1002 is OK: DISK OK
[00:35:15] PROBLEM - nutcracker process on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:35:24] PROBLEM - nutcracker port on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:35:26] PROBLEM - salt-minion processes on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:39:35] RECOVERY - puppet last run on mw1015 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[00:40:45] PROBLEM - Disk space on mw1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:46:04] ganglia and wmflabs seem awfully slow
[00:50:50] ori ^ ?
[00:51:12] are these servers ok? it took over a minute to load ganglia graphs
[00:53:39] yurik: I’m seeing similar issues but was assuming it’s local to my connection… where are you located at the moment?
[00:53:54] andrewbogott, SEA airport
[00:54:03] ok, not local then :(
[00:54:12] production wiki seems fine
[00:54:45] wikitech slow for you too?
[00:54:50] yes, a bit
[00:54:50] I agree that normal wikis seem ok
[00:55:15] hm, I would blame the misc-web varnishes but wikitech isn’t behind them
[00:55:31] ganglia looks just fine from Germany
[00:55:44] so yes, wikitech & wmflabs & ganglia - slow, wikipedias / mediawiki.org / grafana - fast
[00:56:21] hoo, try http://searchdata.wmflabs.org/maps/
[00:58:11] 35 requests 2.89s (according to Firefox)
[00:58:56] interesting. It has been 15 minutes, still loading
[00:59:26] could this be just one of the links?
[00:59:27] mtr it?
[00:59:31] ??
[00:59:46] "mtr combines the functionality of the traceroute and ping programs in a single network diagnostic tool."
[01:03:02] hoo, https://phabricator.wikimedia.org/P2469
[01:03:53] andrewbogott, can you run the same to compare? ^
[01:04:45] RECOVERY - SSH on mw1002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0)
[01:04:45] RECOVERY - nutcracker process on mw1002 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[01:04:45] RECOVERY - nutcracker port on mw1002 is OK: TCP OK - 0.000 second response time on port 11212
[01:04:45] RECOVERY - RAID on mw1002 is OK: OK: no RAID installed
[01:04:45] RECOVERY - configured eth on mw1002 is OK: OK - interfaces up
[01:05:05] RECOVERY - NTP on mw1002 is OK: NTP OK: Offset -0.0003345012665 secs
[01:05:06] RECOVERY - puppet last run on mw1002 is OK: OK: Puppet is currently enabled, last run 2 hours ago with 0 failures
[01:05:11] yurik: it has dynamic output, how did you decide when to capture it?
[01:05:24] RECOVERY - salt-minion processes on mw1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[01:05:30] andrewbogott, i waited for about 50 packets, and hit "p" to pause
[01:05:34] it's aggregate
[01:05:44] RECOVERY - Disk space on mw1002 is OK: DISK OK
[01:05:55] RECOVERY - DPKG on mw1002 is OK: All packages OK
[01:05:56] ah, ok, I see the loss % is settling down now
[01:06:18] are you using the same mtr searchdata.wmflabs.org server?
[01:06:34] RECOVERY - dhclient process on mw1002 is OK: PROCS OK: 0 processes with command name dhclient
[01:06:44] um, my paste is horribly formatted but I added it to your paste
[01:06:47] yes, same host
[01:07:10] andrewbogott, what if you wrap it with ```
[01:07:15] try editing it
[01:07:23] or i can add it to the description
[01:07:45] andrewbogott, not ', `
[01:07:48] the backtick
[01:08:03] better
[01:08:12] much
[01:08:13] thx
[01:08:14] RECOVERY - Host mr1-ulsfo.oob is UP: PING WARNING - Packet loss = 16%, RTA = 1007.95 ms
[01:08:43] unfortunately I don’t know what’s normal in this context.
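(Editor's aside on the diagnostic above: the paste at P2469 came from mtr's interactive view, paused by hand after ~50 packets. mtr also has a non-interactive report mode that prints the same aggregate loss/latency table after a fixed number of probe cycles, which is easier to paste verbatim. A minimal sketch of capturing that from Python, assuming the mtr binary is installed; the hostname is the one being debugged above:)

```python
# Minimal sketch: run mtr non-interactively and capture its aggregate report,
# roughly what was produced by hand above (wait ~50 packets, pause, copy).
# Assumes the mtr binary is installed and on PATH.
import subprocess

def mtr_report(host: str, cycles: int = 50) -> str:
    """Return mtr's report-mode output after `cycles` probe rounds."""
    result = subprocess.run(
        ["mtr", "--report", "--report-cycles", str(cycles), host],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

if __name__ == "__main__":
    print(mtr_report("searchdata.wmflabs.org"))
```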
[01:08:47] bblack, ^
[01:08:56] bblack, https://phabricator.wikimedia.org/P2469
[01:08:56] hey
[01:08:59] paravoid, hi
[01:09:05] I'm inflight with very slow wifi
[01:09:10] i highly doubt 2000+ is normal
[01:09:30] what is your IP?
[01:09:42] no, 2000 is not normal for sure
[01:09:46] 104.129.192.111
[01:10:19] got it
[01:11:38] paravoid: gerrit is slow, phab is not. Maybe that’s a clue.
[01:11:39] !log deactivating eqiad<->GTT BGP peering, reported network issues (P2469)
[01:11:41] paravoid, at least you have inflight wifi. I almost never get it overseas
[01:11:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:11:58] how are things looking now?
[01:12:03] paravoid: fixed
[01:12:08] great
[01:12:24] yep, works for me too
[01:12:25] thx!
[01:12:30] thanks paravoid
[01:12:32] andrewbogott: yes, it's a clue: gerrit is in eqiad, phab is behind misc-web-lb and hence geoip'ed probably to ulsfo
[01:12:55] achievement unlocked: fix a site issue while in-flight
[01:12:59] ah, I thought gerrit was also misc-web. Makes sense.
[01:13:54] PROBLEM - puppet last run on cp3033 is CRITICAL: CRITICAL: puppet fail
[01:15:08] (CR) Yuvipanda: "merged" [puppet] - https://gerrit.wikimedia.org/r/263229 (https://phabricator.wikimedia.org/T123192) (owner: Petrb)
[01:15:15] yay
[01:16:14] PROBLEM - puppet last run on mw1004 is CRITICAL: CRITICAL: Puppet has 51 failures
[01:19:34] PROBLEM - Disk space on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:19:44] PROBLEM - nutcracker process on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:19:56] PROBLEM - puppet last run on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:20:05] PROBLEM - DPKG on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:20:05] PROBLEM - configured eth on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:20:05] PROBLEM - SSH on mw1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:20:34] PROBLEM - dhclient process on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:20:35] PROBLEM - RAID on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:21:14] PROBLEM - salt-minion processes on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:21:15] PROBLEM - nutcracker port on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:25:55] RECOVERY - nutcracker process on mw1011 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[01:29:08] paravoid, btw, don't we have some automated detection for this?
[01:29:26] i thought we were paying some big money to a monitoring site
[01:29:36] watchmouse is useless
[01:29:44] we're not paying watchmouse
[01:29:55] we're paying catchpoint and it should have detected that, yeah
[01:30:07] but it's also pretty useless in my experience
[01:30:55] "Catchpoint is a therapeutic community interest company that ..."
[01:31:06] * Reedy grins
[01:31:56] sigh... i wonder if we can somehow build our own detection based on the varnish stream itself - e.g. if connections from one of our peers suddenly drop/become fewer in rate/...?
[01:32:23] not something i want to do though - i would rather outsource this
[01:32:24] PROBLEM - nutcracker process on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:32:44] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
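(Editor's aside: the detection idea floated at 01:31:56 - watch per-peer traffic derived from the Varnish request stream and alert when one network's request rate suddenly drops - could look something like the sketch below. This is purely illustrative, not anything that existed at the time: the per-AS rate feed is an assumption, not a real API, and the window and threshold are arbitrary.)

```python
# Hypothetical sketch of the per-peer rate-drop detector discussed above.
# `observe` is fed one data point per peer per minute (requests/min, e.g.
# aggregated from the Varnish request stream by origin AS) and flags a peer
# whose current rate falls far below its recent average.
from collections import defaultdict, deque
from typing import Deque, Dict

WINDOW_MINUTES = 30   # minutes of history to average over
DROP_RATIO = 0.3      # alert if current rate < 30% of the recent average

history: Dict[str, Deque[float]] = defaultdict(
    lambda: deque(maxlen=WINDOW_MINUTES)
)

def observe(peer_as: str, requests_per_min: float) -> bool:
    """Record one minute of traffic for a peer; True means 'looks anomalous'."""
    past = history[peer_as]
    anomalous = False
    if len(past) == past.maxlen:  # only judge once a full window exists
        avg = sum(past) / len(past)
        anomalous = avg > 0 and requests_per_min < DROP_RATIO * avg
    past.append(requests_per_min)
    return anomalous
```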
[01:33:05] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[01:33:35] PROBLEM - Host mw1007 is DOWN: PING CRITICAL - Packet loss = 100%
[01:35:34] RECOVERY - nutcracker process on mw1008 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[01:35:45] RECOVERY - nutcracker port on mw1008 is OK: TCP OK - 0.000 second response time on port 11212
[01:35:45] RECOVERY - salt-minion processes on mw1011 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[01:35:45] RECOVERY - nutcracker port on mw1011 is OK: TCP OK - 0.000 second response time on port 11212
[01:35:56] RECOVERY - salt-minion processes on mw1008 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[01:36:05] RECOVERY - puppet last run on mw1008 is OK: OK: Puppet is currently enabled, last run 1 hour ago with 0 failures
[01:36:14] RECOVERY - Disk space on mw1011 is OK: DISK OK
[01:36:24] RECOVERY - nutcracker process on mw1011 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[01:36:25] RECOVERY - dhclient process on mw1008 is OK: PROCS OK: 0 processes with command name dhclient
[01:36:25] RECOVERY - Disk space on mw1008 is OK: DISK OK
[01:36:25] RECOVERY - configured eth on mw1008 is OK: OK - interfaces up
[01:36:25] RECOVERY - DPKG on mw1008 is OK: All packages OK
[01:36:35] RECOVERY - puppet last run on mw1011 is OK: OK: Puppet is currently enabled, last run 47 minutes ago with 0 failures
[01:36:44] RECOVERY - SSH on mw1011 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0)
[01:36:44] RECOVERY - SSH on mw1008 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0)
[01:36:44] RECOVERY - RAID on mw1008 is OK: OK: no RAID installed
[01:36:44] RECOVERY - DPKG on mw1011 is OK: All packages OK
[01:36:45] RECOVERY - configured eth on mw1011 is OK: OK - interfaces up
[01:37:05] RECOVERY - dhclient process on mw1011 is OK: PROCS OK: 0 processes with command name dhclient
[01:37:14] RECOVERY - RAID on mw1011 is OK: OK: no RAID installed
[01:38:18] operations, netops: User connectivity issues to wikipedias; fine to phabricator et al - https://phabricator.wikimedia.org/T123211#1923763 (Reedy) NEW
[01:39:52] Hi! I cannot look at my watchlist at wikidata; it seems that some data are corrupted. While I can see https://www.wikidata.org/wiki/Special:Watchlist?days=3&namespace=2&associated=1&action=submit I can *not* see https://www.wikidata.org/wiki/Special:Watchlist?days=3&namespace=0&associated=1&action=submit i.e. the main namespace; I am using the account [[d:user:I18n]]
[01:40:54] RECOVERY - puppet last run on cp3033 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[01:40:56] What do you see?
[01:41:01] Does it fail to load?
[01:41:09] Do you have a lot of pages on it?
[01:41:10] This has been happening for one week. It works everywhere else (at 700 other WMF wikis). It works on other namespaces as well. If data is corrupted it is a major issue. I am a faultfinder with 1,300 written bug reports prior to 2012 at mediazilla
[01:41:33] Reedy I get the general error for heavy traffic
[01:41:39] varnish error?
[01:42:25] My native language is German. I do not understand the meaning of the word varnish.
[01:42:34] it's a piece of software
[01:42:45] Can you paste the actual error you're getting?
[01:43:14] RECOVERY - puppet last run on mw1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[01:43:39] It happens whenever I try to use [[Special:Watchlist]]. Today I had the idea to try some namespaces by manipulating the server URLs from w:en:
[01:43:45] RECOVERY - nutcracker port on mw1007 is OK: TCP OK - 0.000 second response time on port 11212
[01:43:55] RECOVERY - Host mw1007 is UP: PING OK - Packet loss = 0%, RTA = 1.75 ms
[01:44:09] moment Reedy, it takes 3 minutes
[01:44:16] RECOVERY - dhclient process on mw1007 is OK: PROCS OK: 0 processes with command name dhclient
[01:44:16] RECOVERY - nutcracker process on mw1007 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[01:44:24] RECOVERY - configured eth on mw1007 is OK: OK - interfaces up
[01:44:24] RECOVERY - DPKG on mw1007 is OK: All packages OK
[01:44:36] RECOVERY - RAID on mw1007 is OK: OK: no RAID installed
[01:44:36] RECOVERY - salt-minion processes on mw1007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[01:44:50] Reedy 503 Service Temporarily Unavailable - the tab was still open
[01:45:25] RECOVERY - Disk space on mw1007 is OK: DISK OK
[01:45:35] RECOVERY - SSH on mw1007 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0)
[01:45:55] RECOVERY - puppet last run on mw1003 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[01:46:48] Do you have a lot of pages on your watchlist?
[01:46:56] specifically, in that ns?
[01:47:23] not as many as at [[w:en:]] there might be 3 to 5 thousand
[01:47:56] wikidata likely does some things to the watchlist queries
[01:48:03] I'd suggest filing a bug in phabricator
[01:48:14] Most people are travelling atm, so not a great deal that can be done
[01:48:23] Reedy https://www.wikidata.org/wiki/Special:Watchlist?days=3&namespace=828&associated=1&action=submit is OK
[01:48:25] I presume it's an SQL query timeout or something
[01:48:25] RECOVERY - puppet last run on mw1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[01:50:27] Reedy https://www.wikidata.org/wiki/Special:Watchlist?days=0&namespace=10&associated=1&action=submit but the list is empty; I'll try 14
[01:51:12] Reedy 1,703 pages on your watchlist, not separately counting talk pages.
[01:51:52] gangleri|home: You're going to have to file a bug in phabricator
[01:53:07] Reedy I can neither log in at phabricator nor at my email account in Iceland - can I write it at a wiki page?
[01:53:42] Reedy https://www.wikidata.org/wiki/Special:EditWatchlist works for me
[01:53:52] Why can't you log into phabricator?
[01:55:46] If you've got a wikipedia account, you've got a mediawiki account, and such, you can log into phabricator
[01:55:51] editwatchlist is completely different
[01:57:16] I forgot my password. I used a PC with Windows 10; now I am on Ubuntu
[01:57:55] trying to log in
[02:00:03] You don't know your wiki password?
[02:02:15] You could post at https://meta.wikimedia.org/wiki/Tech or whatever the relevant wikidata page is
[02:11:06] gangleri|home: I filed https://phabricator.wikimedia.org/T123213 for you
[02:11:58] thanks!
[02:13:26] gangleri|home: when you try to log into phab, are you using the "Login with Mediawiki" button under the login field, or are you putting your wiki login details directly into the fields on the login page?
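(Editor's aside on the 01:48:25 guess: Special:Watchlist is backed by a join between the watchlist and recentchanges tables, filtered by namespace, so a watchlist with thousands of main-namespace entries can make that join slow enough to hit a query timeout and return the 503 seen above. The sketch below is illustrative only - table and column names follow the public MediaWiki schema, but the exact query Wikidata ran may have differed:)

```python
# Illustrative only: the rough shape of the SQL behind Special:Watchlist,
# matching the guess above that the main-namespace variant times out.
# Table/column names follow the public MediaWiki schema; user id, namespace,
# and cutoff are placeholders for the values in the URLs above.
WATCHLIST_QUERY = """
SELECT rc_namespace, rc_title, rc_timestamp
FROM recentchanges
JOIN watchlist
  ON wl_namespace = rc_namespace AND wl_title = rc_title
WHERE wl_user = %(user_id)s
  AND rc_namespace = %(namespace)s   -- 0 = the failing main namespace
  AND rc_timestamp > %(cutoff)s      -- days=3 in the URLs above
ORDER BY rc_timestamp DESC
"""
```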
[02:17:57] Reedy p858snake I clicked on the sunflower and somehow got in; I managed to edit the report
[02:18:43] I have deleted 80 pages so far and will see when there is a breakthrough
[02:25:43] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.9) (duration: 10m 39s)
[02:25:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:32:37] !log l10nupdate@tin ResourceLoader cache refresh completed at Mon Jan 11 02:32:37 UTC 2016 (duration 6m 55s)
[02:32:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[04:10:51] (PS3) Andrew Bogott: Send email to project admins if puppet runs are failing. [puppet] - https://gerrit.wikimedia.org/r/262856 (https://phabricator.wikimedia.org/T121773)
[04:11:52] (CR) jenkins-bot: [V: -1] Send email to project admins if puppet runs are failing. [puppet] - https://gerrit.wikimedia.org/r/262856 (https://phabricator.wikimedia.org/T121773) (owner: Andrew Bogott)
[04:13:35] (PS4) Andrew Bogott: Send email to project admins if puppet runs are failing. [puppet] - https://gerrit.wikimedia.org/r/262856 (https://phabricator.wikimedia.org/T121773)
[04:14:28] (CR) jenkins-bot: [V: -1] Send email to project admins if puppet runs are failing. [puppet] - https://gerrit.wikimedia.org/r/262856 (https://phabricator.wikimedia.org/T121773) (owner: Andrew Bogott)
[04:20:12] (PS5) Andrew Bogott: Send email to project admins if puppet runs are failing. [puppet] - https://gerrit.wikimedia.org/r/262856 (https://phabricator.wikimedia.org/T121773)
[04:50:34] PROBLEM - SSH on mw1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:50:44] PROBLEM - DPKG on mw1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:50:45] PROBLEM - puppet last run on mw1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:51:14] PROBLEM - nutcracker process on mw1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:51:34] PROBLEM - dhclient process on mw1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:51:35] PROBLEM - RAID on mw1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:51:55] PROBLEM - salt-minion processes on mw1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:52:05] PROBLEM - nutcracker port on mw1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:52:05] PROBLEM - configured eth on mw1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:52:06] PROBLEM - Disk space on mw1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:53:06] RECOVERY - nutcracker process on mw1003 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[04:53:25] RECOVERY - dhclient process on mw1003 is OK: PROCS OK: 0 processes with command name dhclient
[04:53:34] RECOVERY - RAID on mw1003 is OK: OK: no RAID installed
[04:53:45] RECOVERY - salt-minion processes on mw1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[04:54:04] RECOVERY - nutcracker port on mw1003 is OK: TCP OK - 0.000 second response time on port 11212
[04:54:04] RECOVERY - configured eth on mw1003 is OK: OK - interfaces up
[04:54:05] RECOVERY - Disk space on mw1003 is OK: DISK OK
[04:54:34] RECOVERY - SSH on mw1003 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0)
[04:54:44] RECOVERY - DPKG on mw1003 is OK: All packages OK
[04:54:45] RECOVERY - puppet last run on mw1003 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[04:57:35] PROBLEM - puppet last run on mw1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:57:36] PROBLEM - nutcracker process on mw1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:57:36] PROBLEM - salt-minion processes on mw1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:57:44] PROBLEM - dhclient process on mw1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:58:05] PROBLEM - dhclient process on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:58:06] PROBLEM - RAID on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:58:16] PROBLEM - puppet last run on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:58:24] PROBLEM - Disk space on mw1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:58:35] PROBLEM - configured eth on mw1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:58:45] PROBLEM - SSH on mw1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:58:45] PROBLEM - SSH on mw1016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:58:55] PROBLEM - DPKG on mw1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:59:04] PROBLEM - RAID on mw1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:59:04] PROBLEM - configured eth on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:59:15] PROBLEM - configured eth on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:59:24] PROBLEM - SSH on mw1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:59:24] PROBLEM - salt-minion processes on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:59:34] PROBLEM - nutcracker port on mw1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:59:35] PROBLEM - DPKG on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:59:45] PROBLEM - dhclient process on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:59:45] PROBLEM - DPKG on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:59:45] PROBLEM - RAID on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:59:46] PROBLEM - nutcracker port on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:00:06] PROBLEM - salt-minion processes on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:00:06] PROBLEM - nutcracker process on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:00:25] PROBLEM - puppet last run on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:02:35] PROBLEM - nutcracker port on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:04:35] PROBLEM - nutcracker process on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:05:46] PROBLEM - Disk space on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:06:15] PROBLEM - puppet last run on mw1005 is CRITICAL: CRITICAL: puppet fail
[05:10:24] PROBLEM - Disk space on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:15:35] PROBLEM - configured eth on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:15:55] PROBLEM - RAID on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:16:05] PROBLEM - puppet last run on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:16:24] PROBLEM - dhclient process on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:16:25] PROBLEM - nutcracker port on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:17:14] PROBLEM - DPKG on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:18:24] RECOVERY - nutcracker port on mw1014 is OK: TCP OK - 0.000 second response time on port 11212
[05:18:34] RECOVERY - nutcracker port on mw1005 is OK: TCP OK - 0.000 second response time on port 11212
[05:18:34] RECOVERY - nutcracker process on mw1014 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[05:18:34] RECOVERY - dhclient process on mw1014 is OK: PROCS OK: 0 processes with command name dhclient
[05:18:34] RECOVERY - salt-minion processes on mw1014 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[05:19:15] RECOVERY - Disk space on mw1014 is OK: DISK OK
[05:19:15] PROBLEM - Disk space on mw1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:19:25] PROBLEM - SSH on mw1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:19:36] PROBLEM - RAID on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:19:44] PROBLEM - nutcracker port on mw1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:19:45] PROBLEM - SSH on mw1015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:19:55] PROBLEM - dhclient process on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:20:24] PROBLEM - configured eth on mw1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:20:24] PROBLEM - dhclient process on mw1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:20:24] PROBLEM - nutcracker process on mw1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:20:24] PROBLEM - DPKG on mw1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:20:24] RECOVERY - dhclient process on mw1005 is OK: PROCS OK: 0 processes with command name dhclient
[05:20:25] PROBLEM - puppet last run on mw1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:20:45] PROBLEM - RAID on mw1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:20:46] PROBLEM - puppet last run on mw1011 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[05:21:54] RECOVERY - configured eth on mw1005 is OK: OK - interfaces up
[05:22:54] PROBLEM - configured eth on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:23:05] PROBLEM - DPKG on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:23:24] PROBLEM - dhclient process on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:23:25] RECOVERY - DPKG on mw1005 is OK: All packages OK
[05:23:35] PROBLEM - RAID on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:24:35] PROBLEM - nutcracker port on mw1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:24:54] PROBLEM - nutcracker process on mw1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:24:54] PROBLEM - salt-minion processes on mw1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:24:54] PROBLEM - salt-minion processes on mw1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:24:54] RECOVERY - configured eth on mw1015 is OK: OK - interfaces up
[05:24:55] PROBLEM - dhclient process on mw1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:24:55] PROBLEM - SSH on mw1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:25:04] PROBLEM - configured eth on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:25:05] RECOVERY - DPKG on mw1015 is OK: All packages OK
[05:25:05] PROBLEM - DPKG on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:25:34] PROBLEM - Disk space on mw1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:27:14] RECOVERY - salt-minion processes on mw1016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[05:27:14] RECOVERY - nutcracker process on mw1016 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[05:27:35] RECOVERY - Disk space on mw1007 is OK: DISK OK
[05:28:15] PROBLEM - nutcracker port on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:28:15] PROBLEM - salt-minion processes on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:28:35] PROBLEM - RAID on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:28:44] PROBLEM - Disk space on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:28:45] PROBLEM - dhclient process on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:28:55] PROBLEM - SSH on mw1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:29:15] PROBLEM - salt-minion processes on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:29:34] PROBLEM - puppet last run on mw1009 is CRITICAL: CRITICAL: puppet fail
[05:29:35] PROBLEM - puppet last run on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:30:14] RECOVERY - nutcracker port on mw1007 is OK: TCP OK - 0.000 second response time on port 11212
[05:30:15] PROBLEM - configured eth on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:30:24] PROBLEM - SSH on mw1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:30:24] RECOVERY - salt-minion processes on mw1011 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[05:30:55] PROBLEM - nutcracker process on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:31:04] PROBLEM - nutcracker port on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:31:25] RECOVERY - salt-minion processes on mw1004 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[05:31:35] PROBLEM - DPKG on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:31:35] PROBLEM - salt-minion processes on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:31:54] RECOVERY - dhclient process on mw1011 is OK: PROCS OK: 0 processes with command name dhclient
[05:31:55] PROBLEM - DPKG on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:32:24] PROBLEM - Disk space on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:32:35] PROBLEM - nutcracker process on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:32:35] PROBLEM - salt-minion processes on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:33:15] PROBLEM - nutcracker process on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:33:16] PROBLEM - nutcracker port on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:33:24] PROBLEM - Disk space on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:33:25] PROBLEM - configured eth on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:33:44] PROBLEM - nutcracker process on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:33:44] PROBLEM - salt-minion processes on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:34:04] PROBLEM - Disk space on mw1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:34:15] PROBLEM - configured eth on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:34:24] PROBLEM - Disk space on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:34:35] PROBLEM - dhclient process on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:34:36] RECOVERY - nutcracker port on mw1011 is OK: TCP OK - 0.000 second response time on port 11212
[05:34:44] PROBLEM - SSH on mw1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:34:45] PROBLEM - configured eth on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:34:55] PROBLEM - DPKG on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:35:04] PROBLEM - DPKG on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:35:04] RECOVERY - Disk space on mw1011 is OK: DISK OK
[05:35:05] RECOVERY - dhclient process on mw1005 is OK: PROCS OK: 0 processes with command name dhclient
[05:35:14] RECOVERY - nutcracker port on mw1005 is OK: TCP OK - 0.000 second response time on port 11212
[05:35:15] RECOVERY - nutcracker process on mw1011 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[05:35:15] PROBLEM - puppet last run on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:35:24] PROBLEM - RAID on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:35:34] PROBLEM - RAID on mw1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:35:35] PROBLEM - nutcracker process on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:35:55] PROBLEM - salt-minion processes on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:36:04] PROBLEM - nutcracker port on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:36:25] RECOVERY - Disk space on mw1005 is OK: DISK OK
[05:36:34] PROBLEM - nutcracker port on mw1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:36:45] RECOVERY - salt-minion processes on mw1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[05:37:15] RECOVERY - nutcracker port on mw1015 is OK: TCP OK - 0.000 second response time on port 11212
[05:37:15] RECOVERY - nutcracker process on mw1015 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[05:37:34] RECOVERY - RAID on mw1009 is OK: OK: no RAID installed
[05:37:35] PROBLEM - nutcracker port on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:37:36] PROBLEM - dhclient process on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:37:44] PROBLEM - nutcracker process on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:38:06] PROBLEM - dhclient process on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:38:55] RECOVERY - configured eth on mw1004 is OK: OK - interfaces up
[05:39:14] RECOVERY - RAID on mw1004 is OK: OK: no RAID installed
[05:39:15] RECOVERY - dhclient process on mw1007 is OK: PROCS OK: 0 processes with command name dhclient
[05:40:44] RECOVERY - nutcracker port on mw1007 is OK: TCP OK - 0.000 second response time on port 11212
[05:41:05] RECOVERY - DPKG on mw1006 is OK: All packages OK
[05:41:24] PROBLEM - Disk space on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:41:34] RECOVERY - SSH on mw1004 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0)
[05:41:34] PROBLEM - nutcracker process on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:41:35] RECOVERY - nutcracker port on mw1004 is OK: TCP OK - 0.000 second response time on port 11212
[05:41:44] RECOVERY - dhclient process on mw1004 is OK: PROCS OK: 0 processes with command name dhclient
[05:41:44] RECOVERY - nutcracker process on mw1004 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[05:42:45] PROBLEM - Disk space on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:42:55] PROBLEM - nutcracker port on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:42:56] PROBLEM - salt-minion processes on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:43:04] PROBLEM - salt-minion processes on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:43:25] PROBLEM - nutcracker process on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:43:25] PROBLEM - nutcracker port on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:43:26] PROBLEM - dhclient process on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:43:44] PROBLEM - nutcracker port on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:44:15] RECOVERY - Disk space on mw1007 is OK: DISK OK
[05:44:24] RECOVERY - SSH on mw1007 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0)
[05:45:04] RECOVERY - salt-minion processes on mw1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[05:45:05] RECOVERY - nutcracker process on mw1005 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[05:45:16] RECOVERY - configured eth on mw1007 is OK: OK - interfaces up
[05:45:24] RECOVERY - nutcracker process on mw1007 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[05:45:24] RECOVERY - DPKG on mw1007 is OK: All packages OK
[05:45:24] PROBLEM - RAID on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:45:34] RECOVERY - dhclient process on mw1005 is OK: PROCS OK: 0 processes with command name dhclient
[05:45:34] RECOVERY - nutcracker port on mw1015 is OK: TCP OK - 0.000 second response time on port 11212
[05:45:44] RECOVERY - nutcracker port on mw1005 is OK: TCP OK - 0.000 second response time on port 11212
[05:45:45] RECOVERY - salt-minion processes on mw1007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[05:45:54] PROBLEM - RAID on mw1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:46:25] PROBLEM - SSH on mw1009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:46:54] PROBLEM - DPKG on mw1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:47:24] PROBLEM - DPKG on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:47:35] PROBLEM - configured eth on mw1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:47:46] RECOVERY - RAID on mw1007 is OK: OK: no RAID installed
[05:48:06] PROBLEM - salt-minion processes on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:48:25] RECOVERY - dhclient process on mw1011 is OK: PROCS OK: 0 processes with command name dhclient
[05:49:05] RECOVERY - nutcracker port on mw1011 is OK: TCP OK - 0.000 second response time on port 11212
[05:49:05] RECOVERY - salt-minion processes on mw1011 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[05:49:34] RECOVERY - Disk space on mw1011 is OK: DISK OK
[05:49:44] RECOVERY - configured eth on mw1009 is OK: OK - interfaces up
[05:49:44] RECOVERY - nutcracker process on mw1011 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[05:50:06] RECOVERY - configured eth on mw1011 is OK: OK - interfaces up
[05:50:14] RECOVERY - DPKG on mw1011 is OK: All packages OK
[05:50:34] RECOVERY - DPKG on mw1005 is OK: All packages OK
[05:51:14] RECOVERY - dhclient process on mw1015 is OK: PROCS OK: 0 processes with command name dhclient
[05:51:15] PROBLEM - salt-minion processes on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:51:15] PROBLEM - nutcracker process on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:51:24] PROBLEM - configured eth on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:51:34] RECOVERY - puppet last run on mw1007 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[05:51:44] RECOVERY - nutcracker process on mw1015 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[05:51:56] PROBLEM - nutcracker port on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:52:05] RECOVERY - SSH on mw1011 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0)
[05:52:24] RECOVERY - DPKG on mw1015 is OK: All packages OK
[05:52:24] RECOVERY - salt-minion processes on mw1015 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[05:53:04] RECOVERY - Disk space on mw1006 is OK: DISK OK
[05:53:04] PROBLEM - dhclient process on mw1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:53:04] RECOVERY - SSH on mw1015 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0)
[05:53:45] PROBLEM - dhclient process on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:54:14] PROBLEM - nutcracker port on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:54:15] PROBLEM - dhclient process on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:54:15] PROBLEM - nutcracker process on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:55:25] PROBLEM - Disk space on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:55:25] PROBLEM - nutcracker process on mw1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:55:35] PROBLEM - nutcracker port on mw1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:55:36] PROBLEM - Disk space on mw1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:56:45] PROBLEM - DPKG on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:57:34] PROBLEM - dhclient process on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:57:54] PROBLEM - RAID on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:57:55] PROBLEM - nutcracker port on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:58:05] PROBLEM - SSH on mw1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:58:15] RECOVERY - nutcracker port on mw1004 is OK: TCP OK - 0.000 second response time on port 11212
[05:58:16] RECOVERY - dhclient process on mw1004 is OK: PROCS OK: 0 processes with command name dhclient
[05:58:16] RECOVERY - nutcracker process on mw1004 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[05:58:25] RECOVERY - salt-minion processes on mw1004 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[05:58:25] PROBLEM - SSH on mw1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:58:26] PROBLEM - configured eth on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:58:35] PROBLEM - DPKG on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:58:35] PROBLEM - DPKG on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:58:36] PROBLEM - salt-minion processes on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:59:15] PROBLEM - Disk space on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:59:25] PROBLEM - SSH on mw1015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:59:44] PROBLEM - NTP on mw1014 is CRITICAL: NTP CRITICAL: No response from NTP server
[06:01:54] PROBLEM - salt-minion processes on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:01:55] PROBLEM - nutcracker port on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:01:55] RECOVERY - NTP on mw1014 is OK: NTP OK: Offset 0.0006107091904 secs
[06:02:05] RECOVERY - nutcracker port on mw1014 is OK: TCP OK - 0.000 second response time on port 11212
[06:02:15] PROBLEM - nutcracker process on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:02:15] PROBLEM - configured eth on mw1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:02:24] RECOVERY - puppet last run on mw1014 is OK: OK: Puppet is currently enabled, last run 1 hour ago with 0 failures
[06:02:25] PROBLEM - salt-minion processes on mw1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:02:25] RECOVERY - nutcracker process on mw1014 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[06:02:26] RECOVERY - salt-minion processes on mw1014 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[06:02:26] RECOVERY - dhclient process on mw1014 is OK: PROCS OK: 0 processes with command name dhclient
[06:02:35] RECOVERY - Disk space on mw1015 is OK: DISK OK
[06:02:35] RECOVERY - configured eth on mw1015 is OK: OK - interfaces up
[06:02:55] RECOVERY - salt-minion processes on mw1015 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[06:03:05] RECOVERY - Disk space on mw1014 is OK: DISK OK
[06:03:25] RECOVERY - configured eth on mw1014 is OK: OK - interfaces up
[06:03:34] RECOVERY - SSH on mw1014 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0)
[06:03:34] RECOVERY - DPKG on mw1014 is OK: All packages OK
[06:03:46] RECOVERY - RAID on mw1014 is OK: OK: no RAID installed
[06:03:54] RECOVERY - salt-minion processes on mw1011 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[06:03:54] RECOVERY - nutcracker port on mw1011 is OK: TCP OK - 0.000 second response time on port 11212
[06:03:54] RECOVERY - dhclient process on mw1015 is OK: PROCS OK: 0 processes with command name dhclient
[06:04:15] RECOVERY - nutcracker process on mw1015 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[06:04:15] RECOVERY - nutcracker port on mw1015 is OK: TCP OK - 0.000 second response time on port 11212
[06:04:45] PROBLEM - nutcracker port on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:04:46] PROBLEM - dhclient process on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:04:54] PROBLEM - nutcracker process on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:04:55] PROBLEM - salt-minion processes on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:04:55] RECOVERY - DPKG on mw1015 is OK: All packages OK
[06:05:44] PROBLEM - NTP on mw1016 is CRITICAL: NTP CRITICAL: No response from NTP server
[06:06:16] PROBLEM - puppet last run on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:06:44] PROBLEM - SSH on mw1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:07:14] PROBLEM - puppet last run on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:07:56] RECOVERY - Disk space on mw1004 is OK: DISK OK
[06:08:04] RECOVERY - configured eth on mw1004 is OK: OK - interfaces up
[06:08:04] PROBLEM - Disk space on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:08:05] PROBLEM - configured eth on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:08:24] RECOVERY - DPKG on mw1004 is OK: All packages OK
[06:08:44] PROBLEM - dhclient process on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:08:46] PROBLEM - RAID on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:08:54] RECOVERY - nutcracker port on mw1004 is OK: TCP OK - 0.000 second response time on port 11212
[06:08:55] RECOVERY - dhclient process on mw1004 is OK: PROCS OK: 0 processes with command name dhclient
[06:09:04] RECOVERY - nutcracker process on mw1004 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[06:09:04] RECOVERY - salt-minion processes on mw1004 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[06:09:05] PROBLEM - DPKG on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:09:44] RECOVERY - RAID on mw1015 is OK: OK: no RAID installed
[06:09:45] RECOVERY - SSH on mw1015 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0)
[06:10:14] PROBLEM - salt-minion processes on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:10:14] PROBLEM - nutcracker port on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:10:44] PROBLEM - nutcracker process on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:11:34] PROBLEM - dhclient process on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:12:04] RECOVERY - dhclient process on mw1006 is OK: PROCS OK: 0 processes with command name dhclient
[06:12:15] RECOVERY - puppet last run on mw1015 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[06:12:54] PROBLEM - SSH on mw1008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:14:15] PROBLEM - Disk space on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:14:24] PROBLEM - salt-minion processes on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:14:24] PROBLEM - configured eth on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:14:35] PROBLEM - DPKG on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:14:35] PROBLEM - Disk space on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:14:36] PROBLEM - puppet last run on mw2042 is CRITICAL: CRITICAL: puppet fail
[06:14:45] PROBLEM - configured eth on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:14:45] PROBLEM - dhclient process on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:14:45] PROBLEM - DPKG on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:15:05] PROBLEM - nutcracker port on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:15:05] PROBLEM - nutcracker port on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:15:14] PROBLEM - nutcracker process on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:15:14] PROBLEM - dhclient process on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:15:15] PROBLEM - nutcracker process on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:15:15] PROBLEM - salt-minion processes on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:15:55] PROBLEM - nutcracker process on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:16:04] PROBLEM - nutcracker port on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:16:24] RECOVERY - salt-minion processes on mw1012 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[06:16:35] RECOVERY - dhclient process on mw1005 is OK: PROCS OK: 0 processes with command name dhclient
[06:16:44] RECOVERY - Disk space on mw1008 is OK: DISK OK
[06:16:55] RECOVERY - nutcracker port on mw1005 is OK: TCP OK - 0.000 second response time on port 11212
[06:17:35] RECOVERY - DPKG on mw1005 is OK: All packages OK
[06:18:04] RECOVERY - Disk space on mw1005 is OK: DISK OK
[06:18:04] RECOVERY - configured eth on mw1005 is OK: OK - interfaces up
[06:18:15] RECOVERY - SSH on mw1005 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0)
[06:18:16] PROBLEM - dhclient process on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:18:16] RECOVERY - salt-minion processes on mw1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[06:18:16] RECOVERY - RAID on mw1005 is OK: OK: no RAID installed
[06:18:16] RECOVERY - nutcracker process on mw1005 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[06:18:25] PROBLEM - salt-minion processes on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:18:45] PROBLEM - Disk space on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:19:14] RECOVERY - nutcracker port on mw1012 is OK: TCP OK - 0.000 second response time on port 11212
[06:19:14] RECOVERY - nutcracker process on mw1012 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[06:19:25] RECOVERY - DPKG on mw1012 is OK: All packages OK
[06:20:24] RECOVERY - Disk space on mw1012 is OK: DISK OK
[06:21:25] RECOVERY - puppet last run on mw1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:21:45] RECOVERY - dhclient process on mw1011 is OK: PROCS OK: 0 processes with command name dhclient
[06:22:34] RECOVERY - Disk space on mw1004 is OK: DISK OK
[06:22:46] RECOVERY - DPKG on mw1004 is OK: All packages OK
[06:22:55] PROBLEM - Disk space on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:23:05] RECOVERY - SSH on mw1004 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0)
[06:23:06] RECOVERY - dhclient process on mw1012 is OK: PROCS OK: 0 processes with command name dhclient
[06:23:24] PROBLEM - NTP on mw1006 is CRITICAL: NTP CRITICAL: No response from NTP server
[06:23:24] RECOVERY - nutcracker port on mw1004 is OK: TCP OK - 0.000 second response time on port 11212
[06:23:24] RECOVERY - dhclient process on mw1004 is OK: PROCS OK: 0 processes with command name dhclient
[06:23:34] RECOVERY - salt-minion processes on mw1004 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[06:23:34] RECOVERY - nutcracker process on mw1004 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[06:24:34] RECOVERY - salt-minion processes on mw1011 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[06:24:35] RECOVERY - nutcracker port on mw1011 is OK: TCP OK - 0.000 second response time on port 11212
[06:25:05] RECOVERY - dhclient process on mw1008 is OK: PROCS OK: 0 processes with command name dhclient
[06:25:44] PROBLEM - DPKG on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:28:54] PROBLEM - Disk space on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:29:14] PROBLEM - DPKG on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:29:24] PROBLEM - SSH on mw1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:29:26] PROBLEM - dhclient process on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:29:45] PROBLEM - dhclient process on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:29:46] PROBLEM - nutcracker process on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:30:15] PROBLEM - dhclient process on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:31:04] RECOVERY - Disk space on mw1012 is OK: DISK OK
[06:31:04] RECOVERY - configured eth on mw1012 is OK: OK - interfaces up
[06:31:04] PROBLEM - nutcracker port on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:31:05] PROBLEM - Disk space on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:31:05] PROBLEM - salt-minion processes on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:31:15] RECOVERY - RAID on mw1012 is OK: OK: no RAID installed
[06:31:25] PROBLEM - puppet last run on db2055 is CRITICAL: CRITICAL: puppet fail
[06:31:34] RECOVERY - dhclient process on mw1012 is OK: PROCS OK: 0 processes with command name dhclient
[06:31:34] PROBLEM - puppet last run on mc1017 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:35] PROBLEM - dhclient process on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:31:35] RECOVERY - SSH on mw1012 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0)
[06:31:45] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:31:54] RECOVERY - nutcracker process on mw1012 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[06:31:54] PROBLEM - puppet last run on mw2021 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:56] RECOVERY - DPKG on mw1012 is OK: All packages OK
[06:32:04] PROBLEM - nutcracker process on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:32:05] RECOVERY - puppet last run on mw1012 is OK: OK: Puppet is currently enabled, last run 55 minutes ago with 0 failures
[06:32:15] PROBLEM - puppet last run on mw1158 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:24] PROBLEM - puppet last run on mw2208 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:35] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:35] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:33:04] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:05] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:14] RECOVERY - salt-minion processes on mw1008 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[06:33:34] RECOVERY - DPKG on mw1004 is OK: All packages OK
[06:33:35] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 4 failures
[06:33:45] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:33:56] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:34:34] RECOVERY - dhclient process on mw1011 is OK: PROCS OK: 0 processes with command name dhclient
[06:35:15] RECOVERY - nutcracker port on mw1011 is OK: TCP OK - 0.000 second response time on port 11212
[06:35:15] RECOVERY - salt-minion processes on mw1011 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[06:35:35] RECOVERY - Disk space on mw1011 is OK: DISK OK
[06:35:45] RECOVERY - nutcracker process on mw1011 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[06:38:14] RECOVERY - dhclient process on mw1004 is OK: PROCS OK: 0 processes with command name dhclient
[06:38:16] RECOVERY - nutcracker process on mw1004 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[06:39:25] PROBLEM - salt-minion processes on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:39:45] PROBLEM - DPKG on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:41:25] PROBLEM - nutcracker port on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:41:45] RECOVERY - puppet last run on mw2042 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[06:41:45] PROBLEM - Disk space on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:41:56] PROBLEM - nutcracker process on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:42:44] PROBLEM - cassandra service on restbase1001 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
[06:42:44] PROBLEM - cassandra CQL 10.64.0.220:9042 on restbase1001 is CRITICAL: Connection refused
[06:44:25] PROBLEM - nutcracker port on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:44:26] PROBLEM - dhclient process on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:44:45] RECOVERY - cassandra service on restbase1001 is OK: OK - cassandra is active
[06:46:54] RECOVERY - cassandra CQL 10.64.0.220:9042 on restbase1001 is OK: TCP OK - 0.006 second response time on port 9042
[06:47:45] PROBLEM - salt-minion processes on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:48:35] RECOVERY - dhclient process on mw1004 is OK: PROCS OK: 0 processes with command name dhclient
[06:49:05] PROBLEM - dhclient process on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:52:15] RECOVERY - SSH on mw1004 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0)
[06:52:35] RECOVERY - NTP on mw1006 is OK: NTP OK: Offset -0.0602003336 secs
[06:53:05] RECOVERY - nutcracker port on mw1006 is OK: TCP OK - 0.000 second response time on port 11212
[06:53:34] RECOVERY - Disk space on mw1006 is OK: DISK OK
[06:53:35] RECOVERY - dhclient process on mw1006 is OK: PROCS OK: 0 processes with command name dhclient
[06:54:14] RECOVERY - Disk space on mw1011 is OK: DISK OK
[06:54:44] PROBLEM - dhclient process on mw1004 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:54:45] PROBLEM - salt-minion processes on mw1004 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:54:45] PROBLEM - nutcracker process on mw1004 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:56:24] RECOVERY - puppet last run on mc1017 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[06:56:44] RECOVERY - puppet last run on mw2021 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[06:57:15] RECOVERY - puppet last run on mw2208 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:25] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[06:57:25] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[06:57:54] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:55] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[06:58:04] RECOVERY - nutcracker port on mw1011 is OK: TCP OK - 0.000 second response time on port 11212
[06:58:25] RECOVERY - puppet last run on db2055 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:58:26] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:35] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:36] RECOVERY - nutcracker process on mw1011 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[06:58:45] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:58:46] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:59:04] RECOVERY - dhclient process on mw1013 is OK: PROCS OK: 0 processes with command name dhclient
[06:59:06] RECOVERY - puppet last run on mw1158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:59:15] RECOVERY - nutcracker process on mw1013 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[06:59:16] RECOVERY - RAID on mw1013 is OK: OK: no RAID installed
[06:59:24] RECOVERY - nutcracker port on mw1013 is OK: TCP OK - 0.000 second response time on port 11212
[06:59:24] RECOVERY - puppet last run on mw1013 is OK: OK: Puppet is currently enabled, last run 2 hours ago with 0 failures
[06:59:26] PROBLEM - nutcracker port on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:59:34] RECOVERY - dhclient process on mw1011 is OK: PROCS OK: 0 processes with command name dhclient
[06:59:56] PROBLEM - dhclient process on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:00:14] RECOVERY - salt-minion processes on mw1011 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[07:00:24] RECOVERY - configured eth on mw1013 is OK: OK - interfaces up
[07:00:24] RECOVERY - Disk space on mw1013 is OK: DISK OK
[07:00:24] RECOVERY - salt-minion processes on mw1013 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[07:00:24] RECOVERY - SSH on mw1013 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0)
[07:00:24] RECOVERY - DPKG on mw1013 is OK: All packages OK
[07:04:26] PROBLEM - nutcracker port on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:04:45] RECOVERY - DPKG on mw1004 is OK: All packages OK
[07:05:04] PROBLEM - SSH on mw1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:06:05] RECOVERY - configured eth on mw1006 is OK: OK - interfaces up
[07:06:14] RECOVERY - dhclient process on mw1006 is OK: PROCS OK: 0 processes with command name dhclient
[07:06:34] RECOVERY - SSH on mw1006 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0)
[07:06:35] RECOVERY - DPKG on mw1006 is OK: All packages OK
[07:06:36] PROBLEM - salt-minion processes on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:06:55] PROBLEM - Disk space on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:06:55] RECOVERY - puppet last run on mw1006 is OK: OK: Puppet is currently enabled, last run 2 hours ago with 0 failures
[07:07:15] RECOVERY - RAID on mw1006 is OK: OK: no RAID installed
[07:07:34] RECOVERY - nutcracker process on mw1006 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[07:07:44] RECOVERY - salt-minion processes on mw1006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[07:07:54] RECOVERY - nutcracker port on mw1006 is OK: TCP OK - 0.000 second response time on port 11212
[07:08:04] PROBLEM - dhclient process on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:09:44] RECOVERY - salt-minion processes on mw1004 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[07:10:55] PROBLEM - DPKG on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:11:15] PROBLEM - nutcracker process on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:12:25] RECOVERY - NTP on mw1016 is OK: NTP OK: Offset -0.3438969851 secs [07:12:54] RECOVERY - Disk space on mw1004 is OK: DISK OK [07:12:54] RECOVERY - configured eth on mw1004 is OK: OK - interfaces up [07:12:55] RECOVERY - DPKG on mw1004 is OK: All packages OK [07:13:04] RECOVERY - RAID on mw1004 is OK: OK: no RAID installed [07:13:14] RECOVERY - SSH on mw1004 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [07:13:45] RECOVERY - dhclient process on mw1004 is OK: PROCS OK: 0 processes with command name dhclient [07:14:54] RECOVERY - salt-minion processes on mw1011 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [07:15:04] RECOVERY - Disk space on mw1011 is OK: DISK OK [07:15:16] RECOVERY - nutcracker process on mw1011 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [07:15:55] PROBLEM - salt-minion processes on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:15:55] RECOVERY - configured eth on mw1011 is OK: OK - interfaces up [07:15:56] RECOVERY - DPKG on mw1011 is OK: All packages OK [07:16:15] RECOVERY - dhclient process on mw1011 is OK: PROCS OK: 0 processes with command name dhclient [07:17:38] <_joe_> !log restarting HHVM on a few jobrunners [07:17:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:19:05] PROBLEM - configured eth on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:19:15] PROBLEM - RAID on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:19:15] PROBLEM - DPKG on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:19:16] PROBLEM - puppet last run on mw1169 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:19:34] PROBLEM - SSH on mw1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:19:56] RECOVERY - salt-minion processes on mw1004 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [07:19:56] RECOVERY - nutcracker process on mw1004 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [07:21:04] RECOVERY - configured eth on mw1004 is OK: OK - interfaces up [07:21:14] RECOVERY - DPKG on mw1004 is OK: All packages OK [07:21:14] RECOVERY - RAID on mw1004 is OK: OK: no RAID installed [07:21:25] RECOVERY - SSH on mw1004 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [07:21:35] PROBLEM - nutcracker process on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:21:45] RECOVERY - nutcracker port on mw1004 is OK: TCP OK - 0.000 second response time on port 11212 [07:22:14] PROBLEM - configured eth on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:22:14] PROBLEM - DPKG on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:22:15] RECOVERY - puppet last run on mw1004 is OK: OK: Puppet is currently enabled, last run 2 hours ago with 0 failures [07:22:25] PROBLEM - dhclient process on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[07:23:05] RECOVERY - configured eth on mw1016 is OK: OK - interfaces up [07:23:44] RECOVERY - nutcracker process on mw1011 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [07:23:55] RECOVERY - nutcracker port on mw1016 is OK: TCP OK - 0.000 second response time on port 11212 [07:23:55] RECOVERY - Disk space on mw1016 is OK: DISK OK [07:23:55] RECOVERY - dhclient process on mw1016 is OK: PROCS OK: 0 processes with command name dhclient [07:23:55] RECOVERY - DPKG on mw1016 is OK: All packages OK [07:23:56] RECOVERY - RAID on mw1016 is OK: OK: no RAID installed [07:24:05] RECOVERY - salt-minion processes on mw1016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [07:24:05] RECOVERY - nutcracker process on mw1016 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [07:24:24] RECOVERY - puppet last run on mw1016 is OK: OK: Puppet is currently enabled, last run 2 hours ago with 0 failures [07:24:45] RECOVERY - SSH on mw1016 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [07:29:35] PROBLEM - Disk space on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:29:55] PROBLEM - nutcracker process on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:33:45] PROBLEM - salt-minion processes on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:35:14] RECOVERY - nutcracker port on mw1008 is OK: TCP OK - 0.000 second response time on port 11212 [07:35:14] RECOVERY - nutcracker process on mw1008 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [07:35:34] RECOVERY - salt-minion processes on mw1008 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [07:35:54] RECOVERY - Disk space on mw1008 is OK: DISK OK [07:35:54] RECOVERY - puppet last run on mw1008 is OK: OK: Puppet is currently enabled, last run 1 hour ago with 0 failures [07:36:06] RECOVERY - dhclient process on mw1008 is OK: PROCS OK: 0 processes with command name dhclient [07:36:06] RECOVERY - configured eth on mw1008 is OK: OK - interfaces up [07:36:06] RECOVERY - DPKG on mw1008 is OK: All packages OK [07:36:25] RECOVERY - SSH on mw1008 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [07:36:34] RECOVERY - RAID on mw1008 is OK: OK: no RAID installed [07:36:56] RECOVERY - SSH on mw1009 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [07:37:15] RECOVERY - dhclient process on mw1009 is OK: PROCS OK: 0 processes with command name dhclient [07:37:15] RECOVERY - DPKG on mw1009 is OK: All packages OK [07:37:45] RECOVERY - nutcracker process on mw1009 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [07:37:54] RECOVERY - Disk space on mw1009 is OK: DISK OK [07:37:54] RECOVERY - nutcracker port on mw1009 is OK: TCP OK - 0.000 second response time on port 11212 [07:37:55] RECOVERY - configured eth on mw1009 is OK: OK - interfaces up [07:38:15] RECOVERY - salt-minion processes on mw1009 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [07:38:35] RECOVERY - RAID on mw1009 is OK: OK: no RAID installed [07:41:06] RECOVERY - puppet last run on mw1009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:41:25] PROBLEM - NTP on mw1011 is CRITICAL: NTP CRITICAL: No response from NTP server [07:44:15] RECOVERY - Disk space on mw1011 is OK: DISK OK [07:50:25] PROBLEM - Disk space on mw1011 is CRITICAL: CHECK_NRPE: Socket 
timeout after 10 seconds. [08:13:20] (PS1) Subramanya Sastry: WIP: Make testreduce generic and instantiate parsoid-rt services [puppet] - https://gerrit.wikimedia.org/r/263322 (https://phabricator.wikimedia.org/T118778) [08:14:15] (CR) jenkins-bot: [V: -1] WIP: Make testreduce generic and instantiate parsoid-rt services [puppet] - https://gerrit.wikimedia.org/r/263322 (https://phabricator.wikimedia.org/T118778) (owner: Subramanya Sastry) [08:34:01] (PS21) Giuseppe Lavagetto: etcd: auth puppetization [puppet] - https://gerrit.wikimedia.org/r/255155 (https://phabricator.wikimedia.org/T97972) [08:35:35] RECOVERY - RAID on mw1011 is OK: OK: no RAID installed [08:35:54] RECOVERY - nutcracker port on mw1011 is OK: TCP OK - 0.000 second response time on port 11212 [08:36:05] RECOVERY - salt-minion processes on mw1011 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [08:36:14] RECOVERY - Disk space on mw1011 is OK: DISK OK [08:36:35] RECOVERY - nutcracker process on mw1011 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [08:37:06] RECOVERY - SSH on mw1011 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [08:37:14] RECOVERY - DPKG on mw1011 is OK: All packages OK [08:37:14] RECOVERY - configured eth on mw1011 is OK: OK - interfaces up [08:37:26] RECOVERY - dhclient process on mw1011 is OK: PROCS OK: 0 processes with command name dhclient [08:39:15] RECOVERY - puppet last run on mw1011 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:39:44] RECOVERY - NTP on mw1011 is OK: NTP OK: Offset -0.008176326752 secs [08:48:36] (CR) Giuseppe Lavagetto: [C: 2] etcd.py: remove unused local variable 'e' [debs/pybal] - https://gerrit.wikimedia.org/r/263022 (owner: Ema) [08:48:50] <_joe_> ema: your first patch is merged :P [08:49:56] (Merged) jenkins-bot: etcd.py: remove unused local variable 'e' [debs/pybal] - https://gerrit.wikimedia.org/r/263022 (owner: Ema) [08:49:58] \o/ [08:50:21] * ema contributes fundamental changes from day 1 [08:58:55] (CR) Giuseppe Lavagetto: [C: -1] "While the code seems nice in general, I would love us to use some local scap3 command instead of needing to reproduce its functionality in" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/262742 (owner: Alexandros Kosiaris) [08:59:47] (CR) Giuseppe Lavagetto: [C: 2] Use a more useful error message when DB connection fails [software/dbtree] - https://gerrit.wikimedia.org/r/251791 (owner: Alex Monk) [08:59:59] (CR) Giuseppe Lavagetto: [V: 2] Use a more useful error message when DB connection fails [software/dbtree] - https://gerrit.wikimedia.org/r/251791 (owner: Alex Monk) [09:21:33] Quick question: [09:21:33] Imagine I wrote MediaWiki code that ran a query like "DELETE FROM table WHERE table_col = 'abc'" where there are say 20,000 rows matched and table_col isn't the primary key [09:21:43] Would that cause bad effects?
[09:23:05] <_joe_> tto: it depends, ofc [09:23:14] <_joe_> but that looks like a bad query anyways [09:23:15] PROBLEM - puppet last run on mw2094 is CRITICAL: CRITICAL: puppet fail [09:23:35] _joe_: Chunking deletes looks quite difficult, our DB abstraction layer seems to lack support for it [09:23:36] <_joe_> first of all, you need to delete in batches, using LIMIT [09:24:06] <_joe_> tto: I'm pretty sure it's not that hard, but I don't use mw's ORM a lot [09:24:44] Only MySQL seems to have proper support for it [09:24:52] <_joe_> yes [09:25:00] How big would you make the chunks? [09:25:10] <_joe_> we run mysql in production, I guessed this was a prod-related question [09:25:14] <_joe_> was I mistaken? [09:25:20] Yes, but tests use sqlite etc [09:25:31] And we theoretically support other DBs, don't we? [09:25:36] But yes, it's a production question :) [09:25:51] <_joe_> well, a delete without a limit is a ticking bomb as far as prod is concerned [09:26:17] operations, Beta-Cluster-Infrastructure, HHVM: Convert work machines (tin, terbium) to Trusty and hhvm usage - https://phabricator.wikimedia.org/T87036#1923927 (ori) This task will be one year old this Friday. [09:26:44] <_joe_> not using an index can be a problem too, depending on how big the table is [09:26:56] <_joe_> the type of storage engine used, etc [09:27:03] On enwiki the change_tag table has 15.6 million rows [09:27:28] We already run unchunked deletes of up to 5000 rows on that table, without seeming to cause bad effects [09:27:54] <_joe_> the fact we don't have bad effects now doesn't mean it's a good idea in general [09:28:02] <_joe_> brb [09:28:06] sure, thanks :) [09:28:44] <_joe_> but if you want a more authoritative opinion, you'd have to wait for jynus to be back next week I guess [09:35:20] tto: select ids with a limit, and then delete in a loop [09:36:03] legoktm: That would be tantamount to deleting one row at a time though, wouldn't that make the DB go under too? [09:36:20] er, delete all the ids you select in one command [09:36:39] Trouble is, the change_tag table doesn't have a primary key [09:36:58] ah...that seems like something to fix first :) [09:37:05] Sigh... [09:37:16] I think there's a bug somewhere for tables missing primary keys? [09:37:26] There is indeed. [09:37:31] I'll go find it [09:50:25] RECOVERY - puppet last run on mw2094 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [09:56:34] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [09:56:45] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge.
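A minimal sketch of the batched-delete pattern _joe_ and legoktm describe above, assuming MySQL (which, as noted, supports LIMIT on DELETE) and reusing the placeholder names from tto's question; this is an illustration, not MediaWiki's actual code, and since change_tag has no primary key the LIMIT form, rather than select-ids-then-delete, is the one that would apply to it:

    -- One unbounded delete touches all ~20,000 matching rows in a
    -- single long transaction (the "ticking bomb" above):
    --   DELETE FROM table WHERE table_col = 'abc';
    -- Deleting in bounded batches keeps each transaction short:
    DELETE FROM table WHERE table_col = 'abc' LIMIT 1000;
    -- Repeat from application code while ROW_COUNT() > 0,
    -- pausing between batches so replication can catch up.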
[10:00:05] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 625 [10:00:33] ACKNOWLEDGEMENT - DPKG on etherpad1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages Giuseppe Lavagetto The nodejs package is on hold [10:11:45] PROBLEM - puppet last run on mw2025 is CRITICAL: CRITICAL: puppet fail [10:20:14] RECOVERY - check_mysql on db1008 is OK: Uptime: 1791802 Threads: 3 Questions: 43127452 Slow queries: 18409 Opens: 60042 Flush tables: 2 Open tables: 416 Queries per second avg: 24.069 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 33 [10:27:35] PROBLEM - puppet last run on cp3019 is CRITICAL: CRITICAL: puppet fail [10:30:14] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 634 [10:30:34] PROBLEM - Hadoop NodeManager on analytics1039 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [10:32:35] RECOVERY - Hadoop NodeManager on analytics1039 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [10:35:14] RECOVERY - check_mysql on db1008 is OK: Uptime: 1792702 Threads: 3 Questions: 43136262 Slow queries: 18427 Opens: 60042 Flush tables: 2 Open tables: 416 Queries per second avg: 24.062 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [10:36:30] <_joe_> !log updating nodejs on restbase-test2002 [10:36:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:37:35] RECOVERY - DPKG on restbase-test2002 is OK: All packages OK [10:41:05] RECOVERY - puppet last run on mw2025 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [10:41:27] operations, Continuous-Integration-Infrastructure, HHVM: HHVM Jenkins job throw: Unable to set CoreFileSize to 8589934592: Operation not permitted (1) - https://phabricator.wikimedia.org/T78799#1923965 (hashar) Open>Resolved a:hashar Does not occur anymore. We always had `hvm.debug.core_dump_... [10:44:38] ACKNOWLEDGEMENT - DPKG on ruthenium is CRITICAL: DPKG CRITICAL dpkg reports broken packages Giuseppe Lavagetto nodejs is currently on hold.
[10:45:35] RECOVERY - puppet last run on restbase-test2002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:49:24] RECOVERY - salt-minion processes on sca1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [10:50:04] RECOVERY - salt-minion processes on sca1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [10:54:24] ACKNOWLEDGEMENT - DPKG on maps-test2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages Giuseppe Lavagetto nodejs is on hold [10:54:55] RECOVERY - puppet last run on cp3019 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:14:09] <_joe_> !log upgrading etcd to 2.2.1 in production [11:14:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:28:41] (PS1) Muehlenhoff: Update to 3.19.8-ckt12 [debs/linux] - https://gerrit.wikimedia.org/r/263337 [11:53:55] (CR) Muehlenhoff: [C: 2 V: 2] Update to 3.19.8-ckt12 [debs/linux] - https://gerrit.wikimedia.org/r/263337 (owner: Muehlenhoff) [12:34:29] (PS1) Pmlineditor: Added Simple English Wikipedia as import source for English Wikiversity [mediawiki-config] - https://gerrit.wikimedia.org/r/263341 (https://phabricator.wikimedia.org/T123212) [12:38:27] operations, ops-codfw: ms-be2007 - System halted!Error: Integrated RAID - https://phabricator.wikimedia.org/T122844#1922844 (fgiunchedi) spoke with @papaul on friday, controller will likely need to be replaced [12:49:04] (PS1) Mdann52: Localisation Babel categories on nap.wikipedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/263342 (https://phabricator.wikimedia.org/T123188) [13:08:43] (PS4) Mdann52: Add 2 sites to $wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/262893 (https://phabricator.wikimedia.org/T122995) [13:40:40] (PS1) Filippo Giunchedi: admin: remove gwicke from {parsoid,restbase,cassandra-test}-roots [puppet] - https://gerrit.wikimedia.org/r/263346 [13:52:35] (PS2) Filippo Giunchedi: admin: remove gwicke from {parsoid,restbase}-roots [puppet] - https://gerrit.wikimedia.org/r/263346 [14:01:06] (CR) Nikerabbit: [C: -1] Localisation Babel categories on nap.wikipedia.org (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/263342 (https://phabricator.wikimedia.org/T123188) (owner: Mdann52) [14:02:38] (CR) Mark Bergsma: [C: 2] admin: remove gwicke from {parsoid,restbase}-roots [puppet] - https://gerrit.wikimedia.org/r/263346 (owner: Filippo Giunchedi) [14:18:20] (PS1) Filippo Giunchedi: admin: introduce restbase-admins group [puppet] - https://gerrit.wikimedia.org/r/263349 [14:22:04] (CR) Mark Bergsma: [C: 2] admin: introduce restbase-admins group [puppet] - https://gerrit.wikimedia.org/r/263349 (owner: Filippo Giunchedi) [14:24:24] PROBLEM - puppet last run on restbase1006 is CRITICAL: CRITICAL: Puppet has 1 failures [14:26:34] RECOVERY - puppet last run on restbase1006 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [14:46:22] (CR) Alex Monk: Send email to project admins if puppet runs are failing.
(1 comment) [puppet] - https://gerrit.wikimedia.org/r/262856 (https://phabricator.wikimedia.org/T121773) (owner: Andrew Bogott) [15:04:18] operations: setup/deploy auth1001(WMF4576) as eqiad auth system - https://phabricator.wikimedia.org/T121655#1924168 (MoritzMuehlenhoff) a:MoritzMuehlenhoff>Cmjohnson [15:04:50] operations, ops-codfw: rack/setup/deploy auth2001 as codfw auth system - https://phabricator.wikimedia.org/T120263#1924171 (MoritzMuehlenhoff) I gave the YubiHSM to Papaul at the allhands, reassigning. [15:05:01] operations, ops-codfw: rack/setup/deploy auth2001 as codfw auth system - https://phabricator.wikimedia.org/T120263#1924172 (MoritzMuehlenhoff) a:MoritzMuehlenhoff>Papaul [15:05:30] !log repool restbase1004 in pybal, fully bootstrapped and running latest code [15:05:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:18:36] (PS1) Hoo man: Set NS_MODULE as a Wikidata client NS on Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/263354 (https://phabricator.wikimedia.org/T123234) [15:19:05] Did we go down after the Bowie news hit? [15:20:03] nope [15:21:27] Nice. [15:21:27] (PS2) Hoo man: Set NS_MODULE as a Wikibase client NS on Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/263354 (https://phabricator.wikimedia.org/T123234) [15:36:56] (CR) Muehlenhoff: [C: -1] "Yes, paged searches will return all pages, but are internally performed in chunks of the page size (1000 here). My patch needs more work, " [puppet] - https://gerrit.wikimedia.org/r/262745 (owner: Muehlenhoff) [15:39:04] PROBLEM - puppet last run on mw2066 is CRITICAL: CRITICAL: puppet fail [15:46:02] operations: onboarding Emanuele Rocca - https://phabricator.wikimedia.org/T123089#1924234 (MoritzMuehlenhoff) I've added Emanuele to pwstore (based on the username he intends to use). [15:53:45] SWATATAT [15:54:10] BREAK ALL THE THINGS [15:54:18] :) [15:57:11] * James_F waves. [15:58:13] hey [15:59:10] I was going to demo SWAT process to the releng folks here in the CQ lobby, so I'll grab SWAT if that's ok. [15:59:36] sure [15:59:48] cool, thanks :) [15:59:49] thcipriani: The office is open. ;-P [15:59:57] don't all of the releng people know how already? [16:00:04] anomie ostriches thcipriani marktraceur Krenair: Respected human, time to deploy Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160111T1600). Please do the needful. [16:00:05] James_F: A patch you scheduled for Morning SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [16:00:10] * James_F sits here, alone. [16:00:12] James_F: I made it down the stairs, that's all I had in me. [16:00:34] (CR) Thcipriani: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/262352 (https://phabricator.wikimedia.org/T122808) (owner: Shanmugamp7) [16:00:59] (Merged) jenkins-bot: Add enwiki as transwiki import source for ta.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/262352 (https://phabricator.wikimedia.org/T122808) (owner: Shanmugamp7) [16:01:02] * James_F laughs.
[16:01:02] (CR) 20after4: [C: 2] Set wgLocaltimezone for orwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/260745 (https://phabricator.wikimedia.org/T122273) (owner: Reedy) [16:01:40] (CR) 20after4: [C: 2] Tidy robots.txt [mediawiki-config] - https://gerrit.wikimedia.org/r/240065 (https://phabricator.wikimedia.org/T104251) (owner: Mdann52) [16:01:43] (Merged) jenkins-bot: Set wgLocaltimezone for orwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/260745 (https://phabricator.wikimedia.org/T122273) (owner: Reedy) [16:02:20] (Merged) jenkins-bot: Tidy robots.txt [mediawiki-config] - https://gerrit.wikimedia.org/r/240065 (https://phabricator.wikimedia.org/T104251) (owner: Mdann52) [16:02:26] (CR) 20after4: "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/240065 (https://phabricator.wikimedia.org/T104251) (owner: Mdann52) [16:03:50] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Add enwiki as transwiki import source for ta.wikipedia [[gerrit:262352]] (duration: 00m 33s) [16:03:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:03:56] ^ James_F check please [16:04:41] * James_F looks. [16:05:56] thcipriani: orwiki one is good. [16:06:00] kk [16:06:12] tawiki one is good. [16:06:23] You've not synced robots.txt yet, right? [16:07:39] right [16:08:18] OK, ready for the next batch. [16:08:18] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Set wgLocaltimezone for orwiki [[gerrit:260745]] (duration: 00m 29s) [16:08:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:08:35] RECOVERY - puppet last run on mw2066 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:09:43] !log thcipriani@tin Synchronized robots.txt: SWAT: Tidy robots.txt [[gerrit:240065]] (duration: 00m 30s) [16:09:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:09:50] ^ James_F check please. [16:11:45] Yup, robots.txt looks good. [16:11:56] :) [16:12:03] (CR) Nemo bis: "This should not have been merged. My comment has not been addressed and this code is broken. On the bright side, that means it won't have " [mediawiki-config] - https://gerrit.wikimedia.org/r/240065 (https://phabricator.wikimedia.org/T104251) (owner: Mdann52) [16:12:20] james_f must be a robot....lookin at robots.txt [16:12:39] (PS2) Mdann52: Localisation of Babel categories on nap.wikipedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/263342 (https://phabricator.wikimedia.org/T123188) [16:12:41] * twentyafterfour is lookin at fatalmonitor like a boss [16:12:46] (CR) Thcipriani: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/261902 (https://phabricator.wikimedia.org/T122732) (owner: Base) [16:12:57] (e.g. being useless) [16:13:29] (Merged) jenkins-bot: Added noindex rule for uawikimedia's user namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/261902 (https://phabricator.wikimedia.org/T122732) (owner: Base) [16:13:31] (CR) Jforrester: "> this code is broken" [mediawiki-config] - https://gerrit.wikimedia.org/r/240065 (https://phabricator.wikimedia.org/T104251) (owner: Mdann52) [16:14:02] (CR) Nemo bis: "Explanations were already provided in October."
[mediawiki-config] - https://gerrit.wikimedia.org/r/240065 (https://phabricator.wikimedia.org/T104251) (owner: Mdann52) [16:16:13] (CR) Jforrester: "> Explanations were already provided in October." [mediawiki-config] - https://gerrit.wikimedia.org/r/240065 (https://phabricator.wikimedia.org/T104251) (owner: Mdann52) [16:16:17] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Added noindex rule for uawikimedia user namespace [[gerrit:261902]] (duration: 00m 30s) [16:16:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:16:26] ^ James_F check please [16:17:54] (CR) Mdann52: "It won't do anything, no - however, the patch only mentions updating the user-agent, not what else it needed to do. It's up-to-date, I did" [mediawiki-config] - https://gerrit.wikimedia.org/r/240065 (https://phabricator.wikimedia.org/T104251) (owner: Mdann52) [16:18:11] (PS3) Mdann52: Localisation of Babel categories on nap.wikipedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/263342 (https://phabricator.wikimedia.org/T123188) [16:19:39] thcipriani: Yeah, uawikimedia LGTM. [16:20:04] James_F: cool, thanks. [16:20:13] (CR) Thcipriani: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/261869 (https://phabricator.wikimedia.org/T122710) (owner: Luke081515) [16:20:31] (PS1) Nemo bis: Remove overeager unrequested /wiki/User: robots.txt rule [mediawiki-config] - https://gerrit.wikimedia.org/r/263360 [16:20:56] (Merged) jenkins-bot: Changed user group rights at trwikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/261869 (https://phabricator.wikimedia.org/T122710) (owner: Luke081515) [16:22:05] operations: setup/deploy auth1001(WMF4576) as eqiad auth system - https://phabricator.wikimedia.org/T121655#1924289 (Cmjohnson) [16:22:07] operations, ops-eqiad: update physical label for auth1001(WMF4576) - https://phabricator.wikimedia.org/T121703#1924288 (Cmjohnson) Open>Resolved [16:22:54] (CR) Jforrester: [C: -1] "Oh, this is what you were talking about? Clarity helps. Yeah, I missed this in reviewing the inter-version diffs, sorry." [mediawiki-config] - https://gerrit.wikimedia.org/r/263360 (owner: Nemo bis) [16:22:57] (CR) Mdann52: "Scratch what I said, that was an old version. This should work as intended" [mediawiki-config] - https://gerrit.wikimedia.org/r/240065 (https://phabricator.wikimedia.org/T104251) (owner: Mdann52) [16:23:28] operations: setup/deploy auth1001(WMF4576) as eqiad auth system - https://phabricator.wikimedia.org/T121655#1924290 (Cmjohnson) a:Cmjohnson>MoritzMuehlenhoff attached the yubikey to auth1001. Assigning to @Moritz Dmesg below [2237965.223626] usb 1-1.2: new full-speed USB device number 5 using ehci-pci... [16:23:46] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Changed user group rights at trwikiquote [[gerrit:261869]] (duration: 00m 30s) [16:23:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:23:54] Blocked-on-Operations, operations, RESTBase, Services: Switch RESTBase to use Node.js 4.2 - https://phabricator.wikimedia.org/T107762#1924292 (GWicke) GC behavior continues to look better for 4.2 than 0.10: {F3218939} Peak memory usage is higher than with 0.10, but collections generally seem to ha... [16:23:55] ^ James_F check please. [16:24:51] (CR) Nemo bis: "If you think that this is easier to get merged by removing Benutzer too, sure."
[mediawiki-config] - https://gerrit.wikimedia.org/r/263360 (owner: Nemo bis) [16:24:51] thcipriani: Yup, trwikiquote LGTM. [16:25:02] awesome, thanks. [16:25:17] thcipriani: Looks like we may want to modify robots.txt a little more, sorry. [16:25:19] (CR) Thcipriani: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/257868 (https://phabricator.wikimedia.org/T120568) (owner: Glaisher) [16:25:39] James_F: watching the discussion, no problem. [16:26:03] (Merged) jenkins-bot: Enable global AbuseFilter at French Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/257868 (https://phabricator.wikimedia.org/T120568) (owner: Glaisher) [16:26:22] (CR) Nemo bis: "Mdann52, indeed it should not have been uncommented without good reasons and documentation. For now I sent a followup for the part which I" [mediawiki-config] - https://gerrit.wikimedia.org/r/240065 (https://phabricator.wikimedia.org/T104251) (owner: Mdann52) [16:27:39] (CR) Mdann52: "Related here : https://phabricator.wikimedia.org/T7582 (original bug where this was implemented)" [mediawiki-config] - https://gerrit.wikimedia.org/r/263360 (owner: Nemo bis) [16:27:43] operations, ops-eqiad: ms-be1013 drac is not reachable via ssh - https://phabricator.wikimedia.org/T123086#1924299 (fgiunchedi) @cmjohnson, works for me, ping me on irc before doing it and I'll shut the machine [16:27:57] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable global AubseFilter at French Wikipedia [[gerrit:257868]] (duration: 00m 29s) [16:28:00] ^ James_F check please [16:28:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:28:03] godog: go ahead and shut it down if you can [16:28:11] (ms-be10130 [16:28:24] (CR) Nemo bis: "Do you mean "this" as in the removal of the rule? That's marked declined." [mediawiki-config] - https://gerrit.wikimedia.org/r/263360 (owner: Nemo bis) [16:29:20] thcipriani: frwiki LGTM. [16:29:39] operations, ops-eqiad: db1052 degraded RAID - https://phabricator.wikimedia.org/T122703#1924301 (Cmjohnson) Replaced the disk and it's in Firmware state: Rebuild [16:29:40] kk [16:29:48] Quite a lot of "can't connect to mysql" errors in logstash [16:29:53] (PS2) Nemo bis: Remove overeager unrequested /wiki/User: robots.txt rule [mediawiki-config] - https://gerrit.wikimedia.org/r/263360 [16:30:01] well, only 1 per minute, but more than usual? [16:30:43] (CR) Thcipriani: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/260964 (https://phabricator.wikimedia.org/T122433) (owner: Luke081515) [16:30:57] !log halt ms-be1013, required to reset idrac [16:31:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:31:18] cmjohnson1: ack! should be down in a couple of mins [16:31:36] (Merged) jenkins-bot: dewikibooks: Set $wgRestrictDisplayTitle to false [mediawiki-config] - https://gerrit.wikimedia.org/r/260964 (https://phabricator.wikimedia.org/T122433) (owner: Luke081515) [16:32:11] James_F: btw, thanks for sign up my changes [16:32:19] Luke081515: Thanks for doing them!
[16:32:24] :) [16:33:24] (03PS3) 10Mdann52: Remove overeager unrequested /wiki/User: robots.txt rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/263360 (owner: 10Nemo bis) [16:34:35] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: dewikibooks: Set $wgRestrictDisplayTitle to false [[gerrit:260964]] (duration: 00m 30s) [16:34:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:34:39] ^ James_F check please [16:36:07] thcipriani: Yup, dewikibooks working. [16:36:39] cool, thanks for checking. [16:36:43] (03CR) 10Nemo bis: "Why did you revert my PS2?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/263360 (owner: 10Nemo bis) [16:37:20] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255810 (https://phabricator.wikimedia.org/T119787) (owner: 10Mdann52) [16:37:27] 6operations, 10RESTBase, 6Services: Provide production jessie image with node 4.2; use this for service-runner build command - https://phabricator.wikimedia.org/T123237#1924329 (10GWicke) 3NEW [16:37:50] (03PS4) 10Nemo bis: Remove overeager unrequested /wiki/User: robots.txt rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/263360 [16:38:03] (03Merged) 10jenkins-bot: Enable new user groups on gu.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255810 (https://phabricator.wikimedia.org/T119787) (owner: 10Mdann52) [16:38:05] (03PS1) 10Muehlenhoff: Ensure unique uidNumbers with slapo-overlay [puppet] - 10https://gerrit.wikimedia.org/r/263363 (https://phabricator.wikimedia.org/T122665) [16:39:24] (03CR) 10Jforrester: [C: 04-1] "Patch still impacts Beuntzer: pages" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/263360 (owner: 10Nemo bis) [16:40:38] (03CR) 10Kelson: [C: 031] Add 2 sites to $wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/262893 (https://phabricator.wikimedia.org/T122995) (owner: 10Mdann52) [16:40:39] 6operations, 10RESTBase, 6Services: Provide production jessie image with node 4.2; use this for service-runner build command - https://phabricator.wikimedia.org/T123237#1924342 (10GWicke) [16:41:18] (03PS5) 10Nemo bis: Remove overeager unrequested /wiki/User: robots.txt rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/263360 [16:41:20] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable new user groups on gu.wikipedia.org [[gerrit:255810]] (duration: 00m 30s) [16:41:27] ^ James_F check please [16:41:54] (03CR) 10Jforrester: [C: 031] Remove overeager unrequested /wiki/User: robots.txt rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/263360 (owner: 10Nemo bis) [16:41:59] thcipriani: Can you throw in https://gerrit.wikimedia.org/r/#/c/263360/ too? [16:42:16] James_F: yup, np. [16:43:17] thcipriani: guwiki LGTM. [16:43:23] kk, thanks. [16:43:32] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/263360 (owner: 10Nemo bis) [16:45:03] 6operations, 10Beta-Cluster-Infrastructure, 7HHVM: Convert work machines (tin, terbium) to Trusty and hhvm usage - https://phabricator.wikimedia.org/T87036#1924346 (10demon) >>! In T87036#1923927, @ori wrote: > This task will be one year old this Friday. 
#newyearnewme [16:47:43] (Merged) jenkins-bot: Remove overeager unrequested /wiki/User: robots.txt rule [mediawiki-config] - https://gerrit.wikimedia.org/r/263360 (owner: Nemo bis) [16:48:17] ostriches: lol [16:48:29] :) [16:48:58] <_joe_> eheh [16:49:21] !log thcipriani@tin Synchronized robots.txt: SWAT: Remove overager unrequested /wiki/User: robots.txt rule [[gerrit:263360]] (duration: 00m 30s) [16:49:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:49:28] ^ James_F sync'd! [16:49:28] (CR) Filippo Giunchedi: [C: -1] "minor nit in the comments, can't really comment on the code since my ruby-fu isn't strong" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/138292 (owner: Ori.livneh) [16:49:37] headed into the office now. [16:49:46] Cool. [16:54:14] thcipriani: From the SAL: "Enable global AubseFilter at French Wikipedia gerrit:257868 (duration: 00m 29s)" [16:54:20] Should I correct it? [16:54:54] Luke081515: Correct what? [16:55:21] (What should it say?) [16:55:45] "Enable global AubseFilter" looks like wrong spelling [16:55:57] abuse [16:56:15] Oh, right. Yeah, don't worry. [16:56:15] correct it to abuse? [16:56:20] Leave it. [16:56:31] ok [16:58:06] (PS5) GWicke: Text VCL: Fix up logged-in users caching [puppet] - https://gerrit.wikimedia.org/r/259882 (owner: BBlack) [16:58:34] (PS1) Aude: Add Wikibase-labs.php and Wikibase-production.php to noc [mediawiki-config] - https://gerrit.wikimedia.org/r/263370 [16:58:52] !log installing OS on db2033 [16:58:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:59:23] (CR) Reedy: [C: -1] "You should add them to createTxtFileSymlinks.sh too" [mediawiki-config] - https://gerrit.wikimedia.org/r/263370 (owner: Aude) [16:59:36] * aude glares at James_F for taking all the swat slots for today + tomorrow :/ [17:00:28] aude: Yeah… [17:00:39] aude: Just deploy like I do [17:00:40] * Reedy grins [17:00:47] aude, isn't it mostly stuff that should've happened weeks ago, but didn't because of the freeze? [17:01:05] (PS2) Aude: Add Wikibase-labs.php and Wikibase-production.php to noc [mediawiki-config] - https://gerrit.wikimedia.org/r/263370 [17:01:15] Krenair: yeah [17:01:26] so you should expect swat to be completely full after such freezes [17:01:55] operations, ops-eqiad: ms-be1013 drac is not reachable via ssh - https://phabricator.wikimedia.org/T123086#1924390 (Cmjohnson) Open>Resolved a:Cmjohnson The problem has been resolved. [17:02:18] Maybe we should expand them a bit to cater for the backlog somewhat [17:02:35] These are things that won't break the site [17:02:46] hopefully [17:02:58] (CR) Hoo man: [C: 1] "Many users still use noc.wm.o, thus we should keep it up to date." [mediawiki-config] - https://gerrit.wikimedia.org/r/263370 (owner: Aude) [17:03:03] Most are minor config stuff [17:03:04] (CR) Andrew Bogott: Send email to project admins if puppet runs are failing. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/262856 (https://phabricator.wikimedia.org/T121773) (owner: Andrew Bogott) [17:04:14] (PS6) Andrew Bogott: Send email to project admins if puppet runs are failing.
[puppet] - https://gerrit.wikimedia.org/r/262856 (https://phabricator.wikimedia.org/T121773) [17:05:05] (CR) Aude: [C: 1] Set NS_MODULE as a Wikibase client NS on Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/263354 (https://phabricator.wikimedia.org/T123234) (owner: Hoo man) [17:08:58] James_F: Thanks for getting that AF patch deployed today :) [17:09:19] (CR) Andrew Bogott: [C: 1] Ensure unique uidNumbers with slapo-overlay [puppet] - https://gerrit.wikimedia.org/r/263363 (https://phabricator.wikimedia.org/T122665) (owner: Muehlenhoff) [17:16:09] !log db2033 - signing puppet certs, salt-key, initial run [17:16:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:18:33] Glaisher|away: Happy to help. [17:19:30] (PS1) Ema: admin: add myself (ema) to users [puppet] - https://gerrit.wikimedia.org/r/263378 [17:22:17] operations, ops-codfw: rack/setup/deploy auth2001 as codfw auth system - https://phabricator.wikimedia.org/T120263#1924494 (Papaul) a:Papaul>MoritzMuehlenhoff The YubiHSM is plugged in the server. [17:22:41] (PS3) Tim Landscheidt: Tools: Source python-socketio-client for Trusty from backports [puppet] - https://gerrit.wikimedia.org/r/238662 (https://phabricator.wikimedia.org/T91874) [17:23:36] (CR) jenkins-bot: [V: -1] Tools: Source python-socketio-client for Trusty from backports [puppet] - https://gerrit.wikimedia.org/r/238662 (https://phabricator.wikimedia.org/T91874) (owner: Tim Landscheidt) [17:24:56] operations, RESTBase, Services: Provide production jessie image with node 4.2; use this for service-runner build command - https://phabricator.wikimedia.org/T123237#1924514 (Pchelolo) @GWicke We already support for alternative node versions in docker build script. See T114399 (resolved by [[ https://gi... [17:26:12] Ops-Access-Requests, operations: Create new puppet group `discovery-analytics-deploy` - https://phabricator.wikimedia.org/T122620#1924516 (EBernhardson) with holidays and all staff finally complete i wanted to see how we can move this forward [17:29:07] (PS2) Ema: admin: add myself (ema) to users [puppet] - https://gerrit.wikimedia.org/r/263378 [17:29:18] !log Updated Wikidata's property suggester with data from today's json dump [17:29:19] operations, ops-codfw: setup/install/deploy db2033 - https://phabricator.wikimedia.org/T122998#1924543 (Papaul) [17:29:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:29:44] operations, ops-codfw: setup/install/deploy db2033 - https://phabricator.wikimedia.org/T122998#1924548 (Papaul) a:Papaul>jcrespo Install complete [17:31:44] (CR) Dzahn: "when creating production shell users we usually look at the existing labs/wikitech/LDAP user and use the same UID for the production user."
[puppet] - https://gerrit.wikimedia.org/r/263378 (owner: Ema) [17:32:05] (PS1) Tim Landscheidt: apt: Remove extra space in sources.list [puppet] - https://gerrit.wikimedia.org/r/263380 [17:35:22] (PS4) Tim Landscheidt: Tools: Source python-socketio-client for Trusty from backports [puppet] - https://gerrit.wikimedia.org/r/238662 (https://phabricator.wikimedia.org/T91874) [17:36:38] (PS3) Ema: admin: add myself (ema) to users [puppet] - https://gerrit.wikimedia.org/r/263378 [17:37:05] PROBLEM - puppet last run on seaborgium is CRITICAL: CRITICAL: puppet fail [17:38:46] PROBLEM - puppet last run on db1047 is CRITICAL: CRITICAL: Puppet has 1 failures [17:39:36] nuria: hi, do you think we can close https://phabricator.wikimedia.org/T122524 ? amire80 says it works for him but was it the right host? [17:40:44] Ops-Access-Requests, operations, Analytics, ContentTranslation-Analytics, and 2 others: access for amire80 to stat1002.eqiad.wmnet - https://phabricator.wikimedia.org/T122524#1924610 (Nuria) I think either 1003 or 1002 work to access analytics slaves thus this ticket can be closed if things are w... [17:40:55] mutante: yes, either host will work [17:41:39] operations: root shell for ema - https://phabricator.wikimedia.org/T123252#1924616 (Dzahn) NEW [17:41:56] operations: gerrit user and +2 on ops repos for ema - https://phabricator.wikimedia.org/T123253#1924622 (Dzahn) NEW [17:41:58] (CR) Filippo Giunchedi: "LGTM and the puppet compiler shows no changes, https://puppet-compiler.wmflabs.org/1570/ and I'm assuming it doesn't affect hiera lookups?" [puppet] - https://gerrit.wikimedia.org/r/260610 (owner: Dzahn) [17:42:02] nuria: cool, thank you [17:42:08] <_joe_> mutante: I was teaching ema how to do things [17:42:10] <_joe_> :P [17:42:43] _joe_: :) cool, i'm just making tickets for onboarding like back on RT to cover all the things [17:42:53] we didnt have a template or something yet for phab [17:43:58] operations: ldap/ops membership for ema - https://phabricator.wikimedia.org/T123253#1924638 (Krenair) [17:44:21] operations: ldap/ops membership for ema - https://phabricator.wikimedia.org/T123253#1924622 (Krenair) gerrit user account comes from labs, just need ldap/ops membership in LDAP to get +2 in ops repos [17:44:48] operations: add ema to WMF-NDA and ops phabricator groups - https://phabricator.wikimedia.org/T123254#1924651 (Dzahn) NEW [17:45:22] (PS4) Ema: admin: add myself (ema) to users [puppet] - https://gerrit.wikimedia.org/r/263378 (https://phabricator.wikimedia.org/T123252) [17:45:46] operations, Traffic: Increase request limits for GETs to /api/rest_v1/ - https://phabricator.wikimedia.org/T118365#1924670 (GWicke) At the dev summit, @Bianjiang of Google voiced concerns about global request rate limits & the complexity of abiding to those across several projects / teams. The 50 req/s li...
[17:45:56] operations: add ema to ops mail aliases (exim) - https://phabricator.wikimedia.org/T123255#1924671 (Dzahn) NEW [17:46:10] operations: add ema to ops mailing lists - https://phabricator.wikimedia.org/T123256#1924677 (Dzahn) NEW [17:46:53] operations: add ema to icinga (contact / paging) - https://phabricator.wikimedia.org/T123257#1924686 (Dzahn) NEW [17:49:36] RECOVERY - puppet last run on seaborgium is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [17:50:35] operations: onboarding Emanuele Rocca - https://phabricator.wikimedia.org/T123089#1924697 (Dzahn) [17:51:32] Ops-Access-Requests, operations: onboarding Emanuele Rocca - https://phabricator.wikimedia.org/T123089#1920803 (Dzahn) [17:51:44] operations, Patch-For-Review: root shell for ema - https://phabricator.wikimedia.org/T123252#1924702 (Joe) We need the following: - @mark's approval - @ema's signature on https://phabricator.wikimedia.org/L3 [17:53:37] operations, Patch-For-Review: root shell for ema - https://phabricator.wikimedia.org/T123252#1924706 (mark) Approved. [17:54:54] PROBLEM - mediawiki originals uploads -hourly- for codfw-prod on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [3000.0] [17:55:38] Ops-Access-Requests, operations, Patch-For-Review: root shell for ema - https://phabricator.wikimedia.org/T123252#1924710 (Joe) [18:00:33] (PS2) Dzahn: partman: replace tabs with spaces [puppet] - https://gerrit.wikimedia.org/r/263144 (owner: Papaul) [18:00:53] (CR) jenkins-bot: [V: -1] partman: replace tabs with spaces [puppet] - https://gerrit.wikimedia.org/r/263144 (owner: Papaul) [18:01:13] operations, LDAP-Access-Requests: ldap/ops membership for ema - https://phabricator.wikimedia.org/T123253#1924718 (Krenair) [18:01:15] RECOVERY - mediawiki originals uploads -hourly- for codfw-prod on graphite1001 is OK: OK: Less than 80.00% above the threshold [2000.0] [18:01:48] (CR) Dzahn: "(as opposed to puppet manifests) partman config might actually need the tabs, but not sure" [puppet] - https://gerrit.wikimedia.org/r/263144 (owner: Papaul) [18:03:35] Ops-Access-Requests, operations, Patch-For-Review: root shell for ema - https://phabricator.wikimedia.org/T123252#1924764 (Joe) a:Joe [18:04:05] RECOVERY - puppet last run on db1047 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [18:04:06] Ops-Access-Requests, operations, Patch-For-Review: root shell for ema - https://phabricator.wikimedia.org/T123252#1924616 (Joe) Since @ema already signed the document, I will proceed to grant him access. [18:04:16] (CR) Papaul: "yes like others partman files. We can abandon this change for now."
[puppet] - https://gerrit.wikimedia.org/r/263144 (owner: Papaul) [18:04:18] (CR) Giuseppe Lavagetto: [C: 2] admin: add myself (ema) to users [puppet] - https://gerrit.wikimedia.org/r/263378 (https://phabricator.wikimedia.org/T123252) (owner: Ema) [18:05:23] (CR) Dzahn: "ok, sorry about the confusion, it's still true what i said about :retab for all the .pp files" [puppet] - https://gerrit.wikimedia.org/r/263144 (owner: Papaul) [18:05:36] (Abandoned) Dzahn: partman: replace tabs with spaces [puppet] - https://gerrit.wikimedia.org/r/263144 (owner: Papaul) [18:09:22] (PS1) Rush: diamond: nfsd statistics [puppet] - https://gerrit.wikimedia.org/r/263394 [18:09:25] operations, RESTBase, Services: Provide production jessie image with node 4.2; use this for service-runner build command - https://phabricator.wikimedia.org/T123237#1924790 (mobrovac) >>! In T123237#1924514, @Pchelolo wrote: > We already support alternative node versions in docker build script. +1. Wh... [18:10:14] PROBLEM - check_puppetrun on payments2002 is CRITICAL: CRITICAL: puppet fail [18:15:14] RECOVERY - check_puppetrun on payments2002 is OK: OK: Puppet is currently enabled, last run 160 seconds ago with 0 failures [18:15:14] PROBLEM - check_puppetrun on payments2003 is CRITICAL: CRITICAL: puppet fail [18:15:15] (CR) Jdlrobson: [C: 1] Enable WikidataPageBanner extension on Ukrainian Wikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/261994 (https://phabricator.wikimedia.org/T121999) (owner: RLuts) [18:15:20] operations, WMF-NDA-Requests: add ema to WMF-NDA and ops phabricator groups - https://phabricator.wikimedia.org/T123254#1924799 (Krenair) [18:15:38] operations, WMF-NDA-Requests: add ema to WMF-NDA and ops phabricator groups - https://phabricator.wikimedia.org/T123254#1924651 (Krenair) #acl*operations-team?
[18:23:59] (03CR) 10Alex Monk: [C: 031] Enable WikidataPageBanner extension on Ukrainian Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/261994 (https://phabricator.wikimedia.org/T121999) (owner: 10RLuts) [18:24:01] !log tendril updating ssl cert on neon, https may flap for a second (this is on neon, so icinga https portal may also flap) [18:24:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, RobH [18:24:31] (03CR) 10RobH: [C: 032] new tendril.wikimedia.org certificate (renewal replacement) [puppet] - 10https://gerrit.wikimedia.org/r/260784 (owner: 10RobH) [18:28:01] 6operations, 10RESTBase, 6Services: Provide production jessie image with node 4.2; use this for service-runner build command - https://phabricator.wikimedia.org/T123237#1924863 (10mobrovac) >>! In T123237#1924848, @GWicke wrote: > My concern with using the nvm version is that it opens up some chance of binar... [18:30:30] 6operations, 10RESTBase, 6Services: Provide production jessie image with node 4.2; use this for service-runner build command - https://phabricator.wikimedia.org/T123237#1924864 (10GWicke) > 4.2.0 is marked as stable in nvm so I doubt that this will present a real issue in practice. I agree that it's unlikel... [18:30:53] !log Restarting HHVM on all job runners, to vacate memory now that the cause of the leak appears to have subsided.(T122069) [18:30:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:36:54] !log tendril cert updated and neon returned to normal service [18:36:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, RobH [18:37:35] 6operations, 7HTTPS: ssl certificate replacement: tendril.wikimedia.org (expires 2016-02-15) - https://phabricator.wikimedia.org/T122319#1924880 (10RobH) 5Open>3Resolved The new cert/keypair is now live. [18:38:46] ori: was that a fix for the mem issues yesterday? [18:39:11] 6operations, 7HTTPS: ssl certificate replacement: tendril.wikimedia.org (expires 2016-02-15) - https://phabricator.wikimedia.org/T122319#1924883 (10RobH) also i've added it to tracking gcal for its march 2017 expiry. [18:39:17] or whatever was killing jobqueue [18:40:21] myrcx: not a fix yet, sadly. [18:40:59] ori: any ideas? Timeouts everywhere yesterday :/ [18:41:57] myrcx: what do you mean? where did you encounter timeouts? 
[18:43:24] No no, sorry I meant on the logs here, mw1015/1008 RAID timeouts etc [18:44:53] myrcx: so, two graphs to tell the story thus far: [18:45:23] mw116* job runners memory usage, last 14 days: http://graphite.wikimedia.org/render/?width=969&height=556&_salt=1452537911.289&from=-14days&target=servers.mw116*.memory.Active&lineMode=connected [18:45:44] same, last 12 hours: http://graphite.wikimedia.org/render/?width=969&height=556&_salt=1452537934.487&from=-12hours&target=servers.mw116*.memory.Active&lineMode=connected [18:46:02] (PS8) Thcipriani: Puppet provider for scap3 [puppet] - https://gerrit.wikimedia.org/r/262742 (https://phabricator.wikimedia.org/T113072) (owner: Alexandros Kosiaris) [18:46:15] there were no code deployments during the past two weeks, and yet the situation got progressively worse, and then suddenly appears to have subsided [18:46:54] weird :/ 1166 sat at a steady 50GB [18:47:00] if the problem can come and go without code changes, it must be related to the state of the queue [18:47:07] the queue size jumped though [18:47:10] which makes me suspect that a particular job type is responsible [18:48:02] there was SWAT 2h before the drop though [18:48:29] oh, was there? I missed that. Let me take a look at what went out, and see if it is plausibly related. [18:48:38] don't normally look at https://grafana.wikimedia.org/dashboard/db/job-queue-health but that's a pretty steady climb - is that normal? [18:48:40] The PHP code is not exactly the same, maybe HHVM rebuilt something [18:49:01] The period you mentioned also coincides with the previous SWAT [18:49:16] (2015-12-21) [18:49:30] well maybe not :) didn't check other PHP changes [18:49:41] I think you might be on to something, Nemo_bis [18:50:56] i'll write up my thoughts in a comment on https://phabricator.wikimedia.org/T122069 [18:51:41] be interesting to hear what you think ori :) [18:52:56] !log rt.w.o cert expired and its replacement will be later today (rt is internal ops only tool) [18:53:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, RobH [18:56:05] PROBLEM - mediawiki originals uploads -hourly- for codfw-prod on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [3000.0] [18:56:15] Nemo_bis: two hours before which drop?
memory growth flattened at 11-Jan-2016 06:26 UTC AFAICT [18:59:20] operations, Analytics-Kanban, HTTPS: EventLogging sees too few distinct client IPs {oryx} [8 pts] - https://phabricator.wikimedia.org/T119144#1924970 (madhuvishy) [18:59:22] operations, Analytics-Kanban, HTTPS: EventLogging sees too few distinct client IPs [8] - https://phabricator.wikimedia.org/T119144#1924971 (Nuria) [19:06:01] ori: dunno, I see many lines going down at 18:30 in https://graphite.wikimedia.org/render/?width=969&height=556&_salt=1452537934.487&from=-12hours&target=servers.mw116*.memory.Active&lineMode=connected [19:08:17] operations, Analytics-Kanban, HTTPS: EventLogging sees too few distinct client IPs {oryx} [8 pts] - https://phabricator.wikimedia.org/T119144#1924985 (madhuvishy) [19:20:23] (CR) Krinkle: Make MediaWiki treat $lang of be_x_oldwiki as be-tarask, just don't change the real DB name (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/236966 (https://phabricator.wikimedia.org/T111853) (owner: Alex Monk) [19:27:31] (PS2) Rush: diamond: nfsd statistics [puppet] - https://gerrit.wikimedia.org/r/263394 [19:29:00] (CR) Andrew Bogott: [C: 1] diamond: nfsd statistics [puppet] - https://gerrit.wikimedia.org/r/263394 (owner: Rush) [19:29:36] (CR) Rush: [C: 2] diamond: nfsd statistics [puppet] - https://gerrit.wikimedia.org/r/263394 (owner: Rush) [19:30:23] operations, Performance-Team, Wikimedia-General-or-Unknown, Patch-For-Review: jobrunner memory leaks - https://phabricator.wikimedia.org/T122069#1925046 (ori) == What we know == * The rate of memory growth after restarting HHVM has been increasing over the past two weeks: {F3219173} * Memory grow... [19:31:05] PROBLEM - puppet last run on labstore1001 is CRITICAL: CRITICAL: Puppet last ran 3 days ago [19:32:25] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1001 is OK: OK - create-dbusers is active [19:32:41] ori: +1 - thanks for the update :) [19:33:15] RECOVERY - puppet last run on labstore1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:36:28] (PS1) RobH: rt.wikimedia.org new certificate [puppet] - https://gerrit.wikimedia.org/r/263403 [19:36:50] (CR) RobH: [C: 2] rt.wikimedia.org new certificate [puppet] - https://gerrit.wikimedia.org/r/263403 (owner: RobH) [19:40:25] RECOVERY - mediawiki originals uploads -hourly- for codfw-prod on graphite1001 is OK: OK: Less than 80.00% above the threshold [2000.0] [19:41:45] RECOVERY - HTTPS on magnesium is OK: SSL OK - Certificate rt.wikimedia.org valid until 2017-01-11 19:26:07 +0000 (expires in 365 days) [19:46:12] operations, Mail: Exim Alias Remove - ask - https://phabricator.wikimedia.org/T123274#1925073 (JKrauska) NEW [19:47:52] operations, Mail: Exim Alias Remove - ask - https://phabricator.wikimedia.org/T123274#1925082 (JKrauska) [19:47:54] operations, Mail: Move most (all?)
exim personal aliases to OIT - https://phabricator.wikimedia.org/T122144#1925081 (JKrauska) [19:48:03] operations, Mail: Exim Alias Remove - ask - https://phabricator.wikimedia.org/T123274#1925073 (JKrauska) [19:49:07] operations, Mail: Exim Alias Remove - ask - https://phabricator.wikimedia.org/T123274#1925086 (Dzahn) a: Dzahn [19:49:36] operations, Mail: Exim Alias Remove - ask - https://phabricator.wikimedia.org/T123274#1925073 (Dzahn) @JKrauska ok, cool, i'm taking them [19:50:42] operations, Mail: Remove exim alias - yuvipanda - https://phabricator.wikimedia.org/T123275#1925091 (JKrauska) NEW [19:51:09] operations, Mail: Exim Alias Remove - ask - https://phabricator.wikimedia.org/T123274#1925100 (Dzahn) @JKrauska what do you think about the Jimmy alias? can we do that one too? (T122927) [19:51:36] (CR) Yuvipanda: Send email to project admins if puppet runs are failing. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/262856 (https://phabricator.wikimedia.org/T121773) (owner: Andrew Bogott) [19:53:20] (PS2) Dzahn: partman:Changed lv_name from root to srv [puppet] - https://gerrit.wikimedia.org/r/263158 (owner: Papaul) [19:53:33] (PS3) Dzahn: partman:Changed lv_name from root to srv [puppet] - https://gerrit.wikimedia.org/r/263158 (owner: Papaul) [19:53:52] (CR) Dzahn: [C: 2] "this recipe only used by new pc20xx servers" [puppet] - https://gerrit.wikimedia.org/r/263158 (owner: Papaul) [19:55:14] PROBLEM - mediawiki originals uploads -hourly- for codfw-prod on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [3000.0] [19:55:57] (PS1) Merlijn van Deen: Revert "toollabs: install hunspell, python*-hunspell, hunspell-dictionary to exec nodes" [puppet] - https://gerrit.wikimedia.org/r/263404 [19:57:19] (PS2) Merlijn van Deen: Revert "toollabs: install hunspell, python*-hunspell, hunspell-dictionary to exec nodes" [puppet] - https://gerrit.wikimedia.org/r/263404 (https://phabricator.wikimedia.org/T123192) [19:57:25] YuviPanda: ^ +2?
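Ori's render URLs above return PNGs, but the same Graphite render endpoint will hand back the raw series as JSON (format=json), which is handier for pinpointing when the mw116* memory growth actually flattened. A hedged sketch, assuming the endpoint is reachable from where you run it:

```
# Fetch the job runner memory series discussed above as JSON.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

params = urlencode({
    "target": "servers.mw116*.memory.Active",
    "from": "-12hours",
    "format": "json",
})
with urlopen("http://graphite.wikimedia.org/render/?" + params) as resp:
    series = json.load(resp)

# Each entry is {"target": ..., "datapoints": [[value, timestamp], ...]}
for s in series:
    values = [v for v, _ts in s["datapoints"] if v is not None]
    if values:
        print(s["target"], "peak:", max(values))
```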
[19:58:07] (PS3) Yuvipanda: Revert "toollabs: install hunspell, python*-hunspell, hunspell-dictionary to exec nodes" [puppet] - https://gerrit.wikimedia.org/r/263404 (https://phabricator.wikimedia.org/T123192) (owner: Merlijn van Deen) [19:58:15] (CR) Yuvipanda: [C: 2 V: 2] Revert "toollabs: install hunspell, python*-hunspell, hunspell-dictionary to exec nodes" [puppet] - https://gerrit.wikimedia.org/r/263404 (https://phabricator.wikimedia.org/T123192) (owner: Merlijn van Deen) [19:58:21] (PS3) Dzahn: base: fix missing whitespaces in check_conntrack.py [puppet] - https://gerrit.wikimedia.org/r/262593 (owner: Hashar) [20:00:11] (CR) Dzahn: "applied on carbon" [puppet] - https://gerrit.wikimedia.org/r/263158 (owner: Papaul) [20:00:23] (CR) Dzahn: [C: 2] base: fix missing whitespaces in check_conntrack.py [puppet] - https://gerrit.wikimedia.org/r/262593 (owner: Hashar) [20:01:44] RECOVERY - mediawiki originals uploads -hourly- for codfw-prod on graphite1001 is OK: OK: Less than 80.00% above the threshold [2000.0] [20:03:35] (PS5) Dzahn: Fix wikidata redirect that come in via https to target https [puppet] - https://gerrit.wikimedia.org/r/255149 (https://phabricator.wikimedia.org/T119532) (owner: JanZerebecki) [20:13:39] operations, Performance-Team: Define SLAs for media - https://phabricator.wikimedia.org/T112692#1925162 (Gilles) [20:19:07] (CR) jenkins-bot: [V: -1] Rename all main WikimediaIncubator settings to have a wg prefix [mediawiki-config] - https://gerrit.wikimedia.org/r/207909 (owner: Paladox) [20:24:00] (CR) Dzahn: "before:" [puppet] - https://gerrit.wikimedia.org/r/255149 (https://phabricator.wikimedia.org/T119532) (owner: JanZerebecki) [20:24:20] (CR) Dzahn: [C: 1] "deployed on mw1017 (canary)" [puppet] - https://gerrit.wikimedia.org/r/255149 (https://phabricator.wikimedia.org/T119532) (owner: JanZerebecki) [20:25:06] (CR) JanZerebecki: [C: 1] "Manual request with X-Wikimedia-Debug: 1 looks good." [puppet] - https://gerrit.wikimedia.org/r/255149 (https://phabricator.wikimedia.org/T119532) (owner: JanZerebecki) [20:25:56] (PS7) Andrew Bogott: Send email to project admins if puppet runs are failing. [puppet] - https://gerrit.wikimedia.org/r/262856 (https://phabricator.wikimedia.org/T121773) [20:30:05] (PS2) JanZerebecki: Fix api redirect that come in via https to target https [puppet] - https://gerrit.wikimedia.org/r/255150 [20:32:11] operations, Mail: Exim Alias Remove - ask - https://phabricator.wikimedia.org/T123274#1925244 (Dzahn) Open>Resolved done. the ask@ alias has been deactivated now. [20:32:12] operations, Mail: Move most (all?)
exim personal aliases to OIT - https://phabricator.wikimedia.org/T122144#1925246 (Dzahn) [20:32:21] (PS3) JanZerebecki: Fix api redirect that come in via https to target https [puppet] - https://gerrit.wikimedia.org/r/255150 [20:32:23] (PS1) ArielGlenn: dumps: checkpointing for de and fr wikipedia [puppet] - https://gerrit.wikimedia.org/r/263411 (https://phabricator.wikimedia.org/T116907) [20:32:39] operations, Mail: Remove exim alias - yuvipanda - https://phabricator.wikimedia.org/T123275#1925256 (Dzahn) [20:32:45] operations, Mail: Remove exim alias - yuvipanda - https://phabricator.wikimedia.org/T123275#1925258 (Dzahn) a: Dzahn [20:35:13] (CR) Hashar: "Added labs root as reviewers :)" [puppet] - https://gerrit.wikimedia.org/r/262596 (owner: Hashar) [20:35:55] (CR) Hashar: [C: 1] "I rebased it to get rid of the parent dependency" [puppet] - https://gerrit.wikimedia.org/r/262597 (owner: Hashar) [20:36:03] (CR) ArielGlenn: [C: 2] dumps: checkpointing for de and fr wikipedia [puppet] - https://gerrit.wikimedia.org/r/263411 (https://phabricator.wikimedia.org/T116907) (owner: ArielGlenn) [20:37:12] (PS3) Andrew Bogott: toollabs: lint genpp.py [puppet] - https://gerrit.wikimedia.org/r/262596 (owner: Hashar) [20:39:48] (CR) Andrew Bogott: [C: 2] toollabs: lint genpp.py [puppet] - https://gerrit.wikimedia.org/r/262596 (owner: Hashar) [20:39:56] operations, Performance-Team, Wikimedia-General-or-Unknown, Patch-For-Review: Record per-job-type memory usage statistics - https://phabricator.wikimedia.org/T123284#1925277 (ori) NEW a: aaron [20:40:34] (CR) Hashar: "recheck" [puppet] - https://gerrit.wikimedia.org/r/244148 (https://phabricator.wikimedia.org/T114887) (owner: Hashar) [20:45:45] (PS1) Luke081515: Add exceptions for eswiki, eswikivoyage at 2016-01-19 [mediawiki-config] - https://gerrit.wikimedia.org/r/263414 (https://phabricator.wikimedia.org/T123261) [20:46:14] (PS1) Krinkle: Consistently use require_once for MWVersion.php [mediawiki-config] - https://gerrit.wikimedia.org/r/263415 [20:49:31] operations, Mail: remove gbyrd from exim alias file - https://phabricator.wikimedia.org/T123285#1925298 (Dzahn) NEW [20:50:37] (PS8) Andrew Bogott: Send email to project admins if puppet runs are failing. [puppet] - https://gerrit.wikimedia.org/r/262856 (https://phabricator.wikimedia.org/T121773) [20:51:06] operations, Mail: remove or update ea@ alias? - https://phabricator.wikimedia.org/T123286#1925304 (Dzahn) NEW [20:51:23] operations, Mail: remove or update ea@ alias? - https://phabricator.wikimedia.org/T123286#1925304 (Dzahn) [20:51:28] (CR) JanZerebecki: "Before:" [puppet] - https://gerrit.wikimedia.org/r/255150 (owner: JanZerebecki) [20:52:04] (CR) Andrew Bogott: [C: 2] Send email to project admins if puppet runs are failing.
[puppet] - https://gerrit.wikimedia.org/r/262856 (https://phabricator.wikimedia.org/T121773) (owner: Andrew Bogott) [20:52:14] (PS6) Dzahn: Fix wikidata redirect that come in via https to target https [puppet] - https://gerrit.wikimedia.org/r/255149 (https://phabricator.wikimedia.org/T119532) (owner: JanZerebecki) [20:53:12] operations, WMF-NDA-Requests: add ema to WMF-NDA and ops phabricator groups - https://phabricator.wikimedia.org/T123254#1925315 (Dzahn) adding @RobH [20:53:56] operations, WMF-NDA-Requests: add ema to WMF-NDA and ops phabricator groups - https://phabricator.wikimedia.org/T123254#1925321 (RobH) Anyone in the acl*operations-team can add someone else to the group. It doesn't need to be me. [20:54:12] mutante: ^ you should be able to add them [20:54:19] if you cannot, something isn't set right and let me know [20:54:35] operations, WMF-NDA-Requests: add ema to WMF-NDA and ops phabricator groups - https://phabricator.wikimedia.org/T123254#1925329 (Dzahn) i was asking if the group is correct since i wasn't involved in the switch to acl groups [20:54:38] (i didn't add that second part to my reply, doing so in edit but it won't echo in here) [20:55:35] operations, WMF-NDA-Requests: add ema to WMF-NDA and ops phabricator groups - https://phabricator.wikimedia.org/T123254#1925330 (RobH) Oh, correct. Addition to the #acl*operations-team is what allows access for operations restricted spaces (S4 #procurement) and #ops-access-reviews, etc.... [20:56:27] RobH: thanks, it was also for ema himself to see how it works [20:56:32] mutante: fyi: since the group now exists as an acl, i don't subscribe to the #operations project specifically anymore and my alerts (for the specific sub-projects I work on) are now actually useful =] [20:56:33] and i wasn't sure [20:56:46] I'd recommend the same to the new opsen =] [20:56:48] alright [20:57:06] (the #operations project as a whole is important for snapshots of overall tracking but less useful to attempt to follow imo) [20:59:21] operations: add ema to icinga (contact / paging) - https://phabricator.wikimedia.org/T123257#1925338 (Dzahn) the private puppet repository with the icinga contacts is on the puppetmaster, palladium.eqiad.wmnet in `modules/secret/secrets/nagios/contacts.cfg` git commit locally when making changes but it does... [21:00:04] gwicke cscott arlolra subbu bearND mdholloway: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160111T2100). Please do the needful. [21:00:22] operations, WMF-NDA-Requests: add ema to WMF-NDA and ops phabricator groups - https://phabricator.wikimedia.org/T123254#1925340 (Dzahn) thanks.
added ema to https://phabricator.wikimedia.org/project/members/29/ [21:00:28] no mobileapps deploy today [21:02:05] !log starting parsoid deploy [21:02:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:03:38] operations, WMF-NDA-Requests: add ema to WMF-NDA and ops phabricator groups - https://phabricator.wikimedia.org/T123254#1925352 (Dzahn) also added ema to https://phabricator.wikimedia.org/project/members/61/ [21:05:43] Ops-Access-Requests, operations: onboarding Emanuele Rocca - https://phabricator.wikimedia.org/T123089#1925356 (Dzahn) [21:05:44] operations, WMF-NDA-Requests: add ema to WMF-NDA and ops phabricator groups - https://phabricator.wikimedia.org/T123254#1925354 (Dzahn) Open>Resolved a: Dzahn [21:06:17] !log synced new code; restarted parsoid on wtp1003 as a canary [21:06:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:06:31] !log installing OS on pc200[4-6] [21:06:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:06:50] operations: add ema to ops mail aliases (exim) - https://phabricator.wikimedia.org/T123255#1925359 (Dzahn) for this, see `/root/private/modules/privateexim/files/wikimedia.org` on palladium.eqiad.wmnet [21:07:21] operations: add ema to ops mail aliases (exim) - https://phabricator.wikimedia.org/T123255#1925362 (Dzahn) [21:07:54] (PS1) Alex Monk: scap: Add wikishared (x1) support to sql command [puppet] - https://gerrit.wikimedia.org/r/263419 (https://phabricator.wikimedia.org/T123217) [21:08:11] operations, ops-codfw: rack/setup pc2004-2006 - https://phabricator.wikimedia.org/T121879#1925371 (Papaul) [21:09:02] operations: add ema to ops mailing lists - https://phabricator.wikimedia.org/T123256#1925372 (Dzahn) https://lists.wikimedia.org/mailman/listinfo/ops [21:09:31] Blocked-on-Operations, operations, RESTBase: Expand SSD space in Cassandra cluster - https://phabricator.wikimedia.org/T121575#1925375 (GWicke) p: Normal>High [21:09:41] (CR) Dzahn: [C: 2] Fix wikidata redirect that come in via https to target https [puppet] - https://gerrit.wikimedia.org/r/255149 (https://phabricator.wikimedia.org/T119532) (owner: JanZerebecki) [21:10:05] hoo: i'm considering deploying https://gerrit.wikimedia.org/r/#/c/253898/ after the services deploy is done. is that ok? will you be available if something goes wrong? [21:10:55] That means you'll start in 1h? [21:11:07] well, 50m [21:11:48] (CR) Alex Monk: [C: -1] "might have a better way to do this" [puppet] - https://gerrit.wikimedia.org/r/263419 (https://phabricator.wikimedia.org/T123217) (owner: Alex Monk) [21:11:55] hoo: y [21:12:38] Don't want to be around post 12 [21:12:49] If you could start a little earlier, that would be nice [21:13:11] !log finished deploying parsoid sha 07494cf2 [21:13:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:13:26] Service deploys are often finished earlier than planned [21:14:01] subbu: is the services deploy finished or is another one happening? [21:14:41] jzerebecki, gwicke said that they aren't deploying anything today. [21:14:42] parsoid is done. [21:14:58] greg-g: are you fine if I deploy a mediawiki backport https://gerrit.wikimedia.org/r/#/c/253898/ after the services one is done? [21:15:00] we are going to be doing some tests and verifications.
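The parsoid deploy above follows the canary pattern: sync the code, restart one backend (wtp1003), verify it, then roll out the rest. A hedged sketch of the verification step; the health URL and port are hypothetical stand-ins, not the real Parsoid deploy tooling:

```
# Poll a canary's health endpoint before continuing a rollout.
import time
from urllib.error import URLError
from urllib.request import urlopen

def wait_for_healthy(url: str, timeout: int = 120) -> bool:
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except URLError:
            pass
        time.sleep(5)
    return False

if wait_for_healthy("http://wtp1003.eqiad.wmnet:8000/"):
    print("canary healthy, roll out to the rest")
else:
    print("canary unhealthy, abort")
```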
[21:15:15] subbu: ok thx [21:16:19] jzerebecki: are you comfortable dealing with the dispatchChanges locking/etc during deploy? [21:18:20] greg-g: yes I plan to null the dispatchChanges maintenance script and kill the currently running ones to ensure it is not starting or running during deployment [21:19:01] !log pc200[4-6] - signing puppet certs, salt-key, initial run [21:19:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:19:21] papaul: :) [21:19:29] yes [21:20:03] yay @ partman work [21:22:10] papaul: congrats!!! [21:22:22] thanks [21:22:23] i mean, we thought it was going to work, but until it does.... [21:22:44] cmjohnson1: ^ the partman for the new pc systems is done [21:22:54] cool [21:22:56] thx [21:23:21] papaul: now you are the last opsen to touch partman recipes, if folks have questions they'll come to you ;] [21:23:32] lol [21:25:28] jzerebecki: kk, then yes [21:26:39] (PS1) Luke081515: Throttle exception at commonswiki/hewiki on 2016-01-12 [mediawiki-config] - https://gerrit.wikimedia.org/r/263427 (https://phabricator.wikimedia.org/T123161) [21:28:34] (CR) Luke081515: "I didn't remove the old throttle, because I don't want merge conflicts with https://gerrit.wikimedia.org/r/#/c/263414/" [mediawiki-config] - https://gerrit.wikimedia.org/r/263427 (https://phabricator.wikimedia.org/T123161) (owner: Luke081515) [21:32:06] (PS1) Andrew Bogott: Horizon: remove 'dashboards' and 'default_dashboard' settings. [puppet] - https://gerrit.wikimedia.org/r/263432 [21:33:21] (PS4) Dzahn: Fix api redirect that come in via https to target https [puppet] - https://gerrit.wikimedia.org/r/255150 (https://phabricator.wikimedia.org/T119532) (owner: JanZerebecki) [21:35:45] (CR) Samtar: [C: 1] "All good here" [mediawiki-config] - https://gerrit.wikimedia.org/r/263427 (https://phabricator.wikimedia.org/T123161) (owner: Luke081515) [21:35:53] (PS1) Tim Landscheidt: Tools: Remove obsolete variable $repo [puppet] - https://gerrit.wikimedia.org/r/263437 [21:37:45] operations, ops-codfw: rack/setup pc2004-2006 - https://phabricator.wikimedia.org/T121879#1925480 (Papaul) [21:39:07] operations, ops-codfw: rack/setup pc2004-2006 - https://phabricator.wikimedia.org/T121879#1925481 (Papaul) a: Papaul>jcrespo Installation complete. [21:39:50] operations, Traffic, Mobile, Varnish: Static image files from en.m.wikipedia.org are served with cache-suppressing headers - https://phabricator.wikimedia.org/T86993#1925493 (Jdlrobson) [21:39:57] operations, Traffic, Mobile, Varnish: Static image files from en.m.wikipedia.org are served with cache-suppressing headers - https://phabricator.wikimedia.org/T86993#981163 (Jdlrobson) [21:41:19] operations, Mail: Remove exim alias - yuvipanda - https://phabricator.wikimedia.org/T123275#1925500 (Dzahn) @yuvipanda ^ yuvipanda is used as the destination in other places in the alias file, like as a member of root@ and stuff like that. so that would have to be changed too (?) and Yuvipanda said he d... [21:44:31] mutante: do you know approximately how long it takes for a new system to show up on icinga ?
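jzerebecki's plan above (null the script, kill the running copies) is worth spelling out. A hedged sketch of the same two steps; the path mirrors the later !log entries, but treat this as an illustration rather than the actual procedure used on terbium:

```
# Stop Wikibase dispatchers for the duration of a deploy:
# 1. truncate the maintenance script so nothing can start fresh runs,
# 2. terminate the runs already in flight.
import os
import signal
import subprocess

SCRIPT = ("php-1.27.0-wmf.9/extensions/Wikidata/extensions/"
          "Wikibase/repo/maintenance/dispatchChanges.php")

with open(SCRIPT, "w") as f:
    f.write("<?php\n")  # an empty-but-valid PHP file

result = subprocess.run(["pgrep", "-f", "dispatchChanges.php"],
                        capture_output=True, text=True)
for pid in result.stdout.split():
    os.kill(int(pid), signal.SIGTERM)
```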
[21:46:09] papaul: I believe it's: puppet has to run on the affected host, then the master, then neon [21:46:16] something like that so however long for that weird stagger [21:46:21] I generally just reach out and do it [21:46:39] chasemp: thanks [21:47:06] papaul: one puppet run on the system itself and one on the icinga servers is needed [21:47:12] should be max 1h [21:47:28] (PS2) Luke081515: Throttle exception at commonswiki/hewiki on 2016-01-12 [mediawiki-config] - https://gerrit.wikimedia.org/r/263427 (https://phabricator.wikimedia.org/T123161) [21:47:31] mutante: thanks [21:50:49] (PS2) Alex Monk: scap: Add wikishared (x1) support to sql command [puppet] - https://gerrit.wikimedia.org/r/263419 (https://phabricator.wikimedia.org/T123217) [21:55:05] papaul: reload now. i ran puppet on neon [21:55:19] mutante: thanks [21:55:47] mutante: in pending mode [21:55:58] mutante: thanks [21:56:09] papaul: that should change in like 5 minutes [21:56:20] but it means puppet added the config [21:56:29] mutante: yep [21:57:59] anyone bored and wanna deploy a simple config change so an in-person editing session can happen tomorrow (in Haifa)? [21:58:02] https://gerrit.wikimedia.org/r/#/c/263427/ [21:58:04] :) [22:00:07] (CR) Hashar: [C: 1] "It is 11pm for me right now so too late to deploy. But this is all fine as far as I can tell." [mediawiki-config] - https://gerrit.wikimedia.org/r/263427 (https://phabricator.wikimedia.org/T123161) (owner: Luke081515) [22:03:20] operations, ops-codfw: ms-be2007 - System halted! Error: Integrated RAID - https://phabricator.wikimedia.org/T122844#1925580 (Papaul) a: Papaul [22:04:48] greg-g: i'm waiting on jenkins for my deploy, so I can do that first, after merge I only need to scap that one file, right? [22:05:01] greg-g: and you will test afterwards? [22:06:49] (CR) Alex Monk: [C: 1] Add exceptions for eswiki, eswikivoyage at 2016-01-19 [mediawiki-config] - https://gerrit.wikimedia.org/r/263414 (https://phabricator.wikimedia.org/T123261) (owner: Luke081515) [22:09:18] jzerebecki: right [22:10:45] * aude would like to see wikidata as part of the editing session, but ok....
;) [22:14:04] jzerebecki: can't really test it without being in or spoofing that IP :) [22:14:54] hoo_, aude: it seems Wikidata in core wmf9 is on an unexpected commit, that is not in wikidata wmf8 [22:14:57] :( [22:15:15] Sounds fun [22:15:43] yes a commit from master was used :( [22:15:53] wow [22:16:37] $wgWBClientSettings["sharedCacheKeyPrefix"] = "wikibase_shared/_mediawiki_extension_wikidata"; [22:16:50] yeah, that's pretty obviously not what should be where [22:16:53] hoo: I think I will create a wikidata wmf9 branch from that commit [22:17:04] Yeah, fork wmf8 [22:17:31] PROBLEM - DPKG on californium is CRITICAL: DPKG CRITICAL dpkg reports broken packages [22:17:35] we should also consider going away from the branch pick magic stuff and just make sure we got it right ourselves [22:17:45] or even at some point go back to the normal weekly schedule [22:19:21] jzerebecki: ugh :( [22:19:33] RECOVERY - DPKG on californium is OK: All packages OK [22:20:44] aude: jzerebecki: Guess we should revert 731742221ac2b19077265025564db11d35d17e6e [22:22:35] (PS1) ArielGlenn: dumps: mark partial dumps properly (was marking them as done) [dumps] (ariel) - https://gerrit.wikimedia.org/r/263524 [22:23:37] (CR) ArielGlenn: [C: 2 V: 2] dumps: mark partial dumps properly (was marking them as done) [dumps] (ariel) - https://gerrit.wikimedia.org/r/263524 (owner: ArielGlenn) [22:27:21] jzerebecki: just to be explicit: yeah, that'd be really nice of you to deploy that throttle exception, thank you. [22:28:10] greg-g: after sorting out the wikidata branching foo and deploying that change I originally intended... [22:28:21] hoo: please check https://gerrit.wikimedia.org/r/#/c/263527/ [22:29:00] Will look [22:30:18] jzerebecki: right right :) [22:31:48] jzerebecki: Diff looks ok, guess that's the least bad we can do right now [22:32:03] the cache keys are screwed, but changing them now wouldn't be nice either [22:32:17] so, this only touches the dispatching, good to go IMO [22:33:33] restbase servers have puppet breakage [22:33:46] nodejs package related [22:33:56] puppet-duct-tape --force [22:36:48] (PS2) ArielGlenn: puppetize dumps monitor as a service [puppet] - https://gerrit.wikimedia.org/r/257560 (https://phabricator.wikimedia.org/T110888) [22:38:05] !log jzerebecki@tin Synchronized php-1.27.0-wmf.9/extensions/Wikidata/extensions/Wikibase/repo/maintenance/dispatchChanges.php: truncating Wikidata dispatchChanges.php to stop dispatchers as preparation for https://gerrit.wikimedia.org/r/#/c/253898/ (duration: 00m 31s) [22:38:09] operations, RESTBase, RESTBase-architecture: restbase - nodejs package upgrade - puppet fail - https://phabricator.wikimedia.org/T123297#1925655 (Dzahn) NEW [22:38:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:39:08] operations, RESTBase, RESTBase-architecture: restbase - nodejs package upgrade - puppet fail - https://phabricator.wikimedia.org/T123297#1925671 (Dzahn) p: Triage>High ``` root@restbase1001:~# puppet agent -tv | grep Warning Error: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-c... [22:40:09] operations, RESTBase, RESTBase-architecture: restbase - nodejs package upgrade - puppet fail - https://phabricator.wikimedia.org/T123297#1925674 (Dzahn) ``` hi nodejs 0.10.29~dfsg-2 amd64 evented I/O for V8 javascript ii nodejs-dev 0.10.2...
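The wmf9-on-an-unexpected-commit surprise above is exactly the kind of thing a pre-deploy ancestry check catches: assert that the deployed submodule commit is reachable from the release branch you think you are shipping. A hedged sketch around plain git; the repo path and branch name are examples, not the actual deploy tooling:

```
# Verify a deployed commit sits on the expected release branch.
import subprocess

def is_ancestor(repo: str, commit: str, branch: str) -> bool:
    # exit status 0 means `commit` is an ancestor of `branch`
    result = subprocess.run(
        ["git", "-C", repo, "merge-base", "--is-ancestor", commit, branch]
    )
    return result.returncode == 0

if not is_ancestor("extensions/Wikidata", "HEAD", "origin/wmf/1.27.0-wmf.8"):
    print("deployed commit is NOT on the expected branch")
```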
[22:40:23] (CR) Yuvipanda: [C: 2] Tools: Remove obsolete variable $repo [puppet] - https://gerrit.wikimedia.org/r/263437 (owner: Tim Landscheidt) [22:40:24] !log dispatchChanges.php killed on terbium [22:40:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:40:37] !log restbase1001 - apt-get install nodejs [22:40:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:41:31] (PS3) ArielGlenn: puppetize dumps monitor as a service [puppet] - https://gerrit.wikimedia.org/r/257560 (https://phabricator.wikimedia.org/T110888) [22:41:43] RECOVERY - DPKG on restbase1001 is OK: All packages OK [22:42:08] operations, RESTBase, RESTBase-architecture: restbase - nodejs package upgrade - puppet fail - https://phabricator.wikimedia.org/T123297#1925677 (Dzahn) i manually ran `apt-get install nodejs` and ran puppet again. the error disappeared. was this update triggered by an ensure => latest or a human ? [22:42:32] RECOVERY - puppet last run on restbase1001 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [22:43:56] (PS1) Yuvipanda: eventlogging: Add timestamp to syncing script's logs [puppet] - https://gerrit.wikimedia.org/r/263531 [22:44:05] where is jenkins when I need it [22:44:22] RECOVERY - DPKG on restbase2002 is OK: All packages OK [22:44:31] RECOVERY - DPKG on restbase2005 is OK: All packages OK [22:44:32] RECOVERY - RAID on db1052 is OK: OK: optimal, 1 logical, 2 physical [22:44:42] RECOVERY - puppet last run on restbase2005 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [22:44:47] (CR) ArielGlenn: [C: 2] puppetize dumps monitor as a service [puppet] - https://gerrit.wikimedia.org/r/257560 (https://phabricator.wikimedia.org/T110888) (owner: ArielGlenn) [22:45:18] !log jzerebecki@tin Synchronized php-1.27.0-wmf.9/extensions/Wikidata/extensions/Wikibase/repo: deploying https://gerrit.wikimedia.org/r/#/c/253898/ with dispatchChanges.php still truncated (duration: 00m 33s) [22:45:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:45:29] diff --git a/modules/toollabs/manifests/init.pp b/modules/toollabs/manifests/init.pp [22:45:31] RECOVERY - puppet last run on restbase2002 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [22:45:37] I merged this ( YuviPanda? ) [22:45:43] apergos: oh yeah, thanks [22:45:46] yw [22:46:50] !log restbase1004, restbase2002, restbase2005 - manually install nodejs [22:46:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:47:02] !log jzerebecki@tin Synchronized php-1.27.0-wmf.9/extensions/Wikidata/extensions/Wikibase/repo/maintenance/dispatchChanges.php: restoring truncated Wikidata dispatchChanges.php to let dispatchers run again (duration: 00m 30s) [22:47:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:48:18] !log restart eventlogging_synch on dbstore1002 [22:48:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:49:48] (CR) JanZerebecki: [C: 2] "Deploying."
[mediawiki-config] - https://gerrit.wikimedia.org/r/263427 (https://phabricator.wikimedia.org/T123161) (owner: Luke081515) [22:50:03] RECOVERY - DPKG on restbase1004 is OK: All packages OK [22:50:24] last restbase [22:50:25] (Merged) jenkins-bot: Throttle exception at commonswiki/hewiki on 2016-01-12 [mediawiki-config] - https://gerrit.wikimedia.org/r/263427 (https://phabricator.wikimedia.org/T123161) (owner: Luke081515) [22:50:43] operations, Gitblit-Deprecate, Phabricator, Repository-Admins, and 2 others: Make @catrope a repository administrator in Phabricator - https://phabricator.wikimedia.org/T836#1925795 (greg) [22:50:48] operations, RESTBase, RESTBase-architecture: restbase - nodejs package upgrade - puppet fail - https://phabricator.wikimedia.org/T123297#1925797 (Dzahn) 14:45 < icinga-wm> RECOVERY - DPKG on restbase1001 is OK: All packages OK ii nodejs 4.2.4~dfsg-1~bpo8+1 amd64... [22:51:32] RECOVERY - puppet last run on restbase1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:51:48] (PS2) Yuvipanda: eventlogging: Add timestamp to syncing script's logs [puppet] - https://gerrit.wikimedia.org/r/263531 [22:51:59] can someone spot check https://gerrit.wikimedia.org/r/#/c/263531/ [22:52:20] mutante: ^ [22:52:33] !log jzerebecki@tin Synchronized wmf-config/throttle.php: deploying https://gerrit.wikimedia.org/r/#/c/263427/ (duration: 00m 30s) [22:52:35] sorry, i'm busy fixing restbase [22:52:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:52:44] morebots: ok [22:52:44] I am a logbot running on tools-exec-1214. [22:52:44] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [22:52:44] To log a message, type !log <msg>. [22:52:47] i don't know what triggered that upgrade [22:52:47] err [22:53:00] (CR) Yuvipanda: [C: 2 V: 2] eventlogging: Add timestamp to syncing script's logs [puppet] - https://gerrit.wikimedia.org/r/263531 (owner: Yuvipanda) [22:53:04] greg-g: https://gerrit.wikimedia.org/r/#/c/263427/ is done [22:53:21] jzerebecki: thanks! [22:54:22] hoo: wikidatawiki dispatchers are already running again... [22:54:29] godog: https://phabricator.wikimedia.org/T123297 [22:55:27] Looks good at a glance :) [22:55:52] Guess we'll see whether it really works around the next edit peak [22:57:03] operations, Continuous-Integration-Infrastructure, Labs, Labs-Infrastructure, and 2 others: Nodepool deadlocks when querying unresponsive OpenStack API (was: rake-jessie jobs stuck due to no ci-jessie-wikimedia slaves being attached to Jenkins) - https://phabricator.wikimedia.org/T122731#1925848 (gr... [22:57:15] operations, Continuous-Integration-Scaling, WorkType-NewFunctionality: Upload new Zuul packages on apt.wikimedia.org for Precise / Trusty / Jessie - https://phabricator.wikimedia.org/T118340#1925855 (greg) [22:57:43] jzerebecki: Can we now revert that change or find some other way to make sure the right version gets deployed?
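The eventlogging patch reviewed above ("Add timestamp to syncing script's logs") is a one-liner in spirit. A hedged guess at what it amounts to, via the standard logging module; the real change may do it differently:

```
# Prefix each log line with a timestamp.
import logging

logging.basicConfig(
    format="%(asctime)s %(levelname)s %(message)s",
    datefmt="%Y-%m-%dT%H:%M:%S",
    level=logging.INFO,
)
logging.info("synced batch")  # -> 2016-01-11T22:52:00 INFO synced batch
```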
[22:57:47] (PS1) ArielGlenn: dumps: add role class for dumps monitor [puppet] - https://gerrit.wikimedia.org/r/263533 [22:58:13] operations, RESTBase, RESTBase-architecture: restbase - nodejs package upgrade - puppet fail - https://phabricator.wikimedia.org/T123297#1925921 (Dzahn) [23:00:51] (CR) ArielGlenn: [C: 2] dumps: add role class for dumps monitor [puppet] - https://gerrit.wikimedia.org/r/263533 (owner: ArielGlenn) [23:00:56] hoo: yea i'm uploading a change [23:01:33] operations, RESTBase, RESTBase-architecture: restbase - nodejs package upgrade - puppet fail - https://phabricator.wikimedia.org/T123297#1925948 (Dzahn) p: High>Normal lowering priority because manual fix has been applied [23:01:56] thanks :) [23:06:15] (PS1) ArielGlenn: dumps: apply dumps monitor role to snapshot1004 [puppet] - https://gerrit.wikimedia.org/r/263536 [23:07:02] (CR) Andrew Bogott: [C: 2] Horizon: remove 'dashboards' and 'default_dashboard' settings. [puppet] - https://gerrit.wikimedia.org/r/263432 (owner: Andrew Bogott) [23:07:12] (PS2) Andrew Bogott: Horizon: remove 'dashboards' and 'default_dashboard' settings. [puppet] - https://gerrit.wikimedia.org/r/263432 [23:08:11] (CR) ArielGlenn: [C: 2] dumps: apply dumps monitor role to snapshot1004 [puppet] - https://gerrit.wikimedia.org/r/263536 (owner: ArielGlenn) [23:08:15] operations: add slien to jimmy alias - https://phabricator.wikimedia.org/T122927#1926005 (CaitVirtue) Adding myself as a subscriber. Please give Sam access to the alias ASAP. I'm happy to talk with Ops/OIT about a better way to handle this alias in the future. Anyone interested can schedule time with me st... [23:08:45] (PS3) Andrew Bogott: Horizon: remove 'dashboards' and 'default_dashboard' settings. [puppet] - https://gerrit.wikimedia.org/r/263432 [23:09:55] operations: add slien to jimmy alias - https://phabricator.wikimedia.org/T122927#1926021 (eliza) Hi there, Looks like a Google Group has been created for this. Samantha will now be receiving messages as well. Ops - you could close this ticket. Thank you, Eliza [23:14:52] PROBLEM - puppet last run on snapshot1004 is CRITICAL: CRITICAL: puppet fail [23:20:22] (PS1) ArielGlenn: dumps: declare monitor class with full name, duh [puppet] - https://gerrit.wikimedia.org/r/263540 [23:21:23] (CR) ArielGlenn: [C: 2] dumps: declare monitor class with full name, duh [puppet] - https://gerrit.wikimedia.org/r/263540 (owner: ArielGlenn) [23:25:04] operations, Analytics-Backlog, ContentTranslation-Analytics, MediaWiki-extensions-ContentTranslation: schedule a daily run of ContentTranslation analytics scripts on terbium - https://phabricator.wikimedia.org/T122479#1926062 (Dzahn) [23:25:08] Ops-Access-Requests, operations, Analytics, ContentTranslation-Analytics, and 2 others: access for amire80 to stat1002.eqiad.wmnet - https://phabricator.wikimedia.org/T122524#1926059 (Dzahn) Open>Resolved a: Dzahn thanks for the confirmation.
resolving [23:25:12] RECOVERY - puppet last run on snapshot1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:25:30] Ops-Access-Requests, operations, Analytics, ContentTranslation-Analytics, MediaWiki-extensions-ContentTranslation: access for amire80 to stat1002.eqiad.wmnet - https://phabricator.wikimedia.org/T122524#1926063 (Dzahn) [23:27:26] (PS1) ArielGlenn: dumps: add config file arg to dumps monitor startup scripts [puppet] - https://gerrit.wikimedia.org/r/263541 [23:29:07] (CR) ArielGlenn: [C: 2] dumps: add config file arg to dumps monitor startup scripts [puppet] - https://gerrit.wikimedia.org/r/263541 (owner: ArielGlenn) [23:31:32] operations, Deployment-Systems, Traffic: Varnish cache busting desired for /static/$VERSION/ resources which change within the lifetime of a branch - https://phabricator.wikimedia.org/T99096#1926077 (mmodell) >>! In T99096#1920989, @Krinkle wrote: > The alternative idea (of collecting files in a centr... [23:33:11] (PS5) JanZerebecki: Fix api redirect that come in via https to target https [puppet] - https://gerrit.wikimedia.org/r/255150 (https://phabricator.wikimedia.org/T119532) [23:33:54] (PS6) JanZerebecki: Fix api redirect that come in via https to target https [puppet] - https://gerrit.wikimedia.org/r/255150 (https://phabricator.wikimedia.org/T119532) [23:39:12] PROBLEM - puppet last run on mw1108 is CRITICAL: CRITICAL: Puppet has 1 failures [23:39:32] (PS1) ArielGlenn: dumps: don't add config file in $HOME if there is no such value [dumps] (ariel) - https://gerrit.wikimedia.org/r/263544 [23:39:57] (PS1) Milimetric: Blacklist MobileWebSectionUsage from mysql [puppet] - https://gerrit.wikimedia.org/r/263545 [23:40:32] (CR) Nuria: [C: 1] Blacklist MobileWebSectionUsage from mysql [puppet] - https://gerrit.wikimedia.org/r/263545 (owner: Milimetric) [23:41:19] operations, Mail: URGENT - remove jimmy@ alias from exim mail aliases - https://phabricator.wikimedia.org/T123315#1926094 (JKrauska) NEW [23:42:13] (CR) Dzahn: [C: -1] "before:" [puppet] - https://gerrit.wikimedia.org/r/255150 (https://phabricator.wikimedia.org/T119532) (owner: JanZerebecki) [23:43:14] operations, Mail: URGENT - remove jimmy@ alias from exim mail aliases - https://phabricator.wikimedia.org/T123315#1926104 (Dzahn) a: Dzahn [23:44:19] (PS2) Yuvipanda: Blacklist MobileWebSectionUsage from mysql [puppet] - https://gerrit.wikimedia.org/r/263545 (owner: Milimetric) [23:44:26] (CR) Yuvipanda: [C: 2 V: 2] Blacklist MobileWebSectionUsage from mysql [puppet] - https://gerrit.wikimedia.org/r/263545 (owner: Milimetric) [23:44:47] (CR) ArielGlenn: [C: 2 V: 2] dumps: don't add config file in $HOME if there is no such value [dumps] (ariel) - https://gerrit.wikimedia.org/r/263544 (owner: ArielGlenn) [23:44:52] (CR) Dzahn: "it fixes one of the links but not all" [puppet] - https://gerrit.wikimedia.org/r/255150 (https://phabricator.wikimedia.org/T119532) (owner: JanZerebecki) [23:45:01] operations: add slien to jimmy alias - https://phabricator.wikimedia.org/T122927#1926105 (eliza) Hello Ops Take that back. The exim alias for jimmy@ needs to be removed. Please do not close until this has been done.
Eliza [23:45:22] operations: add slien to jimmy alias - https://phabricator.wikimedia.org/T122927#1926109 (Dzahn) a: Dzahn [23:45:36] operations, Mail: URGENT - remove jimmy@ alias from exim mail aliases - https://phabricator.wikimedia.org/T123315#1926111 (Dzahn) [23:45:37] operations: add slien to jimmy alias - https://phabricator.wikimedia.org/T122927#1916901 (Dzahn) [23:46:50] operations: add slien to jimmy alias - https://phabricator.wikimedia.org/T122927#1916901 (Dzahn) Hi @Eliza, i have removed the alias on our side in T123315 [23:47:10] operations: add slien to jimmy alias - https://phabricator.wikimedia.org/T122927#1926117 (Dzahn) [23:47:12] operations, Mail: Move most (all?) exim personal aliases to OIT - https://phabricator.wikimedia.org/T122144#1926118 (Dzahn) [23:47:13] operations, Mail: URGENT - remove jimmy@ alias from exim mail aliases - https://phabricator.wikimedia.org/T123315#1926115 (Dzahn) Open>Resolved This has happened now. The alias has been deactivated on our side. [23:47:40] operations: add slien to jimmy alias - https://phabricator.wikimedia.org/T122927#1926119 (Dzahn) Open>Resolved [23:50:15] operations: add slien to jimmy alias - https://phabricator.wikimedia.org/T122927#1926121 (eliza) Thank you Eliza [23:51:10] operations, Mail: remove or update ea@ alias? - https://phabricator.wikimedia.org/T123286#1926125 (Dzahn) a: JKrauska [23:53:54] operations: add slien to jimmy alias - https://phabricator.wikimedia.org/T122927#1926129 (JKrauska) close this task -- accomplished using a google group [23:54:52] PROBLEM - puppet last run on mw2168 is CRITICAL: CRITICAL: Puppet has 1 failures [23:57:42] PROBLEM - puppet last run on netmon1001 is CRITICAL: CRITICAL: Puppet has 1 failures [23:57:53] (CR) JanZerebecki: "Yes, that it doesn't change the first 2 is intentional. As long as the varnish TLS redirect is in place those never occur and can only be " [puppet] - https://gerrit.wikimedia.org/r/255150 (https://phabricator.wikimedia.org/T119532) (owner: JanZerebecki) [23:58:31] YuviPanda: was that failure because of my patch?! ^ [23:58:46] no, that's mw and netmon, sorry [23:59:26] milimetric: :) [23:59:35] milimetric: let me know if the blacklist worked [23:59:40] (PS1) Milimetric: Revert "Blacklist MobileWebSectionUsage from mysql" [puppet] - https://gerrit.wikimedia.org/r/263549
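The "Blacklist MobileWebSectionUsage from mysql" change merged above (and reverted in the final line of the log) gates a high-volume schema out of the MySQL consumer. A hedged sketch of that kind of consumer-side filtering; the event shape here is invented for illustration and is not EventLogging's real interface:

```
# Drop blacklisted schemas before they reach the database.
import re

BLACKLIST = re.compile(r"^MobileWebSectionUsage$")

def should_insert(event: dict) -> bool:
    return not BLACKLIST.match(event.get("schema", ""))

events = [{"schema": "MobileWebSectionUsage"}, {"schema": "Edit"}]
print([e["schema"] for e in events if should_insert(e)])  # ['Edit']
```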