[00:00:04] RoanKattouw ostriches Krenair: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160114T0000). Please do the needful.
[00:00:04] RoanKattouw tgr Krinkle ebernhardson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[00:00:29] o/
[00:00:58] oh
[00:01:01] \o
[00:01:03] just as I was about to mess with wikitech
[00:01:19] oh well, I suppose I can live hack + sync to silver
[00:02:31] (CR) Mobrovac: [C: 1] parsoid-rt-client: Save the localsettings file in the parsoid repo [puppet] - https://gerrit.wikimedia.org/r/264019 (owner: Subramanya Sastry)
[00:03:17] tgr, woah, you want to roll out a new feature as part of a swat deployment?
[00:03:49] Krenair: it's rolled out with the train
[00:04:05] the config patch just disables it for non-SUL wikis
[00:04:21] it's rolled out with the train next week, to be exact
[00:04:37] it's default on?
[00:04:44] yes
[00:05:31] operations, Mail: remove exim aliases - mgodwin - https://phabricator.wikimedia.org/T123561#1933493 (Dzahn) Open>Resolved done. ``` -# Mike Godwin -mgodwin: legal -godwin: legal -mnemonic: legal -mike: legal - ```
[00:05:32] operations, Mail: Move most (all?)
exim personal aliases to OIT - https://phabricator.wikimedia.org/T122144#1933495 (Dzahn)
[00:06:44] (PS2) Alex Monk: Configure bot passwords [mediawiki-config] - https://gerrit.wikimedia.org/r/263804 (https://phabricator.wikimedia.org/T123451) (owner: Gergő Tisza)
[00:06:55] (CR) Alex Monk: [C: 2] Configure bot passwords [mediawiki-config] - https://gerrit.wikimedia.org/r/263804 (https://phabricator.wikimedia.org/T123451) (owner: Gergő Tisza)
[00:07:21] (Merged) jenkins-bot: Configure bot passwords [mediawiki-config] - https://gerrit.wikimedia.org/r/263804 (https://phabricator.wikimedia.org/T123451) (owner: Gergő Tisza)
[00:07:26] uhhh
[00:07:34] we really need to document how and why to use the $wg = $wmg hack
[00:07:38] it isn't needed in this case ;)
[00:07:42] operations, Mail: Move most (all?) exim personal aliases to OIT - https://phabricator.wikimedia.org/T122144#1933500 (Dzahn)
[00:08:33] !log krenair@tin Synchronized php-1.27.0-wmf.10/extensions/Echo/modules/echo.variables.less: https://gerrit.wikimedia.org/r/#/c/263767/ (duration: 00m 45s)
[00:08:40] RoanKattouw_away, ^
[00:08:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:08:42] (CR) Mobrovac: Add parsoid::testing role and use it on ruthenium (2 comments) [puppet] - https://gerrit.wikimedia.org/r/264024 (owner: Subramanya Sastry)
[00:08:45] legoktm: see the gerrit comments
[00:09:57] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/263804/ (duration: 00m 31s)
[00:10:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:10:04] (PS1) John Vandenberg: Sync ps_mem.py from origin [puppet] - https://gerrit.wikimedia.org/r/264028
[00:10:22] tgr: I think adding a comment would have been better than 3 unneeded global variables.
[00:10:45] operations, Mail: remove gbyrd from exim alias file - https://phabricator.wikimedia.org/T123285#1933508 (JKrauska) he's only included on box6699: mdennis, gbyrd, archive01 this is fine to remove.. (just his name here, not necessarily the alias -- will sync with mdennis)
[00:11:03] (CR) jenkins-bot: [V: -1] Sync ps_mem.py from origin [puppet] - https://gerrit.wikimedia.org/r/264028 (owner: John Vandenberg)
[00:11:29] $wmg* is global?
[00:11:32] !log krenair@tin Synchronized wmf-config/CommonSettings.php: https://gerrit.wikimedia.org/r/#/c/263804/ (duration: 00m 31s)
[00:11:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:11:44] tgr, ^
[00:11:55] Krenair: thanks; it's a noop for now
[00:12:49] legoktm: I can fix that next SWAT if you feel strongly about it
[00:16:47] (CR) John Vandenberg: "note that this is the only pyflakes violation in modules/admin , so IMO this fix is fairly high priority." (2 comments) [puppet] - https://gerrit.wikimedia.org/r/264028 (owner: John Vandenberg)
[00:17:33] PROBLEM - HHVM rendering on mw1142 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:18:53] PROBLEM - Apache HTTP on mw1142 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:19:22] PROBLEM - Check size of conntrack table on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:19:23] PROBLEM - puppet last run on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:19:33] PROBLEM - RAID on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:19:42] PROBLEM - Disk space on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:19:52] PROBLEM - dhclient process on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:20:03] PROBLEM - nutcracker port on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:20:14] PROBLEM - configured eth on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:20:22] PROBLEM - DPKG on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:20:32] PROBLEM - nutcracker process on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:21:03] PROBLEM - HHVM processes on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:21:04] PROBLEM - salt-minion processes on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:21:12] PROBLEM - SSH on mw1142 is CRITICAL: Server answer
[00:21:43] RECOVERY - dhclient process on mw1142 is OK: PROCS OK: 0 processes with command name dhclient
[00:21:53] ebernhardson, hey
[00:22:00] Krenair: wmf9 looks to be a preexisting error in someone's qunit outside cirrus
[00:22:02] RECOVERY - nutcracker port on mw1142 is OK: TCP OK - 0.000 second response time on port 11212
[00:22:10] ebernhardson, you are ready for wmf10 though right?
[00:22:12] RECOVERY - configured eth on mw1142 is OK: OK - interfaces up
[00:22:22] Krenair: yes
[00:25:25] operations, Mail: remove exim alias - rob - https://phabricator.wikimedia.org/T123602#1933544 (JKrauska) NEW a:Dzahn
[00:25:39] Krenair: wmf9 should be good to go as well
[00:26:00] the qunit is completely unrelated to my patch
[00:26:06] wmf10 is basically done
[00:26:14] see https://gerrit.wikimedia.org/r/#/c/264029/
[00:26:16] it's stuck on a single host
[00:26:18] probably mw1142
[00:26:26] judging from those alerts above
[00:26:37] (PS3) Subramanya Sastry: Add parsoid::testing role and use it on ruthenium [puppet] - https://gerrit.wikimedia.org/r/264024
[00:26:39] (PS1) Subramanya Sastry: WIP: Add the visualdiff module; instantiate psd visualdiffing services [puppet] - https://gerrit.wikimedia.org/r/264032
[00:28:12] PROBLEM - dhclient process on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:28:22] (CR) jenkins-bot: [V: -1] WIP: Add the visualdiff module; instantiate psd visualdiffing services [puppet] - https://gerrit.wikimedia.org/r/264032 (owner: Subramanya Sastry)
[00:28:23] PROBLEM - nutcracker port on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:28:33] PROBLEM - configured eth on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:29:03] (CR) Subramanya Sastry: Add parsoid::testing role and use it on ruthenium (2 comments) [puppet] - https://gerrit.wikimedia.org/r/264024 (owner: Subramanya Sastry)
[00:29:21] yeah: krenair 31081 30394 0 00:24 ? 00:00:00 /usr/bin/ssh -oBatchMode=yes -oSetupTimeout=10 -F/dev/null -lmwdeploy mw1142.eqiad.wmnet [...]
[00:29:23] RECOVERY - HHVM processes on mw1142 is OK: PROCS OK: 6 processes with command name hhvm
[00:29:23] RECOVERY - salt-minion processes on mw1142 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[00:30:13] !log krenair@tin Synchronized php-1.27.0-wmf.10/extensions/CirrusSearch/includes: https://gerrit.wikimedia.org/r/#q,263991,n,z (duration: 06m 08s)
[00:30:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:30:21] (killed that process in the end)
[00:30:25] operations, Mail: Move most (all?) exim personal aliases to OIT - https://phabricator.wikimedia.org/T122144#1933569 (Dzahn)
[00:30:26] operations, Mail: remove gbyrd from exim alias file - https://phabricator.wikimedia.org/T123285#1933567 (Dzahn) Open>Resolved done. ``` -box6699: mdennis, gbyrd, archive01 +box6699: mdennis, archive01 ```
[00:30:27] ori, can you share your .vimrc settings for puppet .. "autocmd BufNewFile,BufRead *.pp :set ts=4 sw=4 expandtab" doesn't seem to be getting vim to expand tabs in pp files.. lint continues to be unhappy with me.
[00:31:49] uhhh... Krinkle
[00:32:04] Is https://gerrit.wikimedia.org/r/#/c/264033/ the fix for ebernhardson's wmf9 issue?
[00:32:32] RECOVERY - dhclient process on mw1142 is OK: PROCS OK: 0 processes with command name dhclient
[00:33:12] subbu: https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/modules/admin/files/home/ori/.vimrc
[00:33:17] mutante, I think mw1142 is broken, mind taking a look?
[00:33:22] Krenair: sure
[00:33:39] mutante, thanks
[00:34:33] RECOVERY - nutcracker port on mw1142 is OK: TCP OK - 0.000 second response time on port 11212
[00:34:52] RECOVERY - configured eth on mw1142 is OK: OK - interfaces up
[00:34:53] RECOVERY - DPKG on mw1142 is OK: All packages OK
[00:34:54] Krenair: i just logged in ,, didnt do anything
[00:35:03] RECOVERY - nutcracker process on mw1142 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[00:35:05] yeah
[00:35:10] it's suddenly started letting me in
[00:35:28] before, I got ssh_exchange_identification: Connection closed by remote host
[00:35:32] high load , hhvm
[00:35:33] RECOVERY - Apache HTTP on mw1142 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 3.186 second response time
[00:35:53] RECOVERY - SSH on mw1142 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0)
[00:36:02] RECOVERY - Check size of conntrack table on mw1142 is OK: OK: nf_conntrack is 1 % full
[00:36:13] RECOVERY - RAID on mw1142 is OK: OK: no RAID installed
[00:36:14] RECOVERY - Disk space on mw1142 is OK: DISK OK
[00:36:22] RECOVERY - HHVM rendering on mw1142 is OK: HTTP OK: HTTP/1.1 200 OK - 67548 bytes in 1.854 second response time
[00:36:30] 2519 Jan 14 00:34:26 mw1142 kernel: [34603565.282270] init: hhvm main process (17411) killed by KILL signal
[00:36:54] subbu: I have "au BufRead,BufNewFile *.pp setlocal ft=puppet au", "Filetype puppet setlocal ts=4 sw=4 sts=4 et tw=80 sta", but I also have https://github.com/rodjek/vim-puppet installed
[00:37:48] mw1142 kernel: [34603195.163796] Out of memory: Kill process 17411 (hhvm) score 930 or sacrifice child
[00:38:03] and when it was killed ..
the server came back
[00:38:56] Krinkle, ping, I want to get ebernhardson's patch done...
[00:39:28] * ebernhardson would
[00:39:34] * ebernhardson would V+2 :P
[00:39:35] operations, Mail: Move most (all?) exim personal aliases to OIT - https://phabricator.wikimedia.org/T122144#1933577 (Dzahn)
[00:39:36] operations, Mail: remove exim aliases -- usability, usability team - https://phabricator.wikimedia.org/T123575#1933575 (Dzahn) Open>Resolved done ``` -# Usability project - -usabilityteam: pvora, tparscal, flaxxen@googlemail.com, roan.kattouw@gmail.com, aaron.wright@gmail.com, ngautam, rlane32, ami...
[00:40:27] ebernhardson, I did V+2 it, problem is another patch which I didn't approve has also been merged
[00:40:34] (PS1) Yuvipanda: Log webservice invocations to EL [puppet] - https://gerrit.wikimedia.org/r/264035 (https://phabricator.wikimedia.org/T123444)
[00:40:35] oh :S
[00:41:39] ori, thanks .. something else is broken in my .vimrc i think.
[00:42:00] i won't go down that rabbithole now.
[00:42:56] (CR) Yuvipanda: [C: 2] Log webservice invocations to EL [puppet] - https://gerrit.wikimedia.org/r/264035 (https://phabricator.wikimedia.org/T123444) (owner: Yuvipanda)
[00:43:16] operations, Mail: remove exim alias - mkahn - https://phabricator.wikimedia.org/T123562#1933581 (Dzahn) Open>Resolved done ``` -# Marlita Kahn -marlita: mkahn - ```
[00:43:17] operations, Mail: Move most (all?) exim personal aliases to OIT - https://phabricator.wikimedia.org/T122144#1933583 (Dzahn)
[00:44:22] RECOVERY - puppet last run on mw1142 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[00:44:56] Krenair: well, i have to run in a cpl minutes to catch a train....i guess i'll just wait for train...
[00:45:18] (two different trains :P train home and deploy train)
[00:50:54] operations, Mail: remove exim alias - comp_committee - https://phabricator.wikimedia.org/T123605#1933594 (JKrauska) NEW a:Dzahn
[00:51:03] ebernhardson, still around?
[00:55:48] meh
[00:55:49] reverting
[00:57:04] (PS1) Yuvipanda: toolschecker: Should use webservice rather than webservice2 [puppet] - https://gerrit.wikimedia.org/r/264039
[00:57:34] also reverted the patch that got in my way
[00:57:52] (CR) Yuvipanda: [C: 2 V: 2] toolschecker: Should use webservice rather than webservice2 [puppet] - https://gerrit.wikimedia.org/r/264039 (owner: Yuvipanda)
[00:58:32] PROBLEM - SSH on mw1015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:58:34] PROBLEM - RAID on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:58:43] PROBLEM - nutcracker process on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:58:55] ori can you take a look at and +2 https://gerrit.wikimedia.org/r/#/c/264019 ?
[00:59:12] PROBLEM - RAID on mw1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:59:13] PROBLEM - DPKG on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:59:13] PROBLEM - puppet last run on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:59:14] (PS2) Ori.livneh: parsoid-rt-client: Save the localsettings file in the parsoid repo [puppet] - https://gerrit.wikimedia.org/r/264019 (owner: Subramanya Sastry)
[00:59:23] (CR) Ori.livneh: [C: 2 V: 2] parsoid-rt-client: Save the localsettings file in the parsoid repo [puppet] - https://gerrit.wikimedia.org/r/264019 (owner: Subramanya Sastry)
[00:59:34] PROBLEM - salt-minion processes on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:59:34] PROBLEM - configured eth on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:59:50] thx
[00:59:57] now it's mw1015 and mw1009 causing problems
[01:00:05] twentyafterfour: Respected human, time to deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160114T0100). Please do the needful.
[01:00:13] !log krenair@tin Synchronized php-1.27.0-wmf.10/extensions/VisualEditor/extension.json: https://gerrit.wikimedia.org/r/#/c/264031/ (duration: 01m 35s)
[01:00:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:00:33] RECOVERY - SSH on mw1015 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0)
[01:01:33] mutante, same problem with these?
[01:01:40] (PS1) ArielGlenn: rewrite pagerange.py so it's both fast and useful [dumps] (ariel) - https://gerrit.wikimedia.org/r/264040 (https://phabricator.wikimedia.org/T123571)
[01:01:53] PROBLEM - puppet last run on mw1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:02:22] PROBLEM - dhclient process on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:03:03] RECOVERY - nutcracker process on mw1015 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[01:04:32] RECOVERY - dhclient process on mw1015 is OK: PROCS OK: 0 processes with command name dhclient
[01:05:32] RECOVERY - RAID on mw1009 is OK: OK: no RAID installed
[01:05:42] mwdeploy@mw1009:~$ sync-common
[01:05:43] -bash: sync-common: command not found
[01:05:44] uh, what?
[01:05:53] RECOVERY - puppet last run on mw1009 is OK: OK: Puppet is currently enabled, last run 39 minutes ago with 0 failures
[01:06:23] !log mw1009 - restarted hhvm
[01:06:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:07:03] PROBLEM - SSH on mw1015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:07:27] operations, Performance-Team, Wikimedia-General-or-Unknown, Patch-For-Review, and 3 others: jobrunner memory leaks - https://phabricator.wikimedia.org/T122069#1933648 (ori) I think I have it isolated -- it's `XMLReader::expand`: ```name=T122069.php ^ AaronSchulz
[01:07:45] isn't sync-common supposed to be present on that host?
[01:08:12] RECOVERY - salt-minion processes on mw1015 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[01:10:32] Krenair: those are jobrunners
[01:10:51] that will be the difference with sync-common i think
[01:11:01] I thought it existed on all mw hosts
[01:11:12] RECOVERY - SSH on mw1015 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0)
[01:11:13] ori: isn't that just https://github.com/facebook/hhvm/issues/3899 ?
[01:11:14] for some reason
[01:11:39] Krenair: i think the bug ori is talking about is the same thing we see
[01:11:44] RECOVERY - DPKG on mw1015 is OK: All packages OK
[01:11:53] RECOVERY - puppet last run on mw1015 is OK: OK: Puppet is currently enabled, last run 30 minutes ago with 0 failures
[01:12:12] RECOVERY - configured eth on mw1015 is OK: OK - interfaces up
[01:12:31] (CR) Andrew Bogott: [C: 1] "Wow!"
[mediawiki-config] - https://gerrit.wikimedia.org/r/264023 (owner: Reedy)
[01:13:24] Krenair: probably from mediawiki::appserver the scap utils are included
[01:13:28] (CR) John Vandenberg: Add flake8 rule for selected modules (1 comment) [puppet] - https://gerrit.wikimedia.org/r/263866 (owner: John Vandenberg)
[01:14:33] ok
[01:14:38] you have to do the full path I think
[01:15:23] RECOVERY - RAID on mw1015 is OK: OK: no RAID installed
[01:16:42] (CR) John Vandenberg: Add flake8 rule for selected modules (3 comments) [puppet] - https://gerrit.wikimedia.org/r/263866 (owner: John Vandenberg)
[01:22:52] PROBLEM - DPKG on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:23:33] PROBLEM - SSH on mw1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:23:44] PROBLEM - RAID on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:24:13] PROBLEM - puppet last run on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:27:56] !log krenair@tin rebuilt wikiversions.php and synchronized wikiversions files: (no message)
[01:28:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:28:43] PROBLEM - puppet last run on mw1116 is CRITICAL: CRITICAL: Puppet has 6 failures
[01:29:03] RECOVERY - DPKG on mw1005 is OK: All packages OK
[01:29:17] Reedy, wikitech rolled back to wmf.9
[01:29:52] Krenair: I already verified earlier. Sorry for the delay.
[01:29:53] RECOVERY - SSH on mw1005 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0)
[01:29:54] andrewbogott, ^
[01:30:07] I saw it going out and tested it, forgot to ping you back
[01:30:22] the VE patch?
[01:30:56] Krenair: thanks
[01:31:51] Krenair: seems better for the moment
[01:32:03] !log Wikitech rolled back to wmf.9 due to T123583
[01:32:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:32:30] This wiki contains 67,501 property values for a total of 59 different properties. 83 properties have an own page, and the intended data type is specified for 83 of those.
[01:32:33] from https://wikitech.wikimedia.org/wiki/Special:SemanticStatistics
[01:33:12] PROBLEM - configured eth on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:33:21] (CR) Mobrovac: Add parsoid::testing role and use it on ruthenium (1 comment) [puppet] - https://gerrit.wikimedia.org/r/264024 (owner: Subramanya Sastry)
[01:35:04] andrewbogott, Reedy: Okay, so
[01:35:11] We can consider undeploying the extensions
[01:35:29] We can consider adding those functions back in wikitech.php
[01:36:06] We're not going to re-add the functions to MediaWiki core.
[01:36:37] We could put effort into patching the extensions.
[01:37:13] RECOVERY - configured eth on mw1005 is OK: OK - interfaces up
[01:37:23] PROBLEM - nutcracker process on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:37:44] (CR) Mobrovac: "This code does not install the parsoid service files nor does it start the service itself. Is that intended?" [puppet] - https://gerrit.wikimedia.org/r/264024 (owner: Subramanya Sastry)
[01:38:22] PROBLEM - SSH on mw1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:39:23] RECOVERY - nutcracker process on mw1005 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[01:40:51] (CR) Subramanya Sastry: "Right now yes .. since rt-testing starts up a parsoid service for each of the testreduce clients.
But, if we need parsoid itself running (" [puppet] - https://gerrit.wikimedia.org/r/264024 (owner: Subramanya Sastry)
[01:41:43] PROBLEM - DPKG on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:43:33] PROBLEM - configured eth on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:43:54] PROBLEM - dhclient process on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:46:02] RECOVERY - dhclient process on mw1005 is OK: PROCS OK: 0 processes with command name dhclient
[01:46:13] PROBLEM - nutcracker port on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:47:02] PROBLEM - puppet last run on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:48:14] RECOVERY - nutcracker port on mw1005 is OK: TCP OK - 0.000 second response time on port 11212
[01:49:22] PROBLEM - DPKG on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:49:42] PROBLEM - nutcracker process on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:50:22] PROBLEM - configured eth on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:50:24] PROBLEM - SSH on mw1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:50:33] PROBLEM - RAID on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:50:42] RECOVERY - SSH on mw1005 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0)
[01:51:13] RECOVERY - DPKG on mw1013 is OK: All packages OK
[01:51:32] PROBLEM - puppet last run on mw1011 is CRITICAL: CRITICAL: puppet fail
[01:51:33] RECOVERY - nutcracker process on mw1013 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[01:52:13] RECOVERY - configured eth on mw1013 is OK: OK - interfaces up
[01:52:23] RECOVERY - SSH on mw1013 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0)
[01:52:32] RECOVERY - RAID on mw1013 is OK: OK: no RAID installed
[01:56:12] RECOVERY - DPKG on mw1005 is OK: All packages OK
[01:56:53] PROBLEM - SSH on mw1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:01:33] PROBLEM - RAID on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:02:42] PROBLEM - DPKG on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:02:53] PROBLEM - configured eth on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:03:13] PROBLEM - dhclient process on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:03:24] PROBLEM - nutcracker port on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:03:43] PROBLEM - nutcracker process on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:04:33] PROBLEM - SSH on mw1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:04:43] PROBLEM - DPKG on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:05:53] PROBLEM - Disk space on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:05:53] PROBLEM - salt-minion processes on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:06:03] PROBLEM - salt-minion processes on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:08:42] PROBLEM - nutcracker process on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:09:03] PROBLEM - dhclient process on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:11:23] PROBLEM - nutcracker port on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:16:13] RECOVERY - salt-minion processes on mw1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[02:17:33] RECOVERY - nutcracker port on mw1005 is OK: TCP OK - 0.000 second response time on port 11212
[02:19:32] RECOVERY - dhclient process on mw1005 is OK: PROCS OK: 0 processes with command name dhclient
[02:25:53] (PS1) Andrew Bogott: Added icinga check for 'showmount' on tools instances. [puppet] - https://gerrit.wikimedia.org/r/264049 (https://phabricator.wikimedia.org/T123588)
[02:27:12] RECOVERY - nutcracker process on mw1005 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[02:27:12] RECOVERY - configured eth on mw1005 is OK: OK - interfaces up
[02:28:53] RECOVERY - salt-minion processes on mw1011 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[02:29:43] RECOVERY - DPKG on mw1005 is OK: All packages OK
[02:29:43] RECOVERY - DPKG on mw1011 is OK: All packages OK
[02:30:33] RECOVERY - nutcracker port on mw1011 is OK: TCP OK - 0.000 second response time on port 11212
[02:30:53] RECOVERY - Disk space on mw1011 is OK: DISK OK
[02:30:53] RECOVERY - nutcracker process on mw1011 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[02:30:57] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.9) (duration: 12m 21s)
[02:31:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:32:33] RECOVERY - SSH on mw1005 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0)
[02:34:32] (CR) Mobrovac: Puppet provider for scap3 (2 comments) [puppet] - https://gerrit.wikimedia.org/r/262742 (owner: Alexandros Kosiaris)
[02:34:42] RECOVERY - dhclient process on mw1011 is OK: PROCS OK: 0
processes with command name dhclient
[02:40:52] PROBLEM - SSH on mw1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:42:02] PROBLEM - configured eth on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:42:12] PROBLEM - DPKG on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:42:13] PROBLEM - DPKG on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:42:53] PROBLEM - dhclient process on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:43:42] PROBLEM - salt-minion processes on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:45:13] PROBLEM - nutcracker port on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:46:42] PROBLEM - dhclient process on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:49:02] PROBLEM - nutcracker port on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:49:42] PROBLEM - nutcracker process on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:49:43] PROBLEM - Disk space on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:50:54] RECOVERY - nutcracker port on mw1005 is OK: TCP OK - 0.000 second response time on port 11212
[02:52:52] RECOVERY - dhclient process on mw1005 is OK: PROCS OK: 0 processes with command name dhclient
[02:53:22] RECOVERY - SSH on mw1005 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0)
[02:53:43] PROBLEM - HHVM rendering on mw1116 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:54:12] PROBLEM - Apache HTTP on mw1116 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:55:43] PROBLEM - nutcracker process on mw1116 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:55:44] PROBLEM - Check size of conntrack table on mw1116 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:56:02] PROBLEM - salt-minion processes on mw1116 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:56:13] PROBLEM - Disk space on mw1116 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:56:32] RECOVERY - configured eth on mw1005 is OK: OK - interfaces up
[02:56:52] PROBLEM - DPKG on mw1116 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:56:53] PROBLEM - configured eth on mw1116 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:57:42] RECOVERY - nutcracker process on mw1116 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[02:57:43] RECOVERY - Check size of conntrack table on mw1116 is OK: OK: nf_conntrack is 0 % full
[02:57:54] RECOVERY - salt-minion processes on mw1116 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[02:58:13] RECOVERY - Disk space on mw1116 is OK: DISK OK
[02:58:52] RECOVERY - DPKG on mw1116 is OK: All packages OK
[02:58:53] RECOVERY - configured eth on mw1116 is OK: OK - interfaces up
[02:59:33] PROBLEM - SSH on mw1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:09:13] PROBLEM - configured eth on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:10:42] PROBLEM - salt-minion processes on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:11:43] PROBLEM - dhclient process on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:12:03] PROBLEM - nutcracker port on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:13:13] RECOVERY - configured eth on mw1005 is OK: OK - interfaces up
[03:13:33] RECOVERY - DPKG on mw1005 is OK: All packages OK
[03:14:52] RECOVERY - salt-minion processes on mw1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[03:16:12] RECOVERY - nutcracker port on mw1005 is OK: TCP OK - 0.000 second response time on port 11212
[03:19:33] PROBLEM - configured eth on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:19:43] PROBLEM - DPKG on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:21:33] PROBLEM - nutcracker process on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:23:33] RECOVERY - nutcracker process on mw1005 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[03:23:34] RECOVERY - configured eth on mw1005 is OK: OK - interfaces up
[03:26:12] RECOVERY - dhclient process on mw1005 is OK: PROCS OK: 0 processes with command name dhclient
[03:26:42] RECOVERY - SSH on mw1005 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0)
[03:28:02] RECOVERY - DPKG on mw1005 is OK: All packages OK
[03:29:04] RECOVERY - RAID on mw1005 is OK: OK: no RAID installed
[03:35:43] PROBLEM - RAID on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:36:23] PROBLEM - configured eth on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:37:23] PROBLEM - SSH on mw1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:38:23] RECOVERY - configured eth on mw1005 is OK: OK - interfaces up
[03:41:22] PROBLEM - nutcracker port on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:44:04] PROBLEM - salt-minion processes on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:44:14] PROBLEM - Disk space on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:44:42] PROBLEM - configured eth on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:44:42] PROBLEM - nutcracker process on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:44:53] PROBLEM - DPKG on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:45:12] PROBLEM - dhclient process on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:46:53] RECOVERY - DPKG on mw1005 is OK: All packages OK [03:53:12] PROBLEM - DPKG on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:56:23] RECOVERY - salt-minion processes on mw1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:56:33] RECOVERY - Disk space on mw1005 is OK: DISK OK [03:57:03] RECOVERY - nutcracker process on mw1005 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [03:57:03] RECOVERY - configured eth on mw1005 is OK: OK - interfaces up [03:57:12] RECOVERY - DPKG on mw1005 is OK: All packages OK [03:57:32] RECOVERY - dhclient process on mw1005 is OK: PROCS OK: 0 processes with command name dhclient [03:57:43] RECOVERY - nutcracker port on mw1005 is OK: TCP OK - 0.000 second response time on port 11212 [03:58:03] RECOVERY - SSH on mw1005 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [04:22:13] PROBLEM - nutcracker process on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:22:22] PROBLEM - configured eth on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:24:33] PROBLEM - DPKG on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[04:24:47] !log restart hhvm on odd-numbered appservers [04:24:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:26:13] RECOVERY - nutcracker process on mw1005 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [04:26:22] RECOVERY - configured eth on mw1005 is OK: OK - interfaces up [04:28:37] !log powercycling mw1005/mw1011 [04:28:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:29:23] RECOVERY - dhclient process on mw1011 is OK: PROCS OK: 0 processes with command name dhclient [04:29:43] RECOVERY - nutcracker port on mw1011 is OK: TCP OK - 0.000 second response time on port 11212 [04:29:53] RECOVERY - Disk space on mw1011 is OK: DISK OK [04:29:53] RECOVERY - nutcracker process on mw1011 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [04:29:53] RECOVERY - RAID on mw1011 is OK: OK: no RAID installed [04:30:04] RECOVERY - salt-minion processes on mw1011 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [04:30:42] RECOVERY - SSH on mw1011 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [04:30:52] RECOVERY - DPKG on mw1005 is OK: All packages OK [04:30:52] RECOVERY - DPKG on mw1011 is OK: All packages OK [04:31:12] RECOVERY - configured eth on mw1011 is OK: OK - interfaces up [04:32:03] RECOVERY - RAID on mw1005 is OK: OK: no RAID installed [04:32:43] RECOVERY - puppet last run on mw1011 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [04:42:03] RECOVERY - HHVM rendering on mw1116 is OK: HTTP OK: HTTP/1.1 200 OK - 67356 bytes in 0.578 second response time [04:42:23] RECOVERY - Apache HTTP on mw1116 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.087 second response time [04:42:33] RECOVERY - puppet last run on mw1116 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [05:19:22] (03PS1) 10Legoktm: Hide 
.error from extracts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264051 [05:22:47] (03CR) 10Legoktm: [C: 04-1] "Hmm, but this is in the extension defaults..." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264051 (owner: 10Legoktm) [05:25:16] (03Abandoned) 10Legoktm: Hide .error from extracts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264051 (owner: 10Legoktm) [05:28:41] 7Blocked-on-Operations, 10Dumps-Generation, 10Flow, 3Collaboration-Team-Current: Publish recurring Flow dumps at http://dumps.wikimedia.org/ - https://phabricator.wikimedia.org/T119511#1933838 (10Mattflaschen) [05:49:02] PROBLEM - puppet last run on mw1122 is CRITICAL: CRITICAL: Puppet has 74 failures [06:31:03] PROBLEM - puppet last run on db1056 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:23] PROBLEM - puppet last run on chromium is CRITICAL: CRITICAL: Puppet has 3 failures [06:31:23] PROBLEM - puppet last run on mw2043 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:44] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 2 failures [06:31:52] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:53] PROBLEM - puppet last run on mw2021 is CRITICAL: CRITICAL: Puppet has 2 failures [06:32:02] PROBLEM - puppet last run on mw2081 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:02] PROBLEM - puppet last run on mw1086 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:03] PROBLEM - puppet last run on mc2015 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:12] PROBLEM - puppet last run on cp1068 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:23] PROBLEM - puppet last run on db2055 is CRITICAL: CRITICAL: Puppet has 2 failures [06:32:32] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:44] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:02] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures 
[06:33:12] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 1 failures [06:35:05] just kidding [06:42:28] <_joe_> lol [06:42:32] <_joe_> good morning icinga [06:45:43] RECOVERY - puppet last run on mw1122 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:49:41] * YuviPanda waves at _joe_ and ori [06:49:47] 6operations, 6Performance-Team, 10Wikimedia-General-or-Unknown, 5Patch-For-Review, and 3 others: jobrunner memory leaks - https://phabricator.wikimedia.org/T122069#1933862 (10ori) This leak was [[ https://github.com/facebook/hhvm/issues/3899 | reported ]] and [[ https://reviews.facebook.net/D35439 | fixed... [06:49:53] I wonder if the nightly spam will get fixed in 2016 [06:50:00] 'year of the nightly spam fix!' [06:51:00] (03Abandoned) 10GWicke: Varnish: Don't disable caching for authenticated REST API requests [puppet] - 10https://gerrit.wikimedia.org/r/261662 (https://phabricator.wikimedia.org/T122673) (owner: 10GWicke) [06:55:58] _joe_: any particular reason why you used augeas for apache::logrotate? [06:56:02] RECOVERY - puppet last run on db1056 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [06:56:22] RECOVERY - puppet last run on chromium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:56:23] RECOVERY - puppet last run on mw2043 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [06:56:23] <_joe_> paravoid: uh, I don't remember at all [06:56:40] <_joe_> paravoid: probably I wanted to change a couple of lines only? 
[06:56:43] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [06:56:53] RECOVERY - puppet last run on mw2021 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [06:56:53] RECOVERY - puppet last run on mw1086 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:56:53] RECOVERY - puppet last run on mw2081 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [06:57:02] RECOVERY - puppet last run on mc2015 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [06:57:03] RECOVERY - puppet last run on cp1068 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:22] RECOVERY - puppet last run on db2055 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:32] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:43] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [06:57:54] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [06:58:12] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [06:58:45] told you i was just kidding [06:58:53] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:59] <_joe_> paravoid: ok, I think I used augeas as opposed to file_line because we already used it elsewhere for logrotate, and it's also more efficient for multiple lines [07:01:35] root@palladium:~# cat /etc/logrotate.d/apache2 [07:01:35] ##################################################################### [07:01:38] ### THIS FILE IS MANAGED BY PUPPET [07:01:38] that doesn't exist anymore [07:01:40] ### 
puppet:///modules/puppetmaster/logrotate-passenger [07:01:43] ##################################################################### [07:01:51] since b8a7205d67fad98424e3cf625d7afd4386e6a32f [07:02:19] i thought you were speaking damon for a second [07:08:46] <_joe_> paravoid: that's palladium, which probably didn't use apache2::logrotate? [07:09:05] ori: lol [07:09:15] _joe_: see that commit [07:11:38] <_joe_> paravoid: uhm, right. that script is identical (if not for the header) to what is shipped by ubuntu [07:13:15] <_joe_> paravoid: so I actually removed a full copy-paste since we just needed to tweak the logrotate periodicity and nothing else [07:28:55] (03CR) 10Giuseppe Lavagetto: "The general idea seems correct to me, I still need to check the details though." (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197499 (https://phabricator.wikimedia.org/T91754) (owner: 10Giuseppe Lavagetto) [07:35:18] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "On hold until the cache_mobile migration is done." 
[puppet] - 10https://gerrit.wikimedia.org/r/263847 (owner: 10Giuseppe Lavagetto) [07:36:32] PROBLEM - puppet last run on mw1212 is CRITICAL: CRITICAL: Puppet has 1 failures [07:56:47] (03PS1) 10Andrew Bogott: Add openldap to labtestservices2001 [puppet] - 10https://gerrit.wikimedia.org/r/264053 [08:01:33] RECOVERY - puppet last run on mw1212 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [08:21:36] (03PS1) 10Giuseppe Lavagetto: jobrunner: contain gwt jobs to run on two specific hosts [puppet] - 10https://gerrit.wikimedia.org/r/264055 (https://phabricator.wikimedia.org/T122069) [08:21:50] <_joe_> ori, if you're still around ^^ [08:28:02] PROBLEM - puppet last run on mw2091 is CRITICAL: CRITICAL: puppet fail [08:28:13] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [08:28:13] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [08:29:23] PROBLEM - puppet last run on cp3039 is CRITICAL: CRITICAL: Puppet has 3 failures [08:29:23] PROBLEM - puppet last run on cp3019 is CRITICAL: CRITICAL: Puppet has 3 failures [08:29:23] PROBLEM - puppet last run on cp3012 is CRITICAL: CRITICAL: Puppet has 3 failures [08:32:33] PROBLEM - Mobile HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [08:33:56] <_joe_> uh? [08:34:33] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [08:34:33] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [08:34:34] RECOVERY - Mobile HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [08:36:33] 6operations, 10Wikimedia-SVG-rendering: Install Noto CJK (Source Han Sans) font family for SVG rendering - https://phabricator.wikimedia.org/T123223#1933955 (10PhiLiP) OK... 
If there's anything I can do to help get this done? I mean I can even help to create pull-request to puppet.git. [08:54:23] RECOVERY - puppet last run on cp3039 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [08:54:23] RECOVERY - puppet last run on cp3019 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [08:54:23] RECOVERY - puppet last run on cp3012 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [08:55:12] RECOVERY - puppet last run on mw2091 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:00:11] (03PS2) 10Giuseppe Lavagetto: jobrunner: contain gwt jobs to run on two specific hosts [puppet] - 10https://gerrit.wikimedia.org/r/264055 (https://phabricator.wikimedia.org/T122069) [09:00:49] (03PS2) 10Alexandros Kosiaris: Trivial comment fix for pep8 [puppet] - 10https://gerrit.wikimedia.org/r/263837 (owner: 10Chad) [09:17:13] PROBLEM - puppet last run on mw2109 is CRITICAL: CRITICAL: puppet fail [09:19:57] (03CR) 10Alexandros Kosiaris: [C: 032] Trivial comment fix for pep8 [puppet] - 10https://gerrit.wikimedia.org/r/263837 (owner: 10Chad) [09:21:30] (03PS2) 10Alexandros Kosiaris: Incorrect syntax in RCStreamCollector.collect [puppet] - 10https://gerrit.wikimedia.org/r/263848 (owner: 10John Vandenberg) [09:25:45] (03CR) 10Alexandros Kosiaris: [C: 032] Incorrect syntax in RCStreamCollector.collect [puppet] - 10https://gerrit.wikimedia.org/r/263848 (owner: 10John Vandenberg) [09:31:07] (03PS1) 10Alexandros Kosiaris: Add system::role to piwik role [puppet] - 10https://gerrit.wikimedia.org/r/264058 [09:38:00] (03CR) 10Florianschmidtwelzow: [C: 031] "I could cry! 
:(" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264023 (owner: 10Reedy) [09:44:23] RECOVERY - puppet last run on mw2109 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [09:46:24] (03PS2) 10Alexandros Kosiaris: Revert "Revert "Add the LVS blocks to url_downloader"" [puppet] - 10https://gerrit.wikimedia.org/r/207490 [09:46:26] (03PS1) 10Alexandros Kosiaris: Make role::url_downloader unparameterized [puppet] - 10https://gerrit.wikimedia.org/r/264059 [09:48:36] (03CR) 10Filippo Giunchedi: [C: 04-1] "awk nit, lgtm otherwise" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/264055 (https://phabricator.wikimedia.org/T122069) (owner: 10Giuseppe Lavagetto) [09:49:57] <_joe_> godog: d'oh, you're right [09:50:55] hehe gawk is great [09:52:13] (03CR) 10Ema: [C: 032 V: 032] puppet-merge: auto-run conftool-merge [puppet] - 10https://gerrit.wikimedia.org/r/263821 (owner: 10Giuseppe Lavagetto) [09:52:38] (03PS2) 10Ema: puppet-merge: auto-run conftool-merge [puppet] - 10https://gerrit.wikimedia.org/r/263821 (owner: 10Giuseppe Lavagetto) [09:53:13] <_joe_> yeah, the problem is that, as usual, when you're modifying some code, you try to reduce changes to swapping out parts [09:53:24] (03CR) 10Ema: [V: 032] puppet-merge: auto-run conftool-merge [puppet] - 10https://gerrit.wikimedia.org/r/263821 (owner: 10Giuseppe Lavagetto) [09:56:24] 6operations, 10Gerrit, 10GitHub-Mirrors, 10ValueView, 10Wikidata: [Task] Redirect unused extensions/ValueView repository to data-values/value-view - https://phabricator.wikimedia.org/T123624#1934050 (10thiemowmde) 3NEW [09:56:31] (03PS3) 10Giuseppe Lavagetto: jobrunner: contain gwt jobs to run on two specific hosts [puppet] - 10https://gerrit.wikimedia.org/r/264055 (https://phabricator.wikimedia.org/T122069) [09:56:38] 6operations, 10Gerrit, 10GitHub-Mirrors, 10ValueView, 10Wikidata: [Task] Redirect unused extensions/ValueView repository to data-values/value-view - 
https://phabricator.wikimedia.org/T123624#1934059 (10thiemowmde) [10:01:02] PROBLEM - salt-minion processes on cygnus is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [10:02:34] (03CR) 10Filippo Giunchedi: [C: 031] jobrunner: contain gwt jobs to run on two specific hosts [puppet] - 10https://gerrit.wikimedia.org/r/264055 (https://phabricator.wikimedia.org/T122069) (owner: 10Giuseppe Lavagetto) [10:04:49] paravoid: I took a quick look at the meeting notes but couldn't find the numbers you ran re: precise systems, in case https://phabricator.wikimedia.org/T123525 [10:07:14] godog: I did these before the goals setting meeting in December and made a Phab paste of it, let me search for it [10:07:22] RECOVERY - salt-minion processes on cygnus is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [10:07:34] godog: https://phabricator.wikimedia.org/P2366 [10:08:17] ah thanks moritzm ! I seemed to remember sth like that [10:08:52] it's a little less currently, bblack updated lvs100[1-3] to jessie already [10:10:15] (03PS4) 10Giuseppe Lavagetto: jobrunner: contain gwt jobs to run on two specific hosts [puppet] - 10https://gerrit.wikimedia.org/r/264055 (https://phabricator.wikimedia.org/T122069) [10:10:29] (03CR) 10Giuseppe Lavagetto: [C: 032] jobrunner: contain gwt jobs to run on two specific hosts [puppet] - 10https://gerrit.wikimedia.org/r/264055 (https://phabricator.wikimedia.org/T122069) (owner: 10Giuseppe Lavagetto) [10:11:32] <_joe_> come on jenkins... [10:13:33] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [10:14:15] <_joe_> oh, ema ^^ [10:14:27] <_joe_> last puppet-merge failed on strontium for $reason [10:14:32] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet). 
[10:14:49] <_joe_> uh, no, you just forgot to do it :P [10:15:42] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [10:15:54] <_joe_> I did it for you [10:16:33] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [10:18:34] <_joe_> grr puppet and friggin absented resources :/ [10:19:37] 6operations: reduce amount of remaining Ubuntu 12.04 systems - https://phabricator.wikimedia.org/T123525#1934083 (10fgiunchedi) salt query for precise systems, see also https://phabricator.wikimedia.org/P2366 for a list compiled in december ```lines=5 # salt --out=txt -C 'G@lsb_distrib_codename:precise' test.pi... [10:29:00] !log installed DHCP security updates on carbon [10:29:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:35:08] <_joe_> heh /win 31 [10:38:01] <_joe_> !log restarting hhvm on odd-numbered jobrunners [10:38:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:38:13] 7Blocked-on-Operations, 6operations, 10RESTBase, 10procurement: Expand SSD space in Cassandra cluster - https://phabricator.wikimedia.org/T121575#1934089 (10mark) a:3RobH @RobH: Alright, let's get some quotes for these SSDs. I assume we do have drive slots available? [10:39:39] 6operations, 10Salt: Move salt master to separate host from puppet master - https://phabricator.wikimedia.org/T115287#1934096 (10MoritzMuehlenhoff) It's working well for me, I ran salt commands for the entire cluster using host-based '*' matching (both using batches of 200 hosts and w/o batching) and it worked... [10:44:09] 6operations, 10Salt: Move salt master to separate host from puppet master - https://phabricator.wikimedia.org/T115287#1934108 (10Joe) I can report the same; just ran some queries that have long been unreliable but it's fast and apparently really, really reliable. Didn't have any problems in the last two days... 
[10:50:37] (03PS3) 10Reedy: Remove www.([a-z-]+) rewrites [puppet] - 10https://gerrit.wikimedia.org/r/256441 (https://phabricator.wikimedia.org/T120143) [10:54:22] (03PS1) 10Giuseppe Lavagetto: jobrunners: actually absent the hhvm restarting cron where not needed [puppet] - 10https://gerrit.wikimedia.org/r/264063 [10:54:31] (03CR) 10Reedy: "Actually removing my -1. https://www.m.wikipedia.org/ doesn't redirect if you visit it" [puppet] - 10https://gerrit.wikimedia.org/r/256441 (https://phabricator.wikimedia.org/T120143) (owner: 10Reedy) [10:54:56] (03PS2) 10Reedy: Fix apple-touch-icon.png on wikipedias [puppet] - 10https://gerrit.wikimedia.org/r/256437 (https://phabricator.wikimedia.org/T115965) [10:55:05] (03PS2) 10Reedy: Add apple-touch-icon.png to Wikidata, Wikinews and Wiktionary [puppet] - 10https://gerrit.wikimedia.org/r/256440 [10:59:39] (03CR) 10Giuseppe Lavagetto: [C: 032] jobrunners: actually absent the hhvm restarting cron where not needed [puppet] - 10https://gerrit.wikimedia.org/r/264063 (owner: 10Giuseppe Lavagetto) [11:18:46] !log upgrade graphite-carbon / graphite-web on labmon1001 [11:18:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:24:32] PROBLEM - DPKG on labmon1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:25:17] that's me, taking a while to upgrade [11:26:33] RECOVERY - DPKG on labmon1001 is OK: All packages OK [11:28:17] !log bounce uwsgi on labmon1001 [11:28:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:34:54] Hi! I am located in Munich Germany. 
https://www.wikidata.org/wiki/Special:RecentChangesLinked/Wikidata:Database_reports/WMF_projects does not load [11:35:44] neither https://www.wikidata.org/wiki/Special:Watchlist?days=0&namespace=3&action=submit [11:36:38] there is a phabricator report opened two days ago I am using the account [[d:user:I18n]] [11:40:22] PROBLEM - puppet last run on mw2137 is CRITICAL: CRITICAL: puppet fail [11:49:52] (03PS1) 10Pmlineditor: Add namespace aliases for English Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264066 (https://phabricator.wikimedia.org/T123187) [11:53:24] (03PS2) 10Muehlenhoff: Ensure unique uidNumbers with slapo-overlay [puppet] - 10https://gerrit.wikimedia.org/r/263363 (https://phabricator.wikimedia.org/T122665) [11:55:23] (03CR) 10Muehlenhoff: [C: 032 V: 032] Ensure unique uidNumbers with slapo-overlay [puppet] - 10https://gerrit.wikimedia.org/r/263363 (https://phabricator.wikimedia.org/T122665) (owner: 10Muehlenhoff) [11:55:48] (03CR) 10Lydia Pintscher: [C: 031] "I think the setting is not general enough to be a default setting for everyone." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/263046 (https://phabricator.wikimedia.org/T123112) (owner: 10Thiemo Mättig (WMDE)) [12:07:44] RECOVERY - puppet last run on mw2137 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:19:41] 6operations: Upgrade aqs* to nodejs 4.2 - https://phabricator.wikimedia.org/T123629#1934216 (10MoritzMuehlenhoff) 3NEW [13:22:03] (03CR) 10Aude: "@lydia @thiemo imho, having the default makes it easier to setup wikibase and have it work somewhat like wikidata. 
(and also provides an e" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/263046 (https://phabricator.wikimedia.org/T123112) (owner: 10Thiemo Mättig (WMDE)) [13:24:50] 6operations, 10Wikimedia-SVG-rendering: Install Noto CJK (Source Han Sans) font family for SVG rendering - https://phabricator.wikimedia.org/T123223#1934228 (10Aklapper) Help welcome: https://www.mediawiki.org/wiki/Gerrit/Tutorial :) [13:52:12] PROBLEM - citoid endpoints health on sca1001 is CRITICAL: /api (bad PMCID) is CRITICAL: Could not fetch url http://10.64.32.153:1970/api: Timeout on connection while downloading http://10.64.32.153:1970/api [13:54:12] RECOVERY - citoid endpoints health on sca1001 is OK: All endpoints are healthy [14:06:21] (03CR) 10Alex Monk: "Is this something that can go in swat?" [puppet] - 10https://gerrit.wikimedia.org/r/244237 (https://phabricator.wikimedia.org/T86644) (owner: 10Alex Monk) [14:06:54] PROBLEM - puppet last run on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:09:07] (03PS1) 10Muehlenhoff: Add salt grains for piwik role [puppet] - 10https://gerrit.wikimedia.org/r/264075 [14:09:09] (03PS1) 10Muehlenhoff: Add salt grains for openldap::labs and extend openldap server group for debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/264076 [14:09:42] PROBLEM - SSH on mw1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:09:43] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 58.33% of data above the critical threshold [5000000.0] [14:09:52] PROBLEM - configured eth on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:09:54] PROBLEM - nutcracker port on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:10:13] PROBLEM - RAID on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:10:13] PROBLEM - dhclient process on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[14:10:43] PROBLEM - Disk space on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:10:44] PROBLEM - salt-minion processes on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:11:12] PROBLEM - DPKG on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:11:23] PROBLEM - nutcracker process on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:17:32] PROBLEM - citoid endpoints health on sca1001 is CRITICAL: /api (bad PMCID) is CRITICAL: Could not fetch url http://10.64.32.153:1970/api: Timeout on connection while downloading http://10.64.32.153:1970/api [14:19:33] RECOVERY - citoid endpoints health on sca1001 is OK: All endpoints are healthy [14:21:33] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add salt grains for piwik role [puppet] - 10https://gerrit.wikimedia.org/r/264075 (owner: 10Muehlenhoff) [14:22:22] RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK: OK: Less than 50.00% above the threshold [1000000.0] [14:24:42] 6operations, 10Wikimedia-Site-Requests, 5Patch-For-Review: Create the wikimania2017 wiki - https://phabricator.wikimedia.org/T122062#1934268 (10Krenair) [14:24:44] 6operations, 10Parsoid, 10Wikimedia-Site-Requests: please deploy parsoid sitematrix update - https://phabricator.wikimedia.org/T122548#1934265 (10Krenair) 5Open>3Resolved a:3Krenair I think this got fixed at some point, VE loads on wikimania2017wiki now. [14:32:25] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add salt grains for openldap::labs and extend openldap server group for debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/264076 (owner: 10Muehlenhoff) [14:34:23] 6operations: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#1934288 (10fgiunchedi) [14:37:05] 6operations, 7Mail: consolidate mailman redirects in exim aliases file - https://phabricator.wikimedia.org/T123581#1934295 (10chasemp) no, not really, it was a convenience at the time. 
[14:37:13] 6operations, 5Patch-For-Review, 7Swift: swift upgrade plans - https://phabricator.wikimedia.org/T117972#1934297 (10fgiunchedi) yet another option proposed would be to stick with trusty but upgrade swift using openstack liberty packages from https://wiki.ubuntu.com/ServerTeam/CloudArchive this includes upgradi... [14:41:37] (03CR) 10Filippo Giunchedi: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/263840 (owner: 10Filippo Giunchedi) [14:44:38] <_joe_> !powercycling mw1013, console stuck [14:44:44] <_joe_> !log powercycling mw1013, console stuck [14:44:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:46:54] RECOVERY - DPKG on mw1013 is OK: All packages OK [14:47:12] RECOVERY - nutcracker process on mw1013 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [14:47:34] RECOVERY - SSH on mw1013 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [14:47:34] RECOVERY - configured eth on mw1013 is OK: OK - interfaces up [14:47:52] RECOVERY - nutcracker port on mw1013 is OK: TCP OK - 0.000 second response time on port 11212 [14:48:12] RECOVERY - RAID on mw1013 is OK: OK: no RAID installed [14:48:12] RECOVERY - dhclient process on mw1013 is OK: PROCS OK: 0 processes with command name dhclient [14:48:33] RECOVERY - Disk space on mw1013 is OK: DISK OK [14:48:42] RECOVERY - salt-minion processes on mw1013 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [14:50:24] (03CR) 10Giuseppe Lavagetto: [C: 032] Re-think the alerts instrumentation [debs/pybal] - 10https://gerrit.wikimedia.org/r/261183 (owner: 10Giuseppe Lavagetto) [14:50:28] (03PS2) 10Giuseppe Lavagetto: Re-think the alerts instrumentation [debs/pybal] - 10https://gerrit.wikimedia.org/r/261183 [15:00:01] (03Abandoned) 10Giuseppe Lavagetto: Add a warning endpoint to catch misconfigurations [debs/pybal] - 10https://gerrit.wikimedia.org/r/257324 (owner: 10Giuseppe Lavagetto) [15:05:43] (03PS2) 
10Giuseppe Lavagetto: Add warning for badly configured pools. [debs/pybal] - 10https://gerrit.wikimedia.org/r/261193 [15:06:44] (03CR) 10Giuseppe Lavagetto: [C: 032] Add warning for badly configured pools. [debs/pybal] - 10https://gerrit.wikimedia.org/r/261193 (owner: 10Giuseppe Lavagetto) [15:08:00] (03Merged) 10jenkins-bot: Add warning for badly configured pools. [debs/pybal] - 10https://gerrit.wikimedia.org/r/261193 (owner: 10Giuseppe Lavagetto) [15:15:06] (03PS1) 10Cmjohnson: Adding dhcp entries for pc1004-6 [puppet] - 10https://gerrit.wikimedia.org/r/264080 [15:15:48] (03CR) 10Cmjohnson: [C: 032] Adding dhcp entries for pc1004-6 [puppet] - 10https://gerrit.wikimedia.org/r/264080 (owner: 10Cmjohnson) [15:21:45] (03CR) 10MZMcBride: "Associated Phabricator Maniphest task?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264023 (owner: 10Reedy) [15:22:59] (03PS2) 10Reedy: Add lots of wfMsg*() for wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264023 (https://phabricator.wikimedia.org/T123583) [15:43:41] 6operations, 10hardware-requests: Upgrade restbase100[7-9] to match restbase100[1-6] hardware - https://phabricator.wikimedia.org/T119935#1934435 (10GWicke) Lets coordinate with the CPU / memory upgrades in T121255 to avoid multiple downtimes. I'd propose the following procedure: 1) Shut down the cassandra n... [15:46:53] Luke081515, around? 
[15:47:07] jouncebot, next [15:47:07] In 0 hour(s) and 12 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160114T1600) [15:47:47] Krenair: yep [15:48:09] !log installed DHCP security updates across the fleet [15:48:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:49:00] Luke081515, so we have 1 patch from you to go in swat [15:49:14] but there's also loads of open site requests from the freeze: [15:50:03] https://phabricator.wikimedia.org/T122045 https://phabricator.wikimedia.org/T121985 https://phabricator.wikimedia.org/T119816 https://phabricator.wikimedia.org/T121524 https://phabricator.wikimedia.org/T122175 https://phabricator.wikimedia.org/T122441 https://phabricator.wikimedia.org/T123084 [15:52:42] legoktm, what about https://gerrit.wikimedia.org/r/#/c/237686/ ? [15:58:54] does anyone know why s1-analytics-slave.eqiad.wmnet is in --read-only mode? [15:59:30] (same for all other s*-analytics-slave I tried) [16:00:04] anomie ostriches thcipriani marktraceur: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160114T1600). [16:00:04] Luke081515: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. 
[16:00:14] I'm here :) [16:00:23] I can do this [16:00:41] uh, interesting [16:00:47] I appear to be missing from the list [16:00:47] but ok [16:02:03] Luke081515: Needs a rebase [16:02:06] (manual) [16:02:44] ok [16:04:45] (03PS1) 10Alexandros Kosiaris: Remove otrs-test.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/264086 [16:09:56] (03PS2) 10Luke081515: Add exceptions for eswiki, eswikivoyage at 2016-01-19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/263414 (https://phabricator.wikimedia.org/T123261) [16:10:01] (03PS1) 10Alexandros Kosiaris: misc-web: Remove otrs-test routing rules [puppet] - 10https://gerrit.wikimedia.org/r/264087 [16:10:11] (03CR) 10jenkins-bot: [V: 04-1] Add exceptions for eswiki, eswikivoyage at 2016-01-19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/263414 (https://phabricator.wikimedia.org/T123261) (owner: 10Luke081515) [16:10:49] ok, I will try it again... [16:14:00] krenair, ostriches: If there are other patches, you can continue with them first, I will need a few minutes... [16:14:27] Nope, you're the only one this morning [16:14:30] Pressure's on! [16:14:30] :p [16:17:11] I hope the verify bot is on my side this time^^ [16:17:26] (03PS3) 10Luke081515: Add exceptions for eswiki, eswikivoyage at 2016-01-19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/263414 (https://phabricator.wikimedia.org/T123261) [16:18:00] ostriches: Verified +2, you can move on :) [16:18:10] oh, moment [16:18:23] my fault, two spaces there... 
[16:18:34] (03CR) 10Chad: [C: 032] Add exceptions for eswiki, eswikivoyage at 2016-01-19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/263414 (https://phabricator.wikimedia.org/T123261) (owner: 10Luke081515) [16:18:55] (03PS4) 10Luke081515: Add exceptions for eswiki, eswikivoyage at 2016-01-19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/263414 (https://phabricator.wikimedia.org/T123261) [16:19:02] (03PS1) 10Giuseppe Lavagetto: instrumentation: fixup for Ib0b3c139a [debs/pybal] - 10https://gerrit.wikimedia.org/r/264088 [16:20:21] ostriches: I guess you have to set +2 again :-/ [16:20:37] (03CR) 10Chad: [C: 032] Add exceptions for eswiki, eswikivoyage at 2016-01-19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/263414 (https://phabricator.wikimedia.org/T123261) (owner: 10Luke081515) [16:20:44] * ostriches mashes his +2 button [16:20:59] ^^ [16:20:59] (03Merged) 10jenkins-bot: Add exceptions for eswiki, eswikivoyage at 2016-01-19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/263414 (https://phabricator.wikimedia.org/T123261) (owner: 10Luke081515) [16:21:43] Bahhh, who has uncommitted changes to wikiversions.json? [16:22:19] * ostriches looks suspiciously at Krenair [16:23:09] (03PS1) 10Ottomata: Disable public access to YARN ResourceManager HTTP UI [puppet] - 10https://gerrit.wikimedia.org/r/264090 [16:23:33] ostriches: git diff [16:23:34] oh wait [16:23:36] it's minified [16:24:00] I stashed it. 
[16:24:20] (03CR) 10Ottomata: [C: 032] Disable public access to YARN ResourceManager HTTP UI [puppet] - 10https://gerrit.wikimedia.org/r/264090 (owner: 10Ottomata) [16:24:57] ostriches, I made a change to roll wikitech back to wmf.9 [16:24:59] temporarily [16:25:04] !log demon@tin Synchronized wmf-config/throttle.php: (no message) (duration: 00m 49s) [16:25:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:25:32] Krenair: I had to stash it (stash@{4}), please either commit or something :) [16:25:41] Luke081515: ^ throttle.php sync'd [16:26:09] thanks :) [16:26:21] Krenair: Yeah, commit it, even it's temporary [16:27:58] (03CR) 10Giuseppe Lavagetto: [C: 031] Fix apple-touch-icon.png on wikipedias [puppet] - 10https://gerrit.wikimedia.org/r/256437 (https://phabricator.wikimedia.org/T115965) (owner: 10Reedy) [16:29:14] (03CR) 10Giuseppe Lavagetto: [C: 031] Add apple-touch-icon.png to Wikidata, Wikinews and Wiktionary [puppet] - 10https://gerrit.wikimedia.org/r/256440 (owner: 10Reedy) [16:30:36] 6operations, 7Mail: consolidate mailman redirects in exim aliases file - https://phabricator.wikimedia.org/T123581#1934528 (10faidon) I very much doubt we need the rest either: - I've never heard of sec-ops - I don't think anyone has used ops@wikimedia.org, ops-private@wikimedia.org or engineering@wikimedia.or... [16:32:00] (03CR) 10Giuseppe Lavagetto: [C: 031] "Seems sensible given there is in fact no dns setup" [puppet] - 10https://gerrit.wikimedia.org/r/256441 (https://phabricator.wikimedia.org/T120143) (owner: 10Reedy) [16:35:17] (03CR) 10Giuseppe Lavagetto: "I have not enough confidence with these scripts to merge this without someone else doing a code review." 
[puppet] - 10https://gerrit.wikimedia.org/r/263419 (https://phabricator.wikimedia.org/T123217) (owner: 10Alex Monk) [16:36:00] (03PS1) 10IoannisKydonis: Update Wikimedia's configuration after renaming $wgNetworkPerformanceSamplingFactor to $wgMediaViewerNetworkPerformanceSamplingFactor. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264091 [16:36:13] (03CR) 10Giuseppe Lavagetto: [C: 031] Gerrit: Remove old gitweb redirects, broken [puppet] - 10https://gerrit.wikimedia.org/r/263927 (owner: 10Chad) [16:36:34] <_joe_> Krenair: can you get someone with more knowledge of scap to review your change? [16:36:44] <_joe_> more than me I mean [16:36:47] _joe_, this isn't really part of scap itself... [16:36:55] not 100% sure why it's in the scap module [16:37:01] <_joe_> yeah, still I never used it [16:37:12] <_joe_> more than as a dull user from time to time [16:37:33] Krenair: link? I can take a look to help _joe_ out [16:37:39] who has root access and knowledge of scap? [16:37:58] Krenair: that would be ori mostly [16:38:14] bd808, it's https://gerrit.wikimedia.org/r/263419 [16:38:23] <_joe_> Krenair: I know how scap works, another thing is being certain a change won't break something subtle [16:41:53] Hm? [16:42:05] Now /everyone's/ looking :p [16:42:08] (03CR) 10Alexandros Kosiaris: [C: 032] Remove otrs-test.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/264086 (owner: 10Alexandros Kosiaris) [16:42:09] <_joe_> lol [16:42:39] (03CR) 10Alexandros Kosiaris: [C: 032] misc-web: Remove otrs-test routing rules [puppet] - 10https://gerrit.wikimedia.org/r/264087 (owner: 10Alexandros Kosiaris) [16:42:45] (03PS2) 10Alexandros Kosiaris: misc-web: Remove otrs-test routing rules [puppet] - 10https://gerrit.wikimedia.org/r/264087 [16:42:57] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] misc-web: Remove otrs-test routing rules [puppet] - 10https://gerrit.wikimedia.org/r/264087 (owner: 10Alexandros Kosiaris) [16:43:13] (03CR) 10Chad: [C: 031] "This is fine. 
Despite being in the scap module, most of the scripts here aren't scap related at all--see the angry comments in the manifes" [puppet] - 10https://gerrit.wikimedia.org/r/263419 (https://phabricator.wikimedia.org/T123217) (owner: 10Alex Monk) [16:43:41] (03CR) 10BryanDavis: [C: 031] "tested by copying the updated sql script to my homedir on terbium and connecting to a few wikis." [puppet] - 10https://gerrit.wikimedia.org/r/263419 (https://phabricator.wikimedia.org/T123217) (owner: 10Alex Monk) [16:43:45] (03CR) 10Alexandros Kosiaris: [C: 032] Add system::role to piwik role [puppet] - 10https://gerrit.wikimedia.org/r/264058 (owner: 10Alexandros Kosiaris) [16:43:51] (03PS2) 10Alexandros Kosiaris: Add system::role to piwik role [puppet] - 10https://gerrit.wikimedia.org/r/264058 [16:43:56] _joe_: `sql` is just a utility on tin/terbium for deployers to connect to prod dbs. [16:43:56] (03CR) 10Alexandros Kosiaris: [V: 032] Add system::role to piwik role [puppet] - 10https://gerrit.wikimedia.org/r/264058 (owner: 10Alexandros Kosiaris) [16:44:04] Nothing really *depends* on it [16:44:12] <_joe_> ostriches: yeah I'm reading it, BRRR [16:47:29] ostriches, also non-deployers [16:47:32] 'restricted' group [16:52:49] hi ebernhardson [16:54:33] Krenair: Eh, peeps with shell access. Main point was it's not a scap thingie but a human-usable util :) [16:56:38] 6operations, 6Performance-Team, 10Wikimedia-General-or-Unknown, 5Patch-For-Review, and 3 others: jobrunner memory leaks - https://phabricator.wikimedia.org/T122069#1934557 (10ori) p:5Unbreak!>3Normal [16:56:53] <_joe_> ori: heh, right, I forgot to do that :) [16:57:09] <_joe_> but I did what I promised, for once :) [16:58:03] _joe_: yes, looks great -- thank you! [16:58:47] hey ori, would you mind taking a look at https://gerrit.wikimedia.org/r/263419 ? 
[16:59:50] (03PS3) 10Ori.livneh: scap: Add wikishared (x1) support to sql command [puppet] - 10https://gerrit.wikimedia.org/r/263419 (https://phabricator.wikimedia.org/T123217) (owner: 10Alex Monk) [17:00:00] (03CR) 10Ori.livneh: [C: 032 V: 032] scap: Add wikishared (x1) support to sql command [puppet] - 10https://gerrit.wikimedia.org/r/263419 (https://phabricator.wikimedia.org/T123217) (owner: 10Alex Monk) [17:00:04] moritzm mutante: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160114T1700). Please do the needful. [17:00:04] Reedy Krenair ostriches: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [17:00:10] thanks [17:00:11] whee [17:01:02] should i force a puppet run on tin? [17:01:15] * ori does [17:01:18] that would be helpful to test it properly [17:02:01] <_joe_> ori: thanks :P [17:02:17] <_joe_> Reedy: I am preparing the tests for your patches, hold on 5 mins [17:02:26] Thanks! :) [17:03:26] 7Blocked-on-Operations, 6operations, 10RESTBase, 10procurement: Expand SSD space in Cassandra cluster - https://phabricator.wikimedia.org/T121575#1934564 (10mark) @RobH: if we could get these SSDs quickly (perhaps), we might be able to save a lot of time on these migrations. Could you prioritize this ticke... 
[17:04:24] looks good [17:04:35] cool [17:04:40] yeah, ran puppet on tin and terbium both [17:04:45] thanks ori [17:04:53] * ori bbl [17:07:14] (03PS3) 10Giuseppe Lavagetto: Fix apple-touch-icon.png on wikipedias [puppet] - 10https://gerrit.wikimedia.org/r/256437 (https://phabricator.wikimedia.org/T115965) (owner: 10Reedy) [17:07:26] (03CR) 10Giuseppe Lavagetto: [C: 032] Fix apple-touch-icon.png on wikipedias [puppet] - 10https://gerrit.wikimedia.org/r/256437 (https://phabricator.wikimedia.org/T115965) (owner: 10Reedy) [17:08:06] (03PS3) 10Giuseppe Lavagetto: Gerrit: Remove old gitweb redirects, broken [puppet] - 10https://gerrit.wikimedia.org/r/263927 (owner: 10Chad) [17:08:21] (03CR) 10Giuseppe Lavagetto: [V: 032] Fix apple-touch-icon.png on wikipedias [puppet] - 10https://gerrit.wikimedia.org/r/256437 (https://phabricator.wikimedia.org/T115965) (owner: 10Reedy) [17:09:17] (03PS3) 10Giuseppe Lavagetto: Add apple-touch-icon.png to Wikidata, Wikinews and Wiktionary [puppet] - 10https://gerrit.wikimedia.org/r/256440 (owner: 10Reedy) [17:09:29] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/256440 (owner: 10Reedy) [17:15:09] <_joe_> Reedy: https://en.wikidata.org/apple-touch-icon.png still gives 404 [17:15:13] <_joe_> but for the rest, lgtm [17:15:31] _joe_: en.wikidata? 
:P [17:15:40] https://www.wikidata.org/apple-touch-icon.png [17:15:55] <_joe_> err yes shit [17:16:01] <_joe_> c=p fail [17:16:10] 'wikidata' => '/static/apple-touch/wikidata.png', // T72996 [17:16:13] They've one defined [17:16:40] <_joe_> still gives 404 [17:16:50] <_joe_> let me look into the static dir [17:17:03] https://github.com/wikimedia/operations-mediawiki-config/blob/master/w/static/apple-touch/wikidata.png [17:17:09] The file is there in the repo at least [17:17:42] <_joe_> the file is not on mw1017 [17:17:50] <_joe_> afaict [17:18:01] hmm [17:18:11] It looks like I added it to test.wikidata, not wikidata [17:18:33] Why didn't I do wikidata? [17:18:42] I'll make a patch quickly [17:18:44] <_joe_> sorry, it's there, and yes, I was about to say :) [17:18:46] <_joe_> ok [17:19:08] <_joe_> I'll go on with your other patch, effects should be felt in ~20 minutes [17:19:17] https://test.wikidata.org/apple-touch-icon.png does work [17:22:02] (03PS1) 10Reedy: Actually add apple-touch-icon.png to Wikidata [puppet] - 10https://gerrit.wikimedia.org/r/264095 [17:22:41] (03CR) 10Reedy: "Noting this actually added it to testwikidatawiki not wikidatawiki" [puppet] - 10https://gerrit.wikimedia.org/r/256440 (owner: 10Reedy) [17:23:12] 6operations, 5Patch-For-Review, 7Regression: [Regression] 404 Not Found: https://en.wikipedia.org/apple-touch-icon.png - https://phabricator.wikimedia.org/T115965#1934608 (10Reedy) 5Open>3Resolved a:3Reedy [17:24:10] 7Puppet, 6operations: puppet compiler runs fail when backup::host is included on host - https://phabricator.wikimedia.org/T122909#1934610 (10akosiaris) That's the module operator and is meant to make sure we get something between 0 and 6. The problem is that the module operator can not be applied to nil which... 
[17:25:08] (03CR) 10Giuseppe Lavagetto: [C: 032] Actually add apple-touch-icon.png to Wikidata [puppet] - 10https://gerrit.wikimedia.org/r/264095 (owner: 10Reedy) [17:25:24] 1 file changed, 1 insertion(+), 8 deletions(-) [17:25:30] I love stuff like that [17:25:37] Remove stuff, make it easier to read, unbreak stuff too [17:26:34] <_joe_> there are a lot of low-hanging fruits in our apache config [17:26:44] Yeah [17:26:52] <_joe_> and if I dare to think how much cruft we already removed... [17:27:23] I think we can get rid of most of the rest of the docroots too [17:27:34] Allowing much more common code to be refactored out [17:27:44] shame I can't get anyone to +2 my commit to it [17:27:44] * _joe_ nod [17:27:56] <_joe_> Krenair: which one? [17:28:33] https://gerrit.wikimedia.org/r/#/c/244237/ [17:28:34] presumably [17:28:34] only one I have open at the moment is https://gerrit.wikimedia.org/r/#/c/244237/ [17:29:05] <_joe_> Reedy: I'm going to merge 256441 now [17:29:06] * ostriches mashes his +1 button [17:29:13] Thanks! [17:29:16] (03PS4) 10Giuseppe Lavagetto: Remove www.([a-z-]+) rewrites [puppet] - 10https://gerrit.wikimedia.org/r/256441 (https://phabricator.wikimedia.org/T120143) (owner: 10Reedy) [17:29:19] (03CR) 10Chad: [C: 031] "duh." [puppet] - 10https://gerrit.wikimedia.org/r/244237 (https://phabricator.wikimedia.org/T86644) (owner: 10Alex Monk) [17:29:46] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/256441 (https://phabricator.wikimedia.org/T120143) (owner: 10Reedy) [17:30:16] (03PS1) 10Muehlenhoff: Don't automatically update openssh-client [puppet] - 10https://gerrit.wikimedia.org/r/264096 [17:31:10] <_joe_> Reedy: actually, this is strange, uhm [17:31:46] ? [17:31:49] <_joe_> Reedy: that rewrite for www.m does work on apache [17:32:03] <_joe_> but it's varnish that decides to serve the desktop wikipedia in that case... 
[17:32:08] Haha [17:32:31] <_joe_> so no effect anyways, let's go [17:32:54] 6operations, 7Mobile, 5Patch-For-Review: Investigate if www.m.wikipedia.org needs to stay around - https://phabricator.wikimedia.org/T120143#1934640 (10Reedy) ``` [17:31:10] <_joe_> Reedy: actually, this is strange, uhm [17:31:46] ? [17:31:49] <_joe_> Reedy: that rewrite for www.m does work on apache... [17:32:55] I guess if they want it to stay, they'll need ops involvement to fix it anyway [17:33:08] <_joe_> yes [17:33:25] <_joe_> if they want it to /work/ [17:34:18] <_joe_> ostriches: almost your turn [17:34:32] Yippie! [17:35:48] (03PS4) 10Giuseppe Lavagetto: Gerrit: Remove old gitweb redirects, broken [puppet] - 10https://gerrit.wikimedia.org/r/263927 (owner: 10Chad) [17:36:07] <_joe_> I just have one question: do you think this merits any announcement? [17:36:14] <_joe_> or we just turn this off? [17:36:58] (03CR) 10Giuseppe Lavagetto: [C: 032] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/263927 (owner: 10Chad) [17:38:41] _joe_: Meh. 90% traffic for last month was a bunch of broken bots pretending to be bing and google who provided no contact info [17:38:46] lol [17:38:46] <_joe_> ostriches: running the change now :) [17:38:59] <_joe_> the remaining 10% was you testing? [17:39:18] _joe_: Mostly :p [17:39:25] <_joe_> ostriches: {{done}} [17:39:51] wheeee [17:41:23] (03CR) 10Giuseppe Lavagetto: [C: 031] "Seems correct to me as well." [puppet] - 10https://gerrit.wikimedia.org/r/244237 (https://phabricator.wikimedia.org/T86644) (owner: 10Alex Monk) [17:42:00] <_joe_> Krenair: I'm going off now, maybe schedule that change for tuesday's PuppetSWAT? [17:42:24] _joe_, ok [17:44:02] PROBLEM - Host mw2027 is DOWN: PING CRITICAL - Packet loss = 100% [17:45:43] RECOVERY - Host mw2027 is UP: PING OK - Packet loss = 0%, RTA = 36.40 ms [17:49:56] legoktm, want to re-schedule https://gerrit.wikimedia.org/r/#/c/237686/ ? 
[17:50:02] ebernhardson, want to re-schedule https://gerrit.wikimedia.org/r/#/c/263988/ ? [17:58:34] 6operations, 10ops-eqiad: cr1-eqiad new patch zayo transit connection - https://phabricator.wikimedia.org/T123574#1934732 (10Cmjohnson) 5Open>3Resolved Complete, cable number is 3482 and description has been updated on router Zayo (IPYX/125449/001/ZYO) {3482} [10Gbps] no-mon [18:01:57] !log turning up BGP with Zayo in eqiad [18:02:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:07:05] Krenair: wmf9 is undeployed tomorrow so will just wait i think [18:07:41] anyone know where etcd debian packaging is? was expecting in gerrit operations/debs/??? but no [18:09:15] ebernhardson: if I'm asking the same question I would ask _joe_ :) [18:11:24] chasemp: makes sense, although i bet he's jet lagged atm [18:12:18] (03PS2) 10IoannisKydonis: Update Wikimedia's configuration after renaming $wgNetworkPerformanceSamplingFactor to $wgMediaViewerNetworkPerformanceSamplingFactor. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264091 [18:12:43] (03PS1) 10Alex Monk: Commit change to roll wikitech back to wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264099 [18:13:35] (03CR) 10Alex Monk: [C: 032] Commit change to roll wikitech back to wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264099 (owner: 10Alex Monk) [18:13:43] PROBLEM - puppet last run on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:13:54] 6operations, 10ops-eqiad, 10Analytics: Possible bad mem chip or slot on dbproxy1004 - https://phabricator.wikimedia.org/T123546#1934748 (10Cmjohnson) The server is out of warranty but I have several spare DIMM for the R610's on-site. It appears that DIMM A3 is bad and needs to be replaced. I will need abou...
[18:13:58] (03Merged) 10jenkins-bot: Commit change to roll wikitech back to wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264099 (owner: 10Alex Monk) [18:14:11] (03PS2) 10Dzahn: varnish/misc-web: enable caching for some static sites [puppet] - 10https://gerrit.wikimedia.org/r/263650 [18:15:09] (03CR) 10Dzahn: [C: 032] varnish/misc-web: enable caching for some static sites [puppet] - 10https://gerrit.wikimedia.org/r/263650 (owner: 10Dzahn) [18:16:45] 6operations, 10Salt: Move salt master to separate host from puppet master - https://phabricator.wikimedia.org/T115287#1934758 (10ArielGlenn) >>! In T115287#1934096, @MoritzMuehlenhoff wrote: > It's working well for me, I ran salt commands for the entire cluster using host-based '*' matching (both using batches... [18:17:13] PROBLEM - RAID on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:17:44] 6operations, 10ops-eqiad, 10Analytics: Possible bad mem chip or slot on dbproxy1004 - https://phabricator.wikimedia.org/T123546#1934760 (10Nuria) @ottomata: can you coordinate a 5 minutes outage today? [18:18:13] PROBLEM - nutcracker process on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:18:42] PROBLEM - SSH on mw1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:18:43] PROBLEM - configured eth on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:19:22] PROBLEM - dhclient process on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[18:20:22] RECOVERY - nutcracker process on mw1013 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [18:20:34] RECOVERY - SSH on mw1013 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [18:20:44] RECOVERY - configured eth on mw1013 is OK: OK - interfaces up [18:21:13] RECOVERY - dhclient process on mw1013 is OK: PROCS OK: 0 processes with command name dhclient [18:21:22] RECOVERY - RAID on mw1013 is OK: OK: no RAID installed [18:21:40] moritzm: I want to set up an openldap server in the testlabs cluster. If I reuse the openldap::labs will it automatically start syncing with the production cluster and messing things up? [18:22:26] (In a perfect world I’d get a one-time copy of the production data but never sync again) [18:23:40] (03PS1) 10Cmjohnson: Removing mgmt dns entries for calcium server....decommissioned [dns] - 10https://gerrit.wikimedia.org/r/264100 [18:24:21] 6operations, 10ops-eqiad, 10hardware-requests: Decommission calcium - https://phabricator.wikimedia.org/T116790#1934798 (10Cmjohnson) [18:26:40] 6operations, 7Mail: Move most (all?) 
exim personal aliases to OIT - https://phabricator.wikimedia.org/T122144#1934804 (10Dzahn) [18:26:41] 6operations, 7Mail: remove exim alias - rob - https://phabricator.wikimedia.org/T123602#1934802 (10Dzahn) 5Open>3Resolved done ``` -#Rob Halsell - whoever removed this stop doing it damn it -rob: rhalsell -robh: rob - ``` [18:29:44] (03PS2) 10Cmjohnson: Removing mgmt dns entries for calcium server....decommissioned [dns] - 10https://gerrit.wikimedia.org/r/264100 [18:31:31] 6operations, 7Mail: remove exim alias keynote@ - https://phabricator.wikimedia.org/T123646#1934830 (10Dzahn) 3NEW a:3Dzahn [18:33:33] 7Blocked-on-Operations, 6operations, 10RESTBase, 10hardware-requests: Expand SSD space in Cassandra cluster - https://phabricator.wikimedia.org/T121575#1934839 (10RobH) [18:33:34] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [18:33:42] 6operations, 10ops-eqiad, 10Analytics: Possible bad mem chip or slot on dbproxy1004 - https://phabricator.wikimedia.org/T123546#1934841 (10Ottomata) Eeee, I'm not so sure. Are we sure eventlogging is the only user of m4-master? [18:34:25] 6operations, 10ops-eqiad, 10hardware-requests: Decommission calcium - https://phabricator.wikimedia.org/T116790#1934848 (10Cmjohnson) Wiped - awaiting approval to decommission [18:34:33] 7Blocked-on-Operations, 6operations, 10RESTBase, 10hardware-requests: Expand SSD space in Cassandra cluster - https://phabricator.wikimedia.org/T121575#1882492 (10RobH) pushed back to public, please don't append #procurement to public tasks, use #hardware-requests (gabriel and i are sitting less than 5 fee... 
[18:34:45] 6operations, 10hardware-requests: migrate spares into google sheet tracking & determine which eqiad spares to decommission - https://phabricator.wikimedia.org/T120679#1934851 (10Cmjohnson) [18:34:46] 6operations, 10ops-eqiad, 10hardware-requests: Decommission calcium - https://phabricator.wikimedia.org/T116790#1934850 (10Cmjohnson) [18:35:09] (03CR) 10Cmjohnson: [C: 032] Removing mgmt dns entries for calcium server....decommissioned [dns] - 10https://gerrit.wikimedia.org/r/264100 (owner: 10Cmjohnson) [18:35:25] (03PS2) 10Andrew Bogott: Add openldap to labtestservices2001 [puppet] - 10https://gerrit.wikimedia.org/r/264053 [18:36:00] (03PS2) 10BBlack: eqiad misc-web addr fixes: 4/5 remove old addrs from LVS/caches [puppet] - 10https://gerrit.wikimedia.org/r/263613 (https://phabricator.wikimedia.org/T83110) [18:39:28] !log removing old eqiad misc-web IP (DNS switched 26h ago, TTLs are max 1h) [18:39:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:39:43] 6operations, 10ops-eqiad, 10hardware-requests: wipe neptunium and add back to spares - https://phabricator.wikimedia.org/T122101#1934875 (10Cmjohnson) Wiped and awaiting approval on decommissioning (Task T120679) [18:39:51] (03CR) 10BBlack: [C: 032] eqiad misc-web addr fixes: 4/5 remove old addrs from LVS/caches [puppet] - 10https://gerrit.wikimedia.org/r/263613 (https://phabricator.wikimedia.org/T83110) (owner: 10BBlack) [18:40:12] !log removing old eqiad misc-web IP (DNS switched 50h ago (not 26 like above), TTLs are max 1h) [18:40:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:40:45] (03CR) 10Gergő Tisza: [C: 04-1] Update Wikimedia's configuration after renaming $wgNetworkPerformanceSamplingFactor to $wgMediaViewerNetworkPerformanceSamplingFactor. 
(031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264091 (owner: 10IoannisKydonis) [18:41:14] (03PS3) 10Andrew Bogott: Add openldap to labtestservices2001 [puppet] - 10https://gerrit.wikimedia.org/r/264053 [18:44:19] (03PS4) 10Andrew Bogott: Add openldap to labtestservices2001 [puppet] - 10https://gerrit.wikimedia.org/r/264053 [18:46:43] (03CR) 10Andrew Bogott: [C: 032] Add openldap to labtestservices2001 [puppet] - 10https://gerrit.wikimedia.org/r/264053 (owner: 10Andrew Bogott) [18:48:33] 6operations, 7Mail: Move most (all?) exim personal aliases to OIT - https://phabricator.wikimedia.org/T122144#1897598 (10Dzahn) [18:48:34] 6operations, 7Mail: remove exim alias -- eekim - https://phabricator.wikimedia.org/T123572#1934913 (10Dzahn) 5Open>3Resolved done ``` -# Eugene Erik Kim -ekim: eekim@blueoxen.com -eekim: ekim - ``` [18:49:30] (03PS3) 10BBlack: eqiad misc-web addr fixes: 5/5 remove old reverse DNS [dns] - 10https://gerrit.wikimedia.org/r/263611 (https://phabricator.wikimedia.org/T83110) [18:50:34] (03CR) 10BBlack: [C: 032] eqiad misc-web addr fixes: 5/5 remove old reverse DNS [dns] - 10https://gerrit.wikimedia.org/r/263611 (https://phabricator.wikimedia.org/T83110) (owner: 10BBlack) [18:51:08] (03PS1) 10Andrew Bogott: Add firewall to labtestservices2001.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/264103 [18:52:20] 6operations, 7Mail: Move most (all?) 
exim personal aliases to OIT - https://phabricator.wikimedia.org/T122144#1934933 (10Dzahn) [18:52:22] 6operations, 7Mail: remove exim alias - corissa - https://phabricator.wikimedia.org/T123578#1934931 (10Dzahn) 5Open>3Resolved done ``` -#Corissa E-mail spelling fix -chauskencht: chausknecht - ``` [18:52:40] (03CR) 10Andrew Bogott: [C: 032] Add firewall to labtestservices2001.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/264103 (owner: 10Andrew Bogott) [18:55:19] andrewbogott: I just powered off labsdb1006 accidentally [18:55:50] (03PS3) 10IoannisKydonis: Update Wikimedia's configuration after renaming $wgNetworkPerformanceSamplingFactor to $wgMediaViewerNetworkPerformanceSamplingFactor. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264091 [18:56:12] (03CR) 10jenkins-bot: [V: 04-1] Update Wikimedia's configuration after renaming $wgNetworkPerformanceSamplingFactor to $wgMediaViewerNetworkPerformanceSamplingFactor. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264091 (owner: 10IoannisKydonis) [18:56:29] cmjohnson1: good to know :) I think that’s used for maps things, hopefully it will recover before anyone cares. [18:56:52] PROBLEM - Host labsdb1006 is DOWN: PING CRITICAL - Packet loss = 100% [18:58:04] (03PS4) 10IoannisKydonis: Update Wikimedia's configuration after renaming $wgNetworkPerformanceSamplingFactor to $wgMediaViewerNetworkPerformanceSamplingFactor. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264091 [18:58:23] RECOVERY - Host labsdb1006 is UP: PING OK - Packet loss = 0%, RTA = 0.43 ms [19:00:04] marxarelli: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160114T1900). 
[19:00:40] (03PS1) 10Dduvall: all wikis to 1.27.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264106 [19:00:44] andrewbogott: no, you should be able to use it independently, it only syncs if a $master is configured [19:00:56] moritzm: ok, thanks [19:01:11] moritzm: I forked the role for now, but maybe I can roll things back together later [19:01:18] andrewbogott: I'm off for dinner, I can review a patch later if you want [19:01:41] (03CR) 10Dduvall: [C: 032] all wikis to 1.27.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264106 (owner: 10Dduvall) [19:01:50] do you want an identical test setup compared to the production server or something different? [19:02:04] (03Merged) 10jenkins-bot: all wikis to 1.27.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264106 (owner: 10Dduvall) [19:02:09] 6operations, 10Gerrit, 10hardware-requests: Need spare server to upgrade/migrate gerrit - https://phabricator.wikimedia.org/T123132#1934972 (10Cmjohnson) [19:02:11] 6operations, 10ops-eqiad, 10Gerrit, 10hardware-requests: check for memory upgrade for lead - https://phabricator.wikimedia.org/T123531#1934970 (10Cmjohnson) 5Open>3Resolved I had extra DIMM on-site and added to lead. Total Memory is 32GB. [19:02:45] !log dduvall@tin rebuilt wikiversions.php and synchronized wikiversions files: all wikis to 1.27.0-wmf.10 [19:02:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:03:03] 6operations, 10Gerrit, 10hardware-requests: Need spare server to upgrade/migrate gerrit - https://phabricator.wikimedia.org/T123132#1921926 (10Cmjohnson) I was able to increase the amount of memory on lead by 16GB for a total of 32GB.
[19:03:05] (03PS1) 10Andrew Bogott: Add a placeholder mw_appserver_networks for labtest [puppet] - 10https://gerrit.wikimedia.org/r/264107 [19:05:26] (03CR) 10Andrew Bogott: [C: 032] Add a placeholder mw_appserver_networks for labtest [puppet] - 10https://gerrit.wikimedia.org/r/264107 (owner: 10Andrew Bogott) [19:06:52] PROBLEM - puppet last run on wtp2006 is CRITICAL: CRITICAL: puppet fail [19:11:00] 6operations, 10ops-eqiad, 10hardware-requests: reclaim rubidium to spares - https://phabricator.wikimedia.org/T118213#1934996 (10Cmjohnson) Disk wiped. changing this to decommission ticket...blocked by task T120679 [19:11:22] 6operations, 10hardware-requests: migrate spares into google sheet tracking & determine which eqiad spares to decommission - https://phabricator.wikimedia.org/T120679#1858826 (10Cmjohnson) [19:11:23] 6operations, 10ops-eqiad, 10hardware-requests: reclaim rubidium to spares - https://phabricator.wikimedia.org/T118213#1934998 (10Cmjohnson) [19:11:38] 6operations, 10ops-eqiad, 10hardware-requests: Decommission rubidium - https://phabricator.wikimedia.org/T118213#1935011 (10Cmjohnson) [19:12:03] 6operations, 10hardware-requests: migrate spares into google sheet tracking & determine which eqiad spares to decommission - https://phabricator.wikimedia.org/T120679#1858826 (10Cmjohnson) [19:12:05] 6operations, 10ops-eqiad, 10hardware-requests: wipe neptunium and add back to spares - https://phabricator.wikimedia.org/T122101#1935012 (10Cmjohnson) [19:12:16] 6operations, 10ops-eqiad, 10hardware-requests: Decommission neptunium - https://phabricator.wikimedia.org/T122101#1935014 (10Cmjohnson) [19:14:42] marxarelli: congrats, btw :) [19:14:59] greg-g: thanks :) [19:15:03] next week will be more real, but baby steps :) [19:15:07] 6operations, 10ops-eqiad, 5Patch-For-Review: Decommission plutonium - https://phabricator.wikimedia.org/T118586#1935023 (10Cmjohnson) [19:15:56] (03PS5) 10Gergő Tisza: Update Wikimedia's configuration [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/264091 (owner: 10IoannisKydonis) [19:16:07] (03CR) 10Gergő Tisza: [C: 031] Update Wikimedia's configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264091 (owner: 10IoannisKydonis) [19:16:44] 6operations, 10ops-eqiad, 5Patch-For-Review: Decommission plutonium - https://phabricator.wikimedia.org/T118586#1804229 (10Cmjohnson) Disks are wiped...waiting on decommission approval task T120679 [19:16:53] 6operations, 10hardware-requests: migrate spares into google sheet tracking & determine which eqiad spares to decommission - https://phabricator.wikimedia.org/T120679#1858826 (10Cmjohnson) [19:16:54] 6operations, 10ops-eqiad, 5Patch-For-Review: Decommission plutonium - https://phabricator.wikimedia.org/T118586#1935029 (10Cmjohnson) [19:17:03] greg-g: got pretty much through the deploy on tuesday so i feel confident enough to break stuff next week [19:17:16] 6operations, 10ops-eqiad, 10hardware-requests: Decommission calcium - https://phabricator.wikimedia.org/T116790#1935032 (10Cmjohnson) p:5Normal>3Low [19:17:25] gotta earn a tshirt at some point [19:17:28] marxarelli: Woo. [19:17:29] 6operations, 10ops-eqiad, 10hardware-requests: Decommission neptunium - https://phabricator.wikimedia.org/T122101#1935033 (10Cmjohnson) p:5Normal>3Low [19:17:33] +1 [19:17:38] marxarelli: :) :) [19:17:46] 6operations, 10ops-eqiad, 10hardware-requests: Decommission rubidium - https://phabricator.wikimedia.org/T118213#1935034 (10Cmjohnson) p:5Normal>3Low [19:17:50] Deploy t-shirts for everyone. [19:18:26] 7Blocked-on-Operations, 6operations, 10ops-eqiad: reclaim erbium, gadolinium into spares - https://phabricator.wikimedia.org/T123029#1935036 (10Cmjohnson) a:3Cmjohnson [19:20:38] 6operations, 7Mail: remove or update ea@ alias? 
- https://phabricator.wikimedia.org/T123286#1935044 (10JKrauska) will convert to google group with coordination of user(s) [19:21:52] marxarelli, we just deliberately moved wikitech back to wmf.9 [19:21:55] you reverted it? [19:22:42] Krenair: scheiße. i did [19:23:15] Krenair: it's on wmf.10 that is [19:23:19] it's git-stashed [19:23:22] still, right? [19:23:23] no [19:23:25] I committed it [19:23:42] PROBLEM - Host erbium is DOWN: PING CRITICAL - Packet loss = 100% [19:24:40] andrewbogott, Reedy ^ [19:24:44] Krenair: so, i should downgrade the wikitech version and resync? [19:24:48] yup [19:25:06] yes please [19:25:26] dur... what's wikitech's id again? [19:25:29] labswiki [19:25:35] got it [19:26:02] the mis-matching db prefixes are set up in MWMultiversion::setSiteInfoForWiki [19:26:21] (03PS6) 10Gergő Tisza: Update MediaViewer configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264091 (owner: 10IoannisKydonis) [19:26:25] MWMultiVersion::setSiteInfoForWiki* [19:27:34] (03PS1) 10BBlack: Add new geoiplookup IPs to DNS, start using them [dns] - 10https://gerrit.wikimedia.org/r/264111 (https://phabricator.wikimedia.org/T121922) [19:27:39] Krenair: site.pp says erbium 'is currently spare.' 
[19:27:46] I Don’t know who’s working on it though
[19:28:00] oh, wait, you were pointing higher up
[19:28:04] (PS1) BBlack: text LVS: add new IPv4-only for geoiplookup [puppet] - https://gerrit.wikimedia.org/r/264112 (https://phabricator.wikimedia.org/T121922)
[19:28:11] (PS1) Dduvall: Rollback wikitech (labswiki) back to wmf.9 [mediawiki-config] - https://gerrit.wikimedia.org/r/264113
[19:29:13] (CR) Dduvall: [C: 2] Rollback wikitech (labswiki) back to wmf.9 [mediawiki-config] - https://gerrit.wikimedia.org/r/264113 (owner: Dduvall)
[19:29:52] (Merged) jenkins-bot: Rollback wikitech (labswiki) back to wmf.9 [mediawiki-config] - https://gerrit.wikimedia.org/r/264113 (owner: Dduvall)
[19:31:28] !log dduvall@tin rebuilt wikiversions.php and synchronized wikiversions files: rollback labswiki to wmf.9
[19:31:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:32:35] Krenair: k. Special:Version looks right
[19:32:45] thanks marxarelli
[19:34:03] RECOVERY - puppet last run on wtp2006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:34:19] i went to double check the enwiki version on Special:Version and didn't notice that chrome's history autocompleted and sent me to beta.
seeing 1.27alpha almost gave me a heart attack :)
[19:34:43] (PS7) Gergő Tisza: Update MediaViewer configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/264091 (owner: IoannisKydonis)
[19:34:54] (PS8) Gergő Tisza: Update MediaViewer configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/264091 (owner: IoannisKydonis)
[19:37:41] (CR) BBlack: [C: 2] text LVS: add new IPv4-only for geoiplookup [puppet] - https://gerrit.wikimedia.org/r/264112 (https://phabricator.wikimedia.org/T121922) (owner: BBlack)
[19:38:04] (PS1) Ottomata: Batch eventlogging_sync custom mysqldump replication [puppet] - https://gerrit.wikimedia.org/r/264115 (https://phabricator.wikimedia.org/T123634)
[19:38:34] marxarelli: relatedly, I only go to enwiki in my work browser profile when things are broken: http://i.imgur.com/P2Zf1F6.png
[19:38:49] haha
[19:40:49] (CR) Ottomata: [C: 2] Batch eventlogging_sync custom mysqldump replication [puppet] - https://gerrit.wikimedia.org/r/264115 (https://phabricator.wikimedia.org/T123634) (owner: Ottomata)
[19:44:01] Blocked-on-Operations, operations, ops-eqiad: reclaim erbium, gadolinium into spares - https://phabricator.wikimedia.org/T123029#1935121 (Cmjohnson) added to spares list...not wiped
[19:44:37] operations, ops-eqiad: Decommission cp1037-1040 - https://phabricator.wikimedia.org/T83553#1935122 (Cmjohnson)
[19:47:48] (PS1) Andrew Bogott: Add server_id arg for labtest openldap [puppet] - https://gerrit.wikimedia.org/r/264117
[19:49:56] (CR) Andrew Bogott: [C: 2] Add server_id arg for labtest openldap [puppet] - https://gerrit.wikimedia.org/r/264117 (owner: Andrew Bogott)
[19:55:19] !log restarted eventlogging_sync script to insert batches of 1000
[19:55:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:56:20] (CR) Yuvipanda: wikimetrics: Puppet module for wikimetrics (7 comments) [puppet] -
https://gerrit.wikimedia.org/r/260687 (https://phabricator.wikimedia.org/T101763) (owner: Madhuvishy)
[19:58:27] (CR) BBlack: [C: 2] Add new geoiplookup IPs to DNS, start using them [dns] - https://gerrit.wikimedia.org/r/264111 (https://phabricator.wikimedia.org/T121922) (owner: BBlack)
[20:02:00] operations, Fundraising-Backlog, Traffic, Unplanned-Sprint-Work, and 2 others: Firefox SPDY-coalesces requests to geoiplookup over text-lb, causing GeoIP IPv6 failures - https://phabricator.wikimedia.org/T121922#1935151 (BBlack) I've implemented the "separate IP" fix for now so we can get past thi...
[20:02:59] (PS1) Ottomata: Add dependency to eventlogging_sync.sh to ensure it is in place before service is started [puppet] - https://gerrit.wikimedia.org/r/264121
[20:03:04] mutante: btw, your ores check icinga check worked perfectly!
[20:03:12] morebots: alerted us in time and we managed to fix it quickl
[20:03:13] I am a logbot running on tools-exec-1214.
[20:03:13] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log.
[20:03:13] To log a message, type !log .
[20:03:15] y
[20:03:17] thanks a lot!
[20:03:18] bah, mutante, not morebots
[20:03:21] (CR) Ottomata: [C: 2 V: 2] Add dependency to eventlogging_sync.sh to ensure it is in place before service is started [puppet] - https://gerrit.wikimedia.org/r/264121 (owner: Ottomata)
[20:04:15] YuviPanda: :)
[20:07:14] operations, Mobile, Patch-For-Review: Investigate if www.m.wikipedia.org needs to stay around - https://phabricator.wikimedia.org/T120143#1935166 (Dzahn) So if it's been removed from Apache now, do we delete it from DNS?
[20:10:26] operations, Mail: Move most (all?)
exim personal aliases to OIT - https://phabricator.wikimedia.org/T122144#1935169 (Dzahn)
[20:10:27] operations, Mail: remove exim alias - aradhana - https://phabricator.wikimedia.org/T123576#1935167 (Dzahn) Open>Resolved done ``` -# aradhana -aradhana: aravindra - ```
[20:11:12] (PS1) Alex Monk: Add my yubikey [puppet] - https://gerrit.wikimedia.org/r/264125
[20:11:19] (CR) Madhuvishy: wikimetrics: Puppet module for wikimetrics (7 comments) [puppet] - https://gerrit.wikimedia.org/r/260687 (https://phabricator.wikimedia.org/T101763) (owner: Madhuvishy)
[20:11:33] (PS11) Madhuvishy: wikimetrics: Puppet module for wikimetrics [puppet] - https://gerrit.wikimedia.org/r/260687 (https://phabricator.wikimedia.org/T101763)
[20:14:35] (PS1) Ottomata: Order by timestamp instead of by uuid. UUID is an unordered string anyway. [puppet] - https://gerrit.wikimedia.org/r/264126
[20:15:01] (CR) Ottomata: [C: 2 V: 2] Order by timestamp instead of by uuid. UUID is an unordered string anyway. [puppet] - https://gerrit.wikimedia.org/r/264126 (owner: Ottomata)
[20:22:05] PROBLEM - Apache HTTP on mw1129 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:22:24] PROBLEM - puppet last run on mw1129 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:22:25] PROBLEM - HHVM rendering on mw1129 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:22:45] PROBLEM - nutcracker port on mw1129 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:23:05] PROBLEM - RAID on mw1129 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:23:24] PROBLEM - configured eth on mw1129 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:23:25] PROBLEM - DPKG on mw1129 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:23:35] PROBLEM - dhclient process on mw1129 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:23:54] PROBLEM - Check size of conntrack table on mw1129 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:23:54] PROBLEM - SSH on mw1129 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:23:56] PROBLEM - HHVM processes on mw1129 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:24:05] PROBLEM - nutcracker process on mw1129 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:24:05] PROBLEM - Disk space on mw1129 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:24:45] PROBLEM - salt-minion processes on mw1129 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:26:46] RECOVERY - salt-minion processes on mw1129 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[20:26:55] RECOVERY - nutcracker port on mw1129 is OK: TCP OK - 0.000 second response time on port 11212
[20:27:24] RECOVERY - configured eth on mw1129 is OK: OK - interfaces up
[20:27:25] RECOVERY - DPKG on mw1129 is OK: All packages OK
[20:27:44] RECOVERY - dhclient process on mw1129 is OK: PROCS OK: 0 processes with command name dhclient
[20:27:54] RECOVERY - Check size of conntrack table on mw1129 is OK: OK: nf_conntrack is 0 % full
[20:27:55] RECOVERY - SSH on mw1129 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.4 (protocol 2.0)
[20:28:04] RECOVERY - HHVM processes on mw1129 is OK: PROCS OK: 6 processes with command name hhvm
[20:28:05] RECOVERY - nutcracker process on mw1129 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[20:28:06] RECOVERY - Disk space on mw1129 is OK: DISK OK
[20:29:14] RECOVERY - RAID on mw1129 is OK: OK: no RAID installed
[20:36:09] (PS12) Madhuvishy: wikimetrics: Puppet module for wikimetrics [puppet] - https://gerrit.wikimedia.org/r/260687 (https://phabricator.wikimedia.org/T101763)
[20:36:43] (CR) Yuvipanda: [C: 2 V: 2] "Another one bites the dust!"
[puppet] - https://gerrit.wikimedia.org/r/260687 (https://phabricator.wikimedia.org/T101763) (owner: Madhuvishy)
[20:37:16] yay
[20:43:22] (PS1) Yuvipanda: Revert "wikimetrics: Puppet module for wikimetrics" [puppet] - https://gerrit.wikimedia.org/r/264130
[20:43:30] (CR) Yuvipanda: [C: 2] Revert "wikimetrics: Puppet module for wikimetrics" [puppet] - https://gerrit.wikimedia.org/r/264130 (owner: Yuvipanda)
[20:44:07] (CR) Yuvipanda: [V: 2] Revert "wikimetrics: Puppet module for wikimetrics" [puppet] - https://gerrit.wikimedia.org/r/264130 (owner: Yuvipanda)
[20:48:35] (PS1) Madhuvishy: Remove submodule wikimetrics [puppet] - https://gerrit.wikimedia.org/r/264133
[20:49:14] RECOVERY - puppet last run on mw1129 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[20:49:15] YuviPanda: ^
[20:50:39] (CR) Yuvipanda: [C: 2] "Let's try this again, with less hubris." [puppet] - https://gerrit.wikimedia.org/r/264133 (owner: Madhuvishy)
[20:55:47] (PS1) Yuvipanda: Revert "Revert "wikimetrics: Puppet module for wikimetrics"" [puppet] - https://gerrit.wikimedia.org/r/264134
[20:55:57] madhuvishy: ^
[20:55:59] it is :D
[20:56:24] mmm hmmm
[20:56:28] this is a new patch
[20:56:32] madhuvishy: yeah
[20:56:33] * madhuvishy is very confused
[20:56:35] madhuvishy: reverts are always new patches
[20:56:47] so this needs a manual rebase
[20:56:50] and then we can merge it
[20:56:53] so i'll checkout this patch?
[20:56:56] yeah
[20:56:58] and do
[20:57:00] 'git fetch'
[20:57:05] git rebase ori production
[20:57:08] err
[20:57:10] origin/production
[20:57:18] and see what happens
[20:58:30] <_joe_> ori: how does it feel to be rebased onto production?
[20:58:47] First we have an R wrapper for Yuvi
[20:58:50] _joe_: we can play him too
[20:59:02] Now we're rebasing Ori onto production
[20:59:02] I'm sensing a pattern
[20:59:49] (PS2) Madhuvishy: Wikimetrics: Puppet module for wikimetrics [puppet] - https://gerrit.wikimedia.org/r/264134 (https://phabricator.wikimedia.org/T101763) (owner: Yuvipanda)
[21:00:06] YuviPanda: may be ^?
[21:00:29] madhuvishy: yup, that looks good to me
[21:00:31] YuviPanda: it's your module now :D
[21:00:39] madhuvishy: no, I'll reset author info :P
[21:01:04] (PS4) Subramanya Sastry: Add parsoid::testing role and use it on ruthenium [puppet] - https://gerrit.wikimedia.org/r/264024
[21:01:06] (PS2) Subramanya Sastry: WIP: Add the visualdiff module; instantiate psd visualdiffing services [puppet] - https://gerrit.wikimedia.org/r/264032
[21:01:10] madhuvishy: now I need to manually remove the modules/wikimetrics/* from all self hosted puppetmasters before merging this patch
[21:01:20] this is why we hate submodules (and self hosted puppetmasters)
[21:01:26] elukey: ^ :)
[21:01:44] YuviPanda: oh
[21:01:55] you will probably break wikimetrics prod then
[21:02:08] madhuvishy: it's already broken no?
[21:02:41] YuviPanda: no - the site is up - but its probably okay
[21:02:42] madhuvishy: you can do 'service salt-minion stop'
[21:02:52] madhuvishy: to prevent that instance from being reachable by salt for sure
[21:02:57] madhuvishy: but that fucks up the instance more
[21:03:02] Blocked-on-Operations, Dumps-Generation, Flow, Collaboration-Team-Current, Patch-For-Review: Publish recurring Flow dumps at http://dumps.wikimedia.org/ - https://phabricator.wikimedia.org/T119511#1935280 (ArielGlenn) From discussion on IRC with Reedy, Mattflaschen: Let's get the change https:/...
[21:03:31] YuviPanda: checking with Dan on analytics
[21:03:37] madhuvishy: ok
[21:13:47] operations, Gerrit, GitHub-Mirrors, ValueView, Wikidata: [Task] Redirect unused extensions/ValueView repository to data-values/value-view - https://phabricator.wikimedia.org/T123624#1935320 (JanZerebecki)
[21:16:51] operations, Mail: Move most (all?) exim personal aliases to OIT - https://phabricator.wikimedia.org/T122144#1935339 (Dzahn)
[21:16:53] operations, Mail: remove exim alias keynote@ - https://phabricator.wikimedia.org/T123646#1935337 (Dzahn) Open>Resolved done ``` -keynote: mark, rob, ariel - ```
[21:21:31] YuviPanda: this is why I was asking if there was a "suggested/consolidated" way to to these things in Labs
[21:23:24] Otherwise each time you have 50% of chance of causing rash and anger to an opsen
[21:23:28] :)
[21:23:34] :P
[21:25:23] operations, Mail: remove exim alias feedbacktest@ - https://phabricator.wikimedia.org/T123665#1935351 (Dzahn) NEW a:Dzahn
[21:27:13] operations, Mail: delete exim alias wikilibrary@ library@ - https://phabricator.wikimedia.org/T123666#1935363 (Dzahn) NEW a:Dzahn
[21:27:23] operations, Mail: delete exim alias wikilibrary@ library@ - https://phabricator.wikimedia.org/T123666#1935363 (Dzahn) a:Dzahn>JKrauska
[21:27:48] operations, Mail: delete exim alias wikilibrary@ library@ - https://phabricator.wikimedia.org/T123666#1935363 (Dzahn)
[21:28:53] operations, Mail: delete exim alias vpe-staff: eng-mgt - https://phabricator.wikimedia.org/T123667#1935378 (Dzahn) NEW a:Dzahn
[21:29:10] operations, Mail: delete exim alias vpe-staff: eng-mgt - https://phabricator.wikimedia.org/T123667#1935378 (Dzahn) a:Dzahn>JKrauska
[21:31:05] operations, Mail: remove/update office@wikipedia.org alias - https://phabricator.wikimedia.org/T123669#1935398 (Dzahn) NEW a:Dzahn
[21:31:24] operations, Mail: remove/update office@wikipedia.org alias - https://phabricator.wikimedia.org/T123669#1935398
(Dzahn) a:Dzahn>JKrauska
[21:33:23] operations, Mail: remove wikibugs-irc mail alias ? - https://phabricator.wikimedia.org/T123432#1935408 (Dzahn) I just did the same removal on the **wikiPedia.org** file, it existed there as well separately. ``` -# Bug mail to IRC bridge -wikibugs-irc: |/usr/local/bin/wikibugs.pl - ```
[21:35:05] operations, Mail: remove staff@wikipedia.org - https://phabricator.wikimedia.org/T123670#1935409 (Dzahn) NEW a:Dzahn
[21:35:20] operations, Mail: remove staff@wikipedia.org - https://phabricator.wikimedia.org/T123670#1935409 (Dzahn) a:Dzahn>JKrauska
[21:35:35] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge.
[21:38:59] operations, Mail: remove shop@wikipedia.org -> board@wikipedia.org - https://phabricator.wikimedia.org/T123672#1935434 (Dzahn) NEW a:Dzahn
[21:47:08] operations: reinstall bast4001 with jessie - https://phabricator.wikimedia.org/T123674#1935454 (Dzahn) NEW a:Dzahn
[21:48:41] operations: reinstall redis servers with jessie - https://phabricator.wikimedia.org/T123675#1935461 (Dzahn) NEW
[21:49:28] (PS3) Yuvipanda: Wikimetrics: Puppet module for wikimetrics [puppet] - https://gerrit.wikimedia.org/r/264134 (https://phabricator.wikimedia.org/T101763)
[21:50:09] madhuvishy: I reset author, so now it's you :)
[21:50:29] YuviPanda: :)
[21:50:34] operations: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#1935474 (Dzahn)
[21:52:25] (CR) Mattflaschen: [C: -1] "This preserves the existing behavior AFAICT, but should we additionally exclude fishbowl?" [mediawiki-config] - https://gerrit.wikimedia.org/r/250460 (owner: Mattflaschen)
[21:54:39] operations, Mail: remove exim alias oliver - https://phabricator.wikimedia.org/T123676#1935478 (eliza) NEW a:Dzahn
[22:00:30] (PS1) Andrew Bogott: Have labtest use the ldap server on labtestservices2001.
[puppet] - https://gerrit.wikimedia.org/r/264192
[22:00:43] YuviPanda: You rock.
[22:01:15] :D
[22:01:23] however, someone needs to do the actual move
[22:02:07] (CR) Andrew Bogott: [C: 2] Have labtest use the ldap server on labtestservices2001. [puppet] - https://gerrit.wikimedia.org/r/264192 (owner: Andrew Bogott)
[22:09:02] andre__: Project creators?
[22:12:15] James_F, well I assume it would have its own project so people can watch it, if you don't want to make #Labs and such more noisy? :)
[22:13:48] andre__: A project for what?
[22:14:11] andre__: This wouldn't be a Maniphest thing, I think. But greg-g probably knows more.
[22:14:32] James_F, for Tools access requests.
[22:14:40] at least that's how I got it. Feel free to correct me.
[22:14:52] It's not just turning into "file a task", I think.
[22:15:03] There's development work first. Or something.
[22:15:09] oh. that's what I assumed
[22:17:03] !log deployed patch for T122807
[22:17:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:18:36] operations, Mail: remove exim alias oliver - https://phabricator.wikimedia.org/T123676#1935557 (Dzahn)
[22:18:50] operations, Mail: remove exim alias oliver - https://phabricator.wikimedia.org/T123676#1935478 (Dzahn)
[22:22:16] operations, Mail: remove exim alias oliver - https://phabricator.wikimedia.org/T123676#1935573 (Ironholds_backup) Why?
[22:24:21] operations, Mail: remove exim alias oliver - https://phabricator.wikimedia.org/T123676#1935584 (Ironholds_backup) Okay, just found out the why. Cool! :D
[22:27:40] (PS4) Yuvipanda: Wikimetrics: Puppet module for wikimetrics [puppet] - https://gerrit.wikimedia.org/r/264134 (https://phabricator.wikimedia.org/T101763)
[22:27:52] (CR) Yuvipanda: [C: 2 V: 2] "Let's try again!"
[puppet] - https://gerrit.wikimedia.org/r/264134 (https://phabricator.wikimedia.org/T101763) (owner: Yuvipanda)
[22:33:50] Ops-Access-Requests, operations: add datacenter-ops to dhcp /install-server - https://phabricator.wikimedia.org/T123681#1935599 (Dzahn) NEW
[22:34:12] Ops-Access-Requests, operations: add datacenter-ops to dhcp /install-server - https://phabricator.wikimedia.org/T123681#1935607 (Dzahn)
[22:37:33] operations, Mail: remove exim aliases - wikimedia brazil - https://phabricator.wikimedia.org/T123682#1935626 (JKrauska) NEW a:Dzahn
[22:40:40] operations, Mail: remove/update office@wikipedia.org alias - https://phabricator.wikimedia.org/T123669#1935654 (JKrauska) no opinion -- I don't user wikipedia.org
[22:40:52] operations, Mail: remove/update office@wikipedia.org alias - https://phabricator.wikimedia.org/T123669#1935655 (JKrauska) a:JKrauska>Dzahn
[22:47:40] operations, Mail: delete exim alias wikilibrary@ library@ - https://phabricator.wikimedia.org/T123666#1935687 (JKrauska) I believe I can rework this as secondary ldap groups and google group aliases.
[22:48:04] PROBLEM - Unmerged changes on repository puppet on labcontrol1002 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[22:48:51] hmm
[22:49:31] fixed it
[22:50:05] RECOVERY - Unmerged changes on repository puppet on labcontrol1002 is OK: No changes to merge.
[22:58:54] operations: Add Marc Brent to fr-all - https://phabricator.wikimedia.org/T122972#1935727 (Dzahn) Hi, sorry for the delay on our side. But yes, this is already done. Just checked and mbrent@ is on fr-development@ and that is on fr-all@ Maybe there were 2 tickets, yep.
[23:03:30] operations: Add Marc Brent to fr-all - https://phabricator.wikimedia.org/T122972#1935732 (Dzahn) Open>Resolved a:Dzahn
[23:03:40] operations, Mail: Add Marc Brent to fr-all - https://phabricator.wikimedia.org/T122972#1918241 (Dzahn)
[23:09:31] hi greg-g, around?
would like to schedule something for puppet swat next week but there's no calendar yet
[23:10:50] Krenair: gimme 2 minutes, just doing that now :)
[23:11:34] okay, thanks. would it be possible to have two weeks up at a time? or is that effort to maintain?
[23:12:26] Krenair: 2 weeks added
[23:12:46] I was doing that (2 weeks in the future) before the holidays, then... vacation and devsummit/all hands it slipped down my priority list
[23:12:52] ah :)
[23:13:09] Krenair: done
[23:13:13] (PS1) Dzahn: add url-downloader in codfw on alsafi [puppet] - https://gerrit.wikimedia.org/r/264205 (https://phabricator.wikimedia.org/T122134)
[23:13:18] oh, right, I already said that ;)
[23:13:21] :)
[23:13:46] hm
[23:13:52] only puppet swat is 21st?
[23:13:58] I was wondering that myself
[23:14:03] isn't there one on tuesday?
[23:14:10] * greg-g checks their team notes
[23:14:32] did you copy from last week? might've been specifically removed from there..
[23:14:49] yeah, I just copy/pasta
[23:14:53] https://office.wikimedia.org/wiki/Operations/Operations_Meeting_Notes/TechOps-2016-01-13#PuppetSWAT
[23:15:45] (PS1) Yuvipanda: tools: Log parent process cmdline for command invocations too [puppet] - https://gerrit.wikimedia.org/r/264206 (https://phabricator.wikimedia.org/T123444)
[23:16:13] yeah there was one week where we cancelled it
[23:17:25] YuviPanda: so, add one to tuesday?
[23:17:44] I can after I finish merging these couple of patches :D
[23:17:54] sure thing, that'd be solid
[23:18:03] feel free to change the people listed to be correct ;)
[23:18:06] or Krenair / you can too if you're already there, etc.
[23:18:08] sure
[23:18:10] will check when I'm done
[23:18:10] (PS1) Dzahn: add url-downloader-codfw service IP [dns] - https://gerrit.wikimedia.org/r/264208 (https://phabricator.wikimedia.org/T122134)
[23:18:26] (CR) jenkins-bot: [V: -1] add url-downloader-codfw service IP [dns] - https://gerrit.wikimedia.org/r/264208 (https://phabricator.wikimedia.org/T122134) (owner: Dzahn)
[23:18:28] (PS2) Dzahn: add url-downloader-codfw service IP [dns] - https://gerrit.wikimedia.org/r/264208 (https://phabricator.wikimedia.org/T122134)
[23:18:58] (CR) jenkins-bot: [V: -1] add url-downloader-codfw service IP [dns] - https://gerrit.wikimedia.org/r/264208 (https://phabricator.wikimedia.org/T122134) (owner: Dzahn)
[23:19:54] (PS2) Yuvipanda: tools: Log parent process cmdline for command invocations too [puppet] - https://gerrit.wikimedia.org/r/264206 (https://phabricator.wikimedia.org/T123444)
[23:20:25] (CR) Yuvipanda: [C: 2 V: 2] "Hopefully this lets us tell which ones are done interactively and which ones automated." [puppet] - https://gerrit.wikimedia.org/r/264206 (https://phabricator.wikimedia.org/T123444) (owner: Yuvipanda)
[23:21:39] operations, Mail: remove/update office@wikipedia.org alias - https://phabricator.wikimedia.org/T123669#1935777 (Dzahn) could you just confirm if the dphelps@ address is dead?
[23:24:53] operations, Mail: Move most (all?)
exim personal aliases to OIT - https://phabricator.wikimedia.org/T122144#1935783 (Dzahn)
[23:24:55] operations, Mail: remove exim alias oliver - https://phabricator.wikimedia.org/T123676#1935781 (Dzahn) Open>Resolved done -oliver: okeyes -
[23:34:24] (PS1) Madhuvishy: wikimetrics: Add mysql-client package dependency to production role [puppet] - https://gerrit.wikimedia.org/r/264210
[23:34:51] YuviPanda: ^ sorry for the trouble today
[23:35:01] (CR) Yuvipanda: [C: 2 V: 2] wikimetrics: Add mysql-client package dependency to production role [puppet] - https://gerrit.wikimedia.org/r/264210 (owner: Madhuvishy)
[23:35:47] madhuvishy: :D np! I'm happy we moved off the old code, and doubly so that I didn't have to do it :D
[23:42:50] (PS1) Andrew Bogott: Define special_hosts for labtest realm. [puppet] - https://gerrit.wikimedia.org/r/264213
[23:44:29] (PS2) Andrew Bogott: Define special_hosts for labtest realm. [puppet] - https://gerrit.wikimedia.org/r/264213
[23:46:12] (CR) Andrew Bogott: [C: 2] Define special_hosts for labtest realm. [puppet] - https://gerrit.wikimedia.org/r/264213 (owner: Andrew Bogott)
[23:49:38] !log restbase start deploy of dac31a8c
[23:49:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:56:15] !log restbase end deploy of dac31a8c
[23:56:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master