[00:01:30] (CR) Smalyshev: "Fixed a typo, but now I'm getting this trying to run it:" [puppet] - https://gerrit.wikimedia.org/r/223663 (owner: Smalyshev)
[00:09:04] operations, Reading-Admin, Traffic, HTTPS, and 2 others: TLS and *.wap/*.mobile multi-level subdomains of wikipedia.org - https://phabricator.wikimedia.org/T104942#1459240 (demon) Even more pro-tip: put it in a [[/paste/create | pastebin ]] and then embed it with `{P123}`. Like this: {P123} Then y...
[00:09:17] bblack: Protip ^ 2
[01:05:57] (PS5) Dzahn: ganglia_new: add aggregator setting for ULSFO [puppet] - https://gerrit.wikimedia.org/r/225111 (https://phabricator.wikimedia.org/T93776)
[01:06:36] (CR) Dzahn: [C: 2] ganglia_new: add aggregator setting for ULSFO [puppet] - https://gerrit.wikimedia.org/r/225111 (https://phabricator.wikimedia.org/T93776) (owner: Dzahn)
[01:08:47] (PS1) Dzahn: Revert "Revert "ulsfo mobile caches: switch to ganglia_new"" [puppet] - https://gerrit.wikimedia.org/r/225268
[01:09:00] (PS2) Dzahn: ulsfo mobile caches: switch to ganglia_new [puppet] - https://gerrit.wikimedia.org/r/225268
[01:17:15] (PS3) Dzahn: ulsfo mobile caches: switch to ganglia_new [puppet] - https://gerrit.wikimedia.org/r/225268
[01:19:22] (CR) Dzahn: [C: 2] "hopefully works now after adding the missing codfw aggregator config snippet" [puppet] - https://gerrit.wikimedia.org/r/225268 (owner: Dzahn)
[01:26:29] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet).
[01:29:35] sigh .. strontium
[01:30:36] !log git pull origin on strontium
[01:30:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:32:28] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[01:37:58] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[01:38:44] wut
[01:39:49] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge.
[01:42:41] yay, it works though. ganglia_new in ulsfo
[01:43:00] _joe|afk: ^ for when you see this later. we can switch them all :)
[01:45:06] operations, Monitoring, Patch-For-Review: remove ganglia(old), replace with ganglia_new - https://phabricator.wikimedia.org/T93776#1459414 (Dzahn) ^ that was the issue for ULSFO. I was able to switch the cluster "mobile caches ulsfo" successfully now :)
[01:46:33] (CR) Alexandros Kosiaris: Add role::mediawiki_vagrant_lxc (2 comments) [puppet] - https://gerrit.wikimedia.org/r/193665 (https://phabricator.wikimedia.org/T90892) (owner: BryanDavis)
[01:55:05] (PS1) Alexandros Kosiaris: Fix errors introduced in I643edcd3576b8d92b [dns] - https://gerrit.wikimedia.org/r/225271
[01:57:32] (CR) Alexandros Kosiaris: [C: 2] Fix errors introduced in I643edcd3576b8d92b [dns] - https://gerrit.wikimedia.org/r/225271 (owner: Alexandros Kosiaris)
[01:59:59] (PS1) Dzahn: lvs ulsfo: switch to ganglia_new [puppet] - https://gerrit.wikimedia.org/r/225272
[02:00:01] (PS1) Dzahn: ulsfo text caches: switch to ganglia_new [puppet] - https://gerrit.wikimedia.org/r/225273
[02:00:03] (PS1) Dzahn: ulsfo bits caches: switch to ganglia_new [puppet] - https://gerrit.wikimedia.org/r/225274
[02:00:05] (PS1) Dzahn: ulsfo upload caches: switch to ganglia_new [puppet] - https://gerrit.wikimedia.org/r/225275
[02:00:36] (PS2) Dzahn: ulsfo lvs: switch to ganglia_new [puppet] - https://gerrit.wikimedia.org/r/225272
[02:02:39] (PS1) Dzahn: ganglia_new: switch ULSFO to new by default [puppet] - https://gerrit.wikimedia.org/r/225276
[02:03:12] !log LocalisationUpdate failed (1.26wmf14) at 2015-07-17 02:03:12+00:00
[02:03:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:04:04] (CR) Dzahn: [C: 2] ulsfo lvs: switch to ganglia_new [puppet] - https://gerrit.wikimedia.org/r/225272 (owner: Dzahn)
[02:07:23] !log LocalisationUpdate ResourceLoader cache refresh completed at Fri Jul 17 02:07:22 UTC 2015 (duration 7m 20s)
[02:07:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:08:06] (PS2) Dzahn: ulsfo text caches: switch to ganglia_new [puppet] - https://gerrit.wikimedia.org/r/225273
[02:09:36] (CR) Dzahn: [C: 2] ulsfo text caches: switch to ganglia_new [puppet] - https://gerrit.wikimedia.org/r/225273 (owner: Dzahn)
[02:13:49] PROBLEM - puppet last run on palladium is CRITICAL puppet fail
[02:14:27] ^ not a real fail, just me who interrupted a run
[02:15:39] RECOVERY - puppet last run on palladium is OK Puppet is currently enabled, last run 38 seconds ago with 0 failures
[02:17:01] (PS2) Dzahn: ulsfo bits caches: switch to ganglia_new [puppet] - https://gerrit.wikimedia.org/r/225274
[02:17:08] (CR) Dzahn: [C: 2] ulsfo bits caches: switch to ganglia_new [puppet] - https://gerrit.wikimedia.org/r/225274 (owner: Dzahn)
[02:17:32] (PS2) Dzahn: ulsfo upload caches: switch to ganglia_new [puppet] - https://gerrit.wikimedia.org/r/225275
[02:22:49] (CR) Dzahn: [C: 2] ulsfo upload caches: switch to ganglia_new [puppet] - https://gerrit.wikimedia.org/r/225275 (owner: Dzahn)
[02:26:49] !log l10nupdate Synchronized php-1.26wmf14/cache/l10n: (no message) (duration: 05m 55s)
[02:26:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:30:04] !log LocalisationUpdate completed (1.26wmf14) at 2015-07-17 02:30:03+00:00
[02:30:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:34:17] operations, hardware-requests: server for wikimania video transcoding - https://phabricator.wikimedia.org/T106112#1459481 (RobH) NEW a:Katherine-WMF
[02:35:24] (PS2) Dzahn: ganglia_new: switch ULSFO to new by default [puppet] - https://gerrit.wikimedia.org/r/225276
[02:35:42] operations, hardware-requests: server for wikimania video transcoding - https://phabricator.wikimedia.org/T106112#1459492 (RobH) a:Katherine-WMF>RobH
[02:38:40] operations, hardware-requests: server for wikimania video transcoding - https://phabricator.wikimedia.org/T106112#1459499 (RobH) A spare system with this much space is available, in that I have a spare R510 @ eqiad that could be used. (It has 12*2tb disks, so it should be the best option to accommodate th...
[02:40:28] operations, hardware-requests: server for wikimania video transcoding - https://phabricator.wikimedia.org/T106112#1459503 (RobH) @yuvipanda: if you know the answer to the above questions, please let me know.
[02:42:29] operations, hardware-requests: server for wikimania video transcoding - https://phabricator.wikimedia.org/T106112#1459507 (Dzahn) This is the ticket from last year. -> T84465 which is still stalled. (and T84439, T84467) problems were: - file format not supported (https://commons.wikimedia.org/wiki/Comm...
[02:42:55] operations, hardware-requests: server for wikimania video transcoding - https://phabricator.wikimedia.org/T106112#1459511 (RobH)
[02:44:22] (PS3) Dzahn: ganglia_new: switch ULSFO to new by default [puppet] - https://gerrit.wikimedia.org/r/225276
[02:45:40] operations, Monitoring, Patch-For-Review: remove ganglia(old), replace with ganglia_new - https://phabricator.wikimedia.org/T93776#1459526 (Dzahn) merged: https://gerrit.wikimedia.org/r/#/c/225268/ https://gerrit.wikimedia.org/r/#/c/225272/ https://gerrit.wikimedia.org/r/#/c/225273/ https://gerrit.wiki...
[03:04:18] PROBLEM - Disk space on maps-test2002 is CRITICAL: Connection refused by host
[03:04:50] PROBLEM - RAID on maps-test2002 is CRITICAL: Connection refused by host
[03:05:29] PROBLEM - configured eth on maps-test2002 is CRITICAL: Connection refused by host
[03:05:48] PROBLEM - dhclient process on maps-test2002 is CRITICAL: Connection refused by host
[03:05:49] PROBLEM - puppet last run on maps-test2002 is CRITICAL: Connection refused by host
[03:05:59] PROBLEM - salt-minion processes on maps-test2002 is CRITICAL: Connection refused by host
[03:07:18] PROBLEM - DPKG on maps-test2002 is CRITICAL: Connection refused by host
[03:10:59] RECOVERY - DPKG on maps-test2002 is OK: All packages OK
[03:11:09] RECOVERY - configured eth on maps-test2002 is OK - interfaces up
[03:11:28] RECOVERY - dhclient process on maps-test2002 is OK: PROCS OK: 0 processes with command name dhclient
[03:11:29] RECOVERY - puppet last run on maps-test2002 is OK Puppet is currently enabled, last run 3 seconds ago with 0 failures
[03:11:39] RECOVERY - salt-minion processes on maps-test2002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[03:11:49] RECOVERY - Disk space on maps-test2002 is OK: DISK OK
[03:12:28] RECOVERY - RAID on maps-test2002 is OK no RAID installed
[03:30:05] (PS1) Springle: repool db1030 [mediawiki-config] - https://gerrit.wikimedia.org/r/225279
[03:30:41] (CR) Springle: [C: 2] repool db1030 [mediawiki-config] - https://gerrit.wikimedia.org/r/225279 (owner: Springle)
[03:30:47] (Merged) jenkins-bot: repool db1030 [mediawiki-config] - https://gerrit.wikimedia.org/r/225279 (owner: Springle)
[03:31:52] !log springle Synchronized wmf-config/db-eqiad.php: repool db1030 (duration: 00m 12s)
[03:31:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:37:35] (PS15) BryanDavis: Add role::mediawiki_vagrant_lxc [puppet] - https://gerrit.wikimedia.org/r/193665 (https://phabricator.wikimedia.org/T90892)
[03:40:17] operations, hardware-requests: server for wikimania video transcoding - https://phabricator.wikimedia.org/T106112#1459598 (Matanya) regarding the usage, is temp space for transcoding. as for the actual way of getting the drive, i'd like @victorgrigas to comment.
[03:58:28] RECOVERY - citoid endpoints health on sca1002 is OK: All endpoints are healthy
[04:04:09] PROBLEM - citoid endpoints health on sca1002 is CRITICAL: /api is CRITICAL: Test bad PMCID returned the unexpected status 200 (expecting: 404)
[04:24:24] (PS1) Ori.livneh: Remove obsolete VCL code for setting X-Analytics: https=1 [puppet] - https://gerrit.wikimedia.org/r/225280
[04:28:34] (PS1) Ori.livneh: Rename 'cookie_munging' VCL subroutine to 'stash_cookie' [puppet] - https://gerrit.wikimedia.org/r/225281
[04:29:01] (PS16) BryanDavis: Add role::mediawiki_vagrant_lxc [puppet] - https://gerrit.wikimedia.org/r/193665 (https://phabricator.wikimedia.org/T90892)
[04:30:14] (PS17) BryanDavis: Add role::mediawiki_vagrant_lxc [puppet] - https://gerrit.wikimedia.org/r/193665 (https://phabricator.wikimedia.org/T90892)
[04:33:18] (CR) BBlack: [C: -1] "We should wait a bit longer first. Right now we still do have HTTP traffic in various corner cases, which we might want to analyze as it'" [puppet] - https://gerrit.wikimedia.org/r/225280 (owner: Ori.livneh)
[04:53:56] !log LocalisationUpdate ResourceLoader cache refresh completed at Fri Jul 17 04:53:56 UTC 2015 (duration 53m 55s)
[04:54:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[05:06:14] (CR) BryanDavis: [C: -1] "vagrant-lxc package is not showing up as an installed plugin for vagrant. :(" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/193665 (https://phabricator.wikimedia.org/T90892) (owner: BryanDavis)
[05:15:20] operations, Traffic, HTTPS: implement Public Key Pinning (HPKP) for Wikimedia domains - https://phabricator.wikimedia.org/T92002#1459665 (BBlack) Just thinking out loud: while we're working through all the more-complex issues about doing this "right", should we consider an interim solution that provide...
[05:31:32] operations, Analytics, Traffic: Provide summary of MediaWiki downloads - https://phabricator.wikimedia.org/T104010#1459671 (Legoktm) >>! In T104010#1456863, @MarkAHershberger wrote: > I got Kevin Luduc here at Wikimania to get us the information we needed. > > Thanks! Can these be published?
[05:44:48] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 30.77% of data above the critical threshold [500.0]
[05:59:38] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[06:26:27] (PS1) Giuseppe Lavagetto: imagescalers: re-image mw115[6-8] to trusty, HHVM [puppet] - https://gerrit.wikimedia.org/r/225285 (https://phabricator.wikimedia.org/T84842)
[06:27:17] (CR) Giuseppe Lavagetto: [C: 2] "I'll complete the transition early next week (maybe on monday)." [puppet] - https://gerrit.wikimedia.org/r/225285 (https://phabricator.wikimedia.org/T84842) (owner: Giuseppe Lavagetto)
[06:29:50] PROBLEM - puppet last run on db1028 is CRITICAL puppet fail
[06:30:09] PROBLEM - puppet last run on mw1060 is CRITICAL Puppet has 1 failures
[06:30:59] PROBLEM - puppet last run on cp2013 is CRITICAL Puppet has 2 failures
[06:31:08] PROBLEM - puppet last run on cp1068 is CRITICAL Puppet has 1 failures
[06:31:19] PROBLEM - puppet last run on db2055 is CRITICAL Puppet has 1 failures
[06:31:20] PROBLEM - puppet last run on subra is CRITICAL Puppet has 1 failures
[06:31:49] PROBLEM - puppet last run on mw2023 is CRITICAL Puppet has 1 failures
[06:31:59] PROBLEM - puppet last run on mw1110 is CRITICAL Puppet has 1 failures
[06:31:59] PROBLEM - puppet last run on mw2158 is CRITICAL Puppet has 1 failures
[06:32:09] PROBLEM - puppet last run on mw1061 is CRITICAL Puppet has 1 failures
[06:32:49] PROBLEM - puppet last run on wtp2017 is CRITICAL Puppet has 1 failures
[06:32:58] PROBLEM - puppet last run on cp4016 is CRITICAL Puppet has 1 failures
[06:33:28] PROBLEM - puppet last run on mw2126 is CRITICAL Puppet has 1 failures
[06:35:39] (CR) Muehlenhoff: "The argument parsing is incorrect, the current version of check_conntrack doesn't use getopt/argparse-style option parsing. With the way " [puppet] - https://gerrit.wikimedia.org/r/223560 (owner: Matanya)
[06:35:51] _joe_: hi
[06:51:39] (PS1) Muehlenhoff: The argument passing to check_conntrack is incorrect, it doesn't provide getopt-style argument parsing and with the currently passed arguments the script only displays the usage information. [puppet] - https://gerrit.wikimedia.org/r/225286 (https://phabricator.wikimedia.org/T105154)
[06:55:39] RECOVERY - puppet last run on subra is OK Puppet is currently enabled, last run 12 seconds ago with 0 failures
[06:56:09] RECOVERY - puppet last run on mw2023 is OK Puppet is currently enabled, last run 33 seconds ago with 0 failures
[06:56:18] RECOVERY - puppet last run on mw1110 is OK Puppet is currently enabled, last run 19 seconds ago with 0 failures
[06:56:18] RECOVERY - puppet last run on mw2158 is OK Puppet is currently enabled, last run 7 seconds ago with 0 failures
[06:56:19] RECOVERY - puppet last run on mw1060 is OK Puppet is currently enabled, last run 22 seconds ago with 0 failures
[06:56:35] (CR) Muehlenhoff: [C: 2 V: 2] The argument passing to check_conntrack is incorrect, it doesn't provide getopt-style argument parsing and with the currently passed argumen [puppet] - https://gerrit.wikimedia.org/r/225286 (https://phabricator.wikimedia.org/T105154) (owner: Muehlenhoff)
[06:57:09] RECOVERY - puppet last run on wtp2017 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:09] RECOVERY - puppet last run on cp2013 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:10] RECOVERY - puppet last run on cp1068 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:10] RECOVERY - puppet last run on cp4016 is OK Puppet is currently enabled, last run 37 seconds ago with 0 failures
[06:57:29] RECOVERY - puppet last run on db2055 is OK Puppet is currently enabled, last run 38 seconds ago with 0 failures
[06:57:49] RECOVERY - puppet last run on mw2126 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:59] RECOVERY - puppet last run on db1028 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:19] RECOVERY - puppet last run on mw1061 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:05:55] <_joe_> SMalyshev: hi
[07:06:18] <_joe_> SMalyshev: I have amended the patch, not a completed work though, but I have one big question
[07:06:29] <_joe_> SMalyshev: why nginx access logs to logstash?
[07:06:38] <_joe_> I'd send the error logs instead
[07:07:00] _joe_: well, we need to put the access logs somewhere. so I asked and I was told logstash is the best place
[07:07:13] <_joe_> who told you that? :)
[07:07:18] _joe_: we can send error logs too but there aren't that much of them
[07:07:36] <_joe_> I'd send access logs locally, and errors to logstash, but we can settle that down later
[07:07:59] <_joe_> SMalyshev: apart from that, I have some more polishing to do and in particular to install the server
[07:08:02] _joe_: hmm.... not sure I remember, there was a number of people I talked about it. What better option is there?
[07:08:18] <_joe_> SMalyshev: I'd keep the access logs locally
[07:08:26] _joe_: I tried to run that config (after fixing one typo) and I got this: Error: Could not retrieve catalog from remote server: Could not intern from text/pson: Could not intern from data: Could not find relationship target "File[]"
[07:08:26] <_joe_> but that's just my 2 cents
[07:08:43] _joe_: if you keep them locally how they could be viewed/analyzed?
[07:08:48] <_joe_> SMalyshev: yes it needs some fixing
[07:09:12] <_joe_> SMalyshev: for analytics purposes, they should not be on logstash, but we can speak with analytics people
[07:09:17] _joe_: ah, ok, because I have no idea what that error means
[07:09:44] _joe_: I spoke to analytics people, but that didn't get me far since labs didn't have analytics infrastructure
[07:09:48] <_joe_> SMalyshev: it means I put a variable somewhere that was unassigned for some reason
[07:10:11] _joe_: ah, I guessed so but I couldn't find which variable that was
[07:10:31] <_joe_> SMalyshev: also, I wanted to do some more refactoring
[07:10:47] <_joe_> but in general it was a very good patch :)
[07:11:45] <_joe_> I would, if you don't mind, rename "wdqs::gui" => "wdqs::web"
[07:13:26] _joe_: why "web"?
[07:13:43] <_joe_> well, it installs the webserver and a virtual host
[07:14:04] <_joe_> but it's just that I found that a bit confusing
[07:14:29] _joe_: if you want fancy name, maybe "frontend" or something? "web" sounds confusing to me too a bit
[07:14:45] <_joe_> ok, frontend is not confusing to me either :)
[07:18:23] (PS1) Alex Monk: Get rid of default=wikipedia assumptions in config [mediawiki-config] - https://gerrit.wikimedia.org/r/225287 (https://phabricator.wikimedia.org/T104088)
[07:24:50] operations, HHVM, Tracking: Complete the use of HHVM over Zend PHP on the Wikimedia cluster (tracking) - https://phabricator.wikimedia.org/T86081#1459753 (Krenair)
[07:24:52] operations, Tracking: Upgrade Wikimedia servers to Ubuntu Trusty (14.04) (tracking) - https://phabricator.wikimedia.org/T65899#1459756 (Krenair)
[07:24:58] operations, MediaWiki-extensions-TimedMediaHandler, Multimedia: Support VP9 in TMH (Unable to decode) - https://phabricator.wikimedia.org/T55863#1459757 (Krenair)
[07:25:03] Blocked-on-Operations, operations, Commons, Multimedia, and 5 others: Convert eqiad imagescalers to HHVM, Trusty - https://phabricator.wikimedia.org/T84842#1459751 (Krenair) stalled>Open
[07:31:56] operations, CirrusSearch, Discovery, Discovery-Cirrus-Sprint: Update Elasticsearch to 1.6.1 or 1.7.0 - https://phabricator.wikimedia.org/T106090#1459786 (MoritzMuehlenhoff) There's also a new round of security fixes for Java; the OpenJDK updates will probably be available beginning of next week, so...
[07:36:19] operations: Update Elasticsearch on logstash* - https://phabricator.wikimedia.org/T106126#1459787 (MoritzMuehlenhoff) NEW
[07:37:00] operations, Availability, Performance, Wikimedia-log-errors: Memcached error for key "WANCache:v:enwiki:image_redirect:254363f3d14af58bbe12c644ee69ccf7" on server "/var/run/nutcracker/nutcracker.sock:0": A TIMEOUT OCCURRED - https://phabricator.wikimedia.org/T102916#1459794 (Nemo_bis)
[07:45:42] https://wikitech.wikimedia.org/wiki/Talk:Incident_documentation/20150205-SiteOutage
[07:51:30] <_joe_> !log depooled mw1156,7 for reimaging
[07:51:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:52:17] <_joe_> Nemo_bis: the only remaining ticket from that is
[07:52:44] <_joe_> https://phabricator.wikimedia.org/T88730
[07:52:47] Nemo_bis, replied
[07:52:49] <_joe_> assigned to yours truly
[07:56:14] operations, Commons, MediaWiki-File-management, MediaWiki-Tarball-Backports, and 7 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1459837 (Tau) >>! In T102566#1439571, @Tgr wrote: >>>! In T102566#1434294, @Tau wrote: >> Still nothing ... Any further rec...
[08:09:17] operations, Labs: upgrade salt to 2015.5 - https://phabricator.wikimedia.org/T106074#1459870 (Joe) I Strongly object to upgrading without a thorough evaluation, we upgraded to 2014.7 for similar reasons and look where it got us. I'm pretty sure 2015.5 has its own bunch of problems. We should first pin dow...
[08:09:24] operations, Labs: upgrade salt to 2015.5 - https://phabricator.wikimedia.org/T106074#1459871 (Joe) Open>declined
[08:10:50] operations, Traffic, HTTPS, Patch-For-Review: Insecure POST traffic - https://phabricator.wikimedia.org/T105794#1459879 (Joe) p:Triage>Normal
[08:11:41] operations, Labs: upgrade salt to 2015.5 - https://phabricator.wikimedia.org/T106074#1459882 (MoritzMuehlenhoff) > We should first pin down what problems do we have, maybe work in order to even the zeromq versions across the cluster, I created https://phabricator.wikimedia.org/T106093 for that yesterday.
[08:11:59] Krenair: thanks; your signature is annoying, btw
[08:12:06] operations, Citoid, Services: Citoid returns 200 for inexistent PMCIDs - https://phabricator.wikimedia.org/T106044#1459883 (Joe) a:Joe
[08:24:27] operations, Citoid, Services: Citoid returns 200 for inexistent PMCIDs - https://phabricator.wikimedia.org/T106044#1459893 (Joe) Ok this keeps getting more mysterious then. Requesting this resource using our dedicated proxy seems to DTRT From my computer: ``` ~$ curl http://www.ncbi.nlm.nih.gov/pmc/a...
[08:25:58] PROBLEM - Restbase endpoints health on restbase1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:26:35] <_joe_> rb dead on rb1004?
[08:27:48] RECOVERY - Restbase endpoints health on restbase1004 is OK: All endpoints are healthy
[08:27:57] operations, Incident-20150205-SiteOutage, Availability: Nutcracker needs to automatically recover from MC failure - rebalancing issues - https://phabricator.wikimedia.org/T88730#1459895 (Nemo_bis)
[08:28:48] operations, Incident-20150205-SiteOutage, Availability: Nutcracker needs to automatically recover from MC failure - rebalancing issues - https://phabricator.wikimedia.org/T88730#1459899 (Joe) @ori
[08:34:44] operations, Citoid, Services: Citoid returns 200 for inexistent PMCIDs - https://phabricator.wikimedia.org/T106044#1459900 (mobrovac) Debug info from `sca1001`: ``` {"name":"citoid","hostname":"sca1001","pid":5623,"level":20,"from":"PMC9999999","to":"http://www.ncbi.nlm.nih.gov/pmc/articles/PMC9999999...
[08:35:26] did someone restart rb on rb1004?
[08:35:35] _joe_: ^^ ?
[08:35:59] <_joe_> mobrovac: I most surely didn't
[08:36:45] kk
[08:44:03] operations, Citoid, Services: Citoid returns 200 for inexistent PMCIDs - https://phabricator.wikimedia.org/T106044#1459906 (mobrovac) The Citoid SHA1 hash in the deploy repo (and the deployed version on `sca100x`) matches the source repo SHA1, so we are running the latest version, and the requests from...
[08:47:03] operations, Citoid, Services: Citoid returns 200 for inexistent PMCIDs - https://phabricator.wikimedia.org/T106044#1459909 (Joe) So, the difference seem to be in the "successfully scraped resource", which has no reason to be given the response we get from the proxy
[08:58:13] (PS2) Filippo Giunchedi: eliminate redundant threshold alerts [puppet] - https://gerrit.wikimedia.org/r/225184 (owner: Eevans)
[08:58:20] (CR) Filippo Giunchedi: [C: 2 V: 2] eliminate redundant threshold alerts [puppet] - https://gerrit.wikimedia.org/r/225184 (owner: Eevans)
[09:04:10] operations, Citoid, Services: Citoid returns 200 for inexistent PMCIDs - https://phabricator.wikimedia.org/T106044#1459915 (Joe) FWIW, I checked the url-downloader access logs and they clearly show the resource returns 404.
[09:04:15] (PS1) Nemo bis: Add some more redis monitoring metrics to ganglia [puppet] - https://gerrit.wikimedia.org/r/225292
[09:08:26] <_joe_> !log depooling mw1158, repooling mw1156,7
[09:08:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:38:59] operations, Commons, MediaWiki-File-management, MediaWiki-Tarball-Backports, and 7 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1459959 (Tgr) [[ https://www.mediawiki.org/wiki/Manual:$wgDebugLogFile | Manual:$wgDebugLogFile ]] recommends checking `ope...
[09:50:27] (CR) Filippo Giunchedi: "looks good, I've prefixed pedantic/nitpick comments with ~ so feel free to ignore or postpone those!" (11 comments) [tools/scap] - https://gerrit.wikimedia.org/r/224374 (owner: Thcipriani)
[10:02:42] operations, Beta-Cluster, Patch-For-Review: deployment-bastion fails puppet because some classes were moved from nodes to role class - https://phabricator.wikimedia.org/T106003#1459979 (hashar) Open>Resolved Beta cluster puppetmaster rebased just fine and puppet pass on deployment-bastion. Danke...
[10:04:36] PROBLEM - dhclient process on mw1158 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[10:05:08] PROBLEM - nutcracker port on mw1158 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[10:05:31] oh my god systemd
[10:05:36] PROBLEM - nutcracker process on mw1158 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[10:05:42] operations, Citoid, Services: Citoid returns 200 for inexistent PMCIDs - https://phabricator.wikimedia.org/T106044#1459983 (mobrovac) Ok, I finally managed to track it down - NCBI has blacklisted us apparently: ``` mobrovac@sca1001:~$ curl -v -H'User-Agent: WikimediaBot' -x http://url-downloader.wikim...
[10:05:47] PROBLEM - puppet last run on mw1158 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[10:05:47] PROBLEM - DPKG on mw1158 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[10:05:57] PROBLEM - Disk space on mw1158 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[10:05:57] PROBLEM - salt-minion processes on mw1158 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[10:06:16] PROBLEM - HHVM processes on mw1158 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[10:06:20] bad bad mw1158 bad, *hits with newspaper*
[10:06:58] PROBLEM - RAID on mw1158 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[10:07:26] PROBLEM - configured eth on mw1158 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[10:13:26] RECOVERY - salt-minion processes on mw1158 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:13:26] RECOVERY - Disk space on mw1158 is OK: DISK OK
[10:13:46] RECOVERY - HHVM processes on mw1158 is OK: PROCS OK: 6 processes with command name hhvm
[10:13:57] RECOVERY - dhclient process on mw1158 is OK: PROCS OK: 0 processes with command name dhclient
[10:14:36] RECOVERY - nutcracker port on mw1158 is OK: TCP OK - 0.000 second response time on port 11212
[10:14:37] RECOVERY - RAID on mw1158 is OK no RAID installed
[10:14:56] RECOVERY - nutcracker process on mw1158 is OK: PROCS OK: 1 process with UID = 109 (nutcracker), command name nutcracker
[10:14:57] RECOVERY - configured eth on mw1158 is OK - interfaces up
[10:15:07] RECOVERY - DPKG on mw1158 is OK: All packages OK
[10:15:07] PROBLEM - puppet last run on mw1158 is CRITICAL Puppet has 6 failures
[10:16:26] PROBLEM - puppet last run on eventlog1001 is CRITICAL puppet fail
[10:18:57] RECOVERY - puppet last run on mw1158 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:19:48] operations, Database, Patch-For-Review: db1022 duplicate key errors - https://phabricator.wikimedia.org/T105879#1459995 (jcrespo) @springle db1022 was not exported/imported- it was physically backed up and restarted.
[10:21:52] <_joe_> !log repooling mw1158
[10:21:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:23:08] ACKNOWLEDGEMENT - citoid endpoints health on sca1002 is CRITICAL: /api is CRITICAL: Test bad PMCID returned the unexpected status 200 (expecting: 404) Giuseppe Lavagetto T106044
[10:26:36] RECOVERY - citoid endpoints health on sca1002 is OK: All endpoints are healthy
[10:26:46] RECOVERY - citoid endpoints health on sca1001 is OK: All endpoints are healthy
[10:28:37] operations, RESTBase-Cassandra: stricter permissions on cassandra data dir - https://phabricator.wikimedia.org/T106133#1460004 (fgiunchedi) NEW a:fgiunchedi
[10:29:48] (PS1) Filippo Giunchedi: cassandra: restrict data directory permissions [puppet] - https://gerrit.wikimedia.org/r/225300 (https://phabricator.wikimedia.org/T106133)
[10:42:57] RECOVERY - puppet last run on eventlog1001 is OK Puppet is currently enabled, last run 49 seconds ago with 0 failures
[10:44:23] (CR) Mobrovac: [C: -1] cassandra: restrict data directory permissions (3 comments) [puppet] - https://gerrit.wikimedia.org/r/225300 (https://phabricator.wikimedia.org/T106133) (owner: Filippo Giunchedi)
[10:47:12] operations, Citoid, Services: Citoid returns 200 for inexistent PMCIDs - https://phabricator.wikimedia.org/T106044#1460027 (Joe) From their abuse page: "please have your system administrator contact info@ncbi.nlm.nih.gov". I think we can reach out to them. Any relevant link to explain what we are doi...
[10:47:25] (PS2) Filippo Giunchedi: cassandra: restrict data directory permissions [puppet] - https://gerrit.wikimedia.org/r/225300 (https://phabricator.wikimedia.org/T106133)
[10:47:46] PROBLEM - SSH on labnodepool1001 is CRITICAL: Server answer
[10:48:31] (CR) Filippo Giunchedi: cassandra: restrict data directory permissions (1 comment) [puppet] - https://gerrit.wikimedia.org/r/225300 (https://phabricator.wikimedia.org/T106133) (owner: Filippo Giunchedi)
[10:54:47] operations, Citoid, Services: Citoid returns 200 for inexistent PMCIDs - https://phabricator.wikimedia.org/T106044#1460028 (mobrovac) >>! In T106044#1460027, @Joe wrote: > From their abuse page: > > "please have your system administrator contact info@ncbi.nlm.nih.gov". > > I think we can reach out to...
[10:59:27] operations, Citoid, Services: Citoid returns 200 for inexistent PMCIDs - https://phabricator.wikimedia.org/T106044#1460034 (Joe) From http://www.ncbi.nlm.nih.gov/robots.txt ``` User-agent: * Crawl-delay: 5 [SNIP] Disallow: /pmc/articles/ ``` Can't say I blame them for blocking us.
[11:01:14] operations, Traffic, HTTPS: implement Public Key Pinning (HPKP) for Wikimedia domains - https://phabricator.wikimedia.org/T92002#1460036 (Chmarkine) How about doing "report-only" first with a longer max-age, like 7 days?
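Editorial note on the robots.txt excerpt quoted at [10:59:27]: the following is a minimal Python sketch, not anything used by Citoid or by the participants, of how a client could test a URL against such rules before fetching it, using the standard-library `urllib.robotparser`. It assumes only the three directives visible in the quoted excerpt (the portion elided as [SNIP] is not reconstructed). The user agent is the one from the curl command quoted at [10:05:42]; the helper names `load_rules` and `may_fetch` and the test URL are illustrative.

```python
from urllib.robotparser import RobotFileParser

# Only the directives visible in the quoted excerpt; the [SNIP] part is unknown.
ROBOTS_EXCERPT = """\
User-agent: *
Crawl-delay: 5
Disallow: /pmc/articles/
"""


def load_rules(text):
    """Parse robots.txt text into a RobotFileParser without network access."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def may_fetch(parser, user_agent, url):
    """Return True if the parsed rules allow user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = load_rules(ROBOTS_EXCERPT)
    agent = "WikimediaBot"  # user agent seen in the curl command in the log
    # Illustrative URL built from the PMC test ID that appears in the log.
    url = "http://www.ncbi.nlm.nih.gov/pmc/articles/PMC9999999"
    print("allowed:", may_fetch(rules, agent, url))
    print("crawl delay (s):", rules.crawl_delay(agent))
```

Because the excerpt's rules sit under `User-agent: *`, both the `Disallow` prefix and the 5-second `Crawl-delay` apply to any agent that has no more specific entry, which is consistent with the remark that the block is understandable.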
[11:02:47] RECOVERY - Disk space on labnodepool1001 is OK: DISK OK
[11:02:58] PROBLEM - citoid endpoints health on sca1002 is CRITICAL: /api is CRITICAL: Test bad PMCID returned the unexpected status 200 (expecting: 404)
[11:03:07] PROBLEM - citoid endpoints health on sca1001 is CRITICAL: /api is CRITICAL: Test bad PMCID returned the unexpected status 200 (expecting: 404)
[11:06:57] RECOVERY - SSH on labnodepool1001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[11:10:13] (CR) Mobrovac: [C: 1] cassandra: restrict data directory permissions [puppet] - https://gerrit.wikimedia.org/r/225300 (https://phabricator.wikimedia.org/T106133) (owner: Filippo Giunchedi)
[11:18:19] operations, Citoid, Services: Citoid is blacklisted from ncbi.nlm.nih.gov - https://phabricator.wikimedia.org/T106044#1460060 (Joe) p:High>Unbreak!
[11:22:51] so I screwed labnodepool1001 by deleting the whole /dev :/
[11:24:05] !log rebooted labnodepool1001.eqiad.wmnet . Accidentally deleted the whole /dev which freeze everything :(
[11:24:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:29:31] operations, Commons, MediaWiki-File-management, MediaWiki-Tarball-Backports, and 7 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1460075 (Tau) In both php.ini files the ";open_basedir= " (blank). Is it okay? Is this string set anywhere else too in addi...
[11:38:06] RECOVERY - citoid endpoints health on sca1002 is OK: All endpoints are healthy [11:38:06] RECOVERY - citoid endpoints health on sca1001 is OK: All endpoints are healthy [11:42:05] I have disabled puppet on labnodepool / gotta hack [11:51:26] PROBLEM - citoid endpoints health on sca1002 is CRITICAL: /api is CRITICAL: Test bad PMCID returned the unexpected status 200 (expecting: 404) [11:51:27] PROBLEM - citoid endpoints health on sca1001 is CRITICAL: /api is CRITICAL: Test bad PMCID returned the unexpected status 200 (expecting: 404) [12:01:26] RECOVERY - citoid endpoints health on sca1001 is OK: All endpoints are healthy [12:01:30] 6operations, 10CirrusSearch, 6Discovery, 3Discovery-Cirrus-Sprint: Update Elasticsearch to 1.6.1 or 1.7. 0 - https://phabricator.wikimedia.org/T106090#1460094 (10Manybubbles) Cool. You can upgrade java anytime you like so long as its still a 1.7. If 1.8 is in apt and not a mess we can validate cirrus again... [12:03:16] RECOVERY - citoid endpoints health on sca1002 is OK: All endpoints are healthy [12:07:07] PROBLEM - citoid endpoints health on sca1001 is CRITICAL: /api is CRITICAL: Test bad PMCID returned the unexpected status 200 (expecting: 404) [12:08:58] PROBLEM - citoid endpoints health on sca1002 is CRITICAL: /api is CRITICAL: Test bad PMCID returned the unexpected status 200 (expecting: 404) [12:10:57] RECOVERY - citoid endpoints health on sca1002 is OK: All endpoints are healthy [12:11:51] 6operations, 10CirrusSearch, 6Discovery, 3Discovery-Cirrus-Sprint: Update Elasticsearch to 1.6.1 or 1.7. 
0 - https://phabricator.wikimedia.org/T106090#1460104 (10MoritzMuehlenhoff) We don't have Java 1.8 in Debian jessie yet, so this will be the latest security bugfix release for 1.7 as packaged by OpenJDK 7 [12:16:48] RECOVERY - citoid endpoints health on sca1001 is OK: All endpoints are healthy [12:20:51] (03PS3) 10Hashar: nodepool: systemd wrapper [puppet] - 10https://gerrit.wikimedia.org/r/224102 (https://phabricator.wikimedia.org/T96867) [12:20:57] (03CR) 10jenkins-bot: [V: 04-1] nodepool: systemd wrapper [puppet] - 10https://gerrit.wikimedia.org/r/224102 (https://phabricator.wikimedia.org/T96867) (owner: 10Hashar) [12:21:41] (03CR) 10Hashar: "The service is no more daemonizing itself. Instead takes advantage of systemd to track the process." [puppet] - 10https://gerrit.wikimedia.org/r/224102 (https://phabricator.wikimedia.org/T96867) (owner: 10Hashar) [12:24:36] PROBLEM - Restbase endpoints health on restbase1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:25:38] PROBLEM - Restbase root url on restbase1004 is CRITICAL - Socket timeout after 10 seconds [12:26:28] (03PS3) 10Hashar: nodepool: setup python logger [puppet] - 10https://gerrit.wikimedia.org/r/224106 [12:26:34] (03CR) 10jenkins-bot: [V: 04-1] nodepool: setup python logger [puppet] - 10https://gerrit.wikimedia.org/r/224106 (owner: 10Hashar) [12:26:41] (03CR) 10Hashar: "also log 'apscheduler'" [puppet] - 10https://gerrit.wikimedia.org/r/224106 (owner: 10Hashar) [12:27:54] * mobrovac on rb on rb1004 [12:30:16] PROBLEM - citoid endpoints health on sca1002 is CRITICAL: /api is CRITICAL: Test bad PMCID returned the unexpected status 200 (expecting: 404) [12:30:18] RECOVERY - Restbase endpoints health on restbase1004 is OK: All endpoints are healthy [12:30:26] PROBLEM - citoid endpoints health on sca1001 is CRITICAL: /api is CRITICAL: Test bad PMCID returned the unexpected status 200 (expecting: 404) [12:31:28] RECOVERY - Restbase root url on restbase1004 is OK: HTTP OK: HTTP/1.1 200 - 15149 
bytes in 0.008 second response time [12:32:07] PROBLEM - Restbase root url on restbase1005 is CRITICAL - Socket timeout after 10 seconds [12:32:23] oh come on [12:32:36] PROBLEM - Restbase endpoints health on restbase1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:33:56] RECOVERY - Restbase root url on restbase1005 is OK: HTTP OK: HTTP/1.1 200 - 15149 bytes in 0.007 second response time [12:34:17] RECOVERY - Restbase endpoints health on restbase1005 is OK: All endpoints are healthy [12:42:15] (03PS1) 10Hashar: nodepool: typo in conf template (dib_cache_dir) [puppet] - 10https://gerrit.wikimedia.org/r/225308 [12:45:38] RECOVERY - citoid endpoints health on sca1002 is OK: All endpoints are healthy [12:45:47] RECOVERY - citoid endpoints health on sca1001 is OK: All endpoints are healthy [12:49:29] 6operations, 10Citoid, 6Services: Citoid is blacklisted from ncbi.nlm.nih.gov - https://phabricator.wikimedia.org/T106044#1460156 (10Mvolz) If they aren't blocking Zotero then this *only* affects our scraping of the page following Zotero failure. In the interim we should completely remove the 'scraping' fal... [12:54:43] (03CR) 10Andrew Bogott: [C: 032] nodepool: typo in conf template (dib_cache_dir) [puppet] - 10https://gerrit.wikimedia.org/r/225308 (owner: 10Hashar) [12:56:38] 6operations, 10Citoid, 6Services: Citoid is blacklisted from ncbi.nlm.nih.gov - https://phabricator.wikimedia.org/T106044#1460168 (10Joe) @Mvolz I reached out pointing that out. We got a first response but I guess it will take some time to get this addressed. Also, who confirmed that zotero is working indeed? 
[12:58:12] (03PS2) 10Alexandros Kosiaris: Assign roles to maps-test200{1,2,3,4} [puppet] - 10https://gerrit.wikimedia.org/r/225248 [13:00:59] PROBLEM - citoid endpoints health on sca1002 is CRITICAL: /api is CRITICAL: Test bad PMCID returned the unexpected status 200 (expecting: 404) [13:01:16] PROBLEM - citoid endpoints health on sca1001 is CRITICAL: /api is CRITICAL: Test bad PMCID returned the unexpected status 200 (expecting: 404) [13:02:57] RECOVERY - citoid endpoints health on sca1002 is OK: All endpoints are healthy [13:03:07] RECOVERY - citoid endpoints health on sca1001 is OK: All endpoints are healthy [13:05:29] (03PS3) 10Alexandros Kosiaris: Assign roles to maps-test200{1,2,3,4} [puppet] - 10https://gerrit.wikimedia.org/r/225248 [13:05:35] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Assign roles to maps-test200{1,2,3,4} [puppet] - 10https://gerrit.wikimedia.org/r/225248 (owner: 10Alexandros Kosiaris) [13:07:37] PROBLEM - Restbase endpoints health on restbase1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:08:26] PROBLEM - Restbase root url on restbase1006 is CRITICAL - Socket timeout after 10 seconds [13:10:37] PROBLEM - citoid endpoints health on sca1002 is CRITICAL: /api is CRITICAL: Test bad PMCID returned the unexpected status 200 (expecting: 404) [13:11:38] PROBLEM - Restbase endpoints health on restbase1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:12:07] PROBLEM - Restbase root url on restbase1002 is CRITICAL - Socket timeout after 10 seconds [13:12:36] RECOVERY - citoid endpoints health on sca1002 is OK: All endpoints are healthy [13:14:56] PROBLEM - puppet last run on maps-test2002 is CRITICAL puppet fail [13:17:12] 6operations: Evaluate traffic flow between the Jobrunners and the Cirrus cluster - https://phabricator.wikimedia.org/T105705#1460192 (10Joe) 32 Mbit/s doesn't seem like something sane to stream between the two datacenters, IMO. 
I'll wait for confirmation from @faidon or @mark, but I guess we should reduce our ba... [13:17:44] <_joe_> whatsup with restbase? [13:17:55] <_joe_> seems the rb hosts have been restarting constantly today [13:18:17] PROBLEM - citoid endpoints health on sca1001 is CRITICAL: /api is CRITICAL: Test bad PMCID returned the unexpected status 200 (expecting: 404) [13:20:07] PROBLEM - citoid endpoints health on sca1002 is CRITICAL: /api is CRITICAL: Test bad PMCID returned the unexpected status 200 (expecting: 404) [13:22:01] 6operations, 10Citoid, 6Services: Citoid is blacklisted from ncbi.nlm.nih.gov - https://phabricator.wikimedia.org/T106044#1460203 (10Joe) Update: I can confirm zotero works, I see outgoing requests from our proxy that clearly get the correct responses. I second @Mvolz idea for an "hotfix". I would also allo... [13:24:15] _joe_: re https://phabricator.wikimedia.org/T88730#1459899 it would be nice to disable nutcracker for session cache until the situation is clarified [13:24:57] PROBLEM - puppet last run on maps-test2001 is CRITICAL puppet fail [13:25:08] <_joe_> Nemo_bis: well, sessions are on redis [13:25:11] <_joe_> not memcached [13:26:12] <_joe_> or did I miss something there? [13:26:32] <_joe_> also, any nutcracker-related session failure would appear in the memcached log [13:27:04] Hm, modules/nutcracker/manifests/init.pp mentions both redis and memcached but I didn't check whether it's actually used [13:27:27] PROBLEM - puppet last run on maps-test2003 is CRITICAL puppet fail [13:28:33] <_joe_> Nemo_bis: no we don't use nutcracker for redis [13:28:40] <_joe_> it's possible to, we don't [13:29:18] $wgObjectCaches['memcachedpecl'] seems the only mention of port 11212 indeed [13:29:27] (03PS7) 10Jakob: Add Phragile module. [puppet] - 10https://gerrit.wikimedia.org/r/218930 (https://phabricator.wikimedia.org/T101235) [13:34:58] PROBLEM - puppet last run on maps-test2004 is CRITICAL puppet fail [13:35:28] (03CR) 10Jakob: Add Phragile module. 
(034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/218930 (https://phabricator.wikimedia.org/T101235) (owner: 10Jakob) [13:41:17] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 30.77% of data above the critical threshold [500.0] [13:42:03] <_joe_> looks serious ^^ [13:42:12] <_joe_> going to look in a second [13:43:17] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 2 below the confidence bounds [13:43:50] !log apache2ctl graceful on netmon1001 [13:43:51] <_joe_> just a spike [13:43:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:56:16] !log apache2ctl graceful on fluorine antimony argon caesium helium [13:56:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:56:37] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [14:00:37] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected [14:03:32] (03PS1) 10Giuseppe Lavagetto: WDQS: install info, site.pp for wdqs100* [puppet] - 10https://gerrit.wikimedia.org/r/225317 (https://phabricator.wikimedia.org/T95679) [14:05:36] !log restart restbase service on restbase1002 [14:05:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:05:52] (03CR) 10Nikerabbit: "Trying to make this patch appear non-bold in my review list." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214893 (https://phabricator.wikimedia.org/T100313) (owner: 10Ladsgroup) [14:05:57] (03CR) 10Nikerabbit: [C: 031] Install Extension:Translate on labswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214893 (https://phabricator.wikimedia.org/T100313) (owner: 10Ladsgroup) [14:06:29] 6operations, 10CirrusSearch, 6Discovery, 3Discovery-Cirrus-Sprint: Update Elasticsearch to 1.6.1 or 1.7. 0 - https://phabricator.wikimedia.org/T106090#1460245 (10Manybubbles) Cool. 
Then the jdk upgrade can hit the machines any time. We can do it when we're logged in for the the rolling restart or you can.... [14:06:39] (03PS1) 10Alexandros Kosiaris: maps: Remove erroneous role:: prefix [puppet] - 10https://gerrit.wikimedia.org/r/225318 [14:07:17] RECOVERY - Restbase endpoints health on restbase1002 is OK: All endpoints are healthy [14:07:21] !log restart restbase service on restbase1003 [14:07:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:07:48] RECOVERY - Restbase root url on restbase1002 is OK: HTTP OK: HTTP/1.1 200 - 15149 bytes in 0.007 second response time [14:08:07] (03CR) 10Alexandros Kosiaris: [C: 032] maps: Remove erroneous role:: prefix [puppet] - 10https://gerrit.wikimedia.org/r/225318 (owner: 10Alexandros Kosiaris) [14:09:36] (03PS2) 10Giuseppe Lavagetto: WDQS: install info, site.pp for wdqs100* [puppet] - 10https://gerrit.wikimedia.org/r/225317 (https://phabricator.wikimedia.org/T95679) [14:10:09] !log restart restbase service on restbase1006 [14:10:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:10:47] RECOVERY - Restbase endpoints health on restbase1006 is OK: All endpoints are healthy [14:11:36] RECOVERY - Restbase root url on restbase1006 is OK: HTTP OK: HTTP/1.1 200 - 15149 bytes in 0.009 second response time [14:12:58] (03PS4) 10Hashar: nodepool: systemd wrapper [puppet] - 10https://gerrit.wikimedia.org/r/224102 (https://phabricator.wikimedia.org/T96867) [14:13:04] (03CR) 10jenkins-bot: [V: 04-1] nodepool: systemd wrapper [puppet] - 10https://gerrit.wikimedia.org/r/224102 (https://phabricator.wikimedia.org/T96867) (owner: 10Hashar) [14:13:43] (03CR) 10Hashar: "Finally I am back to using it as daemon or I would logging does not work. 
I added a tiny script for ExecStop that waits for the service t" [puppet] - 10https://gerrit.wikimedia.org/r/224102 (https://phabricator.wikimedia.org/T96867) (owner: 10Hashar) [14:17:22] (03PS1) 10Alexandros Kosiaris: maps: Move the hiera datafiles in the correct location [puppet] - 10https://gerrit.wikimedia.org/r/225319 [14:23:28] (03CR) 10Alexandros Kosiaris: [C: 032] maps: Move the hiera datafiles in the correct location [puppet] - 10https://gerrit.wikimedia.org/r/225319 (owner: 10Alexandros Kosiaris) [14:28:33] (03PS3) 10Giuseppe Lavagetto: WDQS: install info, site.pp for wdqs100* [puppet] - 10https://gerrit.wikimedia.org/r/225317 (https://phabricator.wikimedia.org/T95679) [14:28:44] (03CR) 10Giuseppe Lavagetto: [C: 032] WDQS: install info, site.pp for wdqs100* [puppet] - 10https://gerrit.wikimedia.org/r/225317 (https://phabricator.wikimedia.org/T95679) (owner: 10Giuseppe Lavagetto) [14:30:44] 6operations, 10CirrusSearch, 6Discovery, 3Discovery-Cirrus-Sprint: Update Elasticsearch to 1.6.1 or 1.7. 0 - https://phabricator.wikimedia.org/T106090#1460260 (10MoritzMuehlenhoff) I'll update the Java packages once the updates are available and update this task once done. [14:39:39] 6operations, 10Citoid, 6Services: Citoid is blacklisted from ncbi.nlm.nih.gov - https://phabricator.wikimedia.org/T106044#1460268 (10mobrovac) >>! In T106044#1460203, @Joe wrote: > Update: I can confirm zotero works, I see outgoing requests from our proxy that clearly get the correct responses. Zotero uses... 
[14:41:37] RECOVERY - puppet last run on maps-test2001 is OK Puppet is currently enabled, last run 9 seconds ago with 0 failures [14:43:42] (03CR) 10Yuvipanda: [C: 031] "Was joking, sorry :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214893 (https://phabricator.wikimedia.org/T100313) (owner: 10Ladsgroup) [14:47:39] (03PS1) 10Giuseppe Lavagetto: eqiad: add wdqs1001 [dns] - 10https://gerrit.wikimedia.org/r/225321 [14:48:48] 6operations, 10Wikimedia-Logstash: Update Elasticsearch on logstash* - https://phabricator.wikimedia.org/T106126#1460276 (10bd808) [14:49:19] (03PS2) 10Giuseppe Lavagetto: eqiad: add wdqs1001 [dns] - 10https://gerrit.wikimedia.org/r/225321 [14:49:40] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] eqiad: add wdqs1001 [dns] - 10https://gerrit.wikimedia.org/r/225321 (owner: 10Giuseppe Lavagetto) [14:52:43] 6operations, 3Labs-Sprint-104, 3Labs-Sprint-105: Setup/Install/Deploy labnet1002 - https://phabricator.wikimedia.org/T99701#1460289 (10Andrew) [14:59:39] 6operations, 3Labs-Sprint-104, 3Labs-Sprint-105: Setup/Install/Deploy labnet1002 - https://phabricator.wikimedia.org/T99701#1460327 (10Andrew) [15:04:44] !log restarted RB thinner scripts, see https://phabricator.wikimedia.org/T105706 [15:04:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:11:40] (03PS1) 10Amire80: Add wgSitename and wgMetaNamespace for pnbwikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/225322 [15:13:56] RECOVERY - puppet last run on maps-test2002 is OK Puppet is currently enabled, last run 40 seconds ago with 0 failures [15:14:45] 6operations, 6Labs, 3ToolLabs-Goals-Q4: Investigate kernel issues on labvirt** hosts - https://phabricator.wikimedia.org/T99738#1460389 (10Andrew) [15:16:47] RECOVERY - puppet last run on maps-test2004 is OK Puppet is currently enabled, last run 15 seconds ago with 0 failures [15:16:57] RECOVERY - puppet last run on maps-test2003 is OK Puppet is currently enabled, last run 26 seconds 
ago with 0 failures [15:18:52] 6operations, 6Labs, 10Labs-Infrastructure, 10hardware-requests: New server: labdns1001 - https://phabricator.wikimedia.org/T106147#1460393 (10Andrew) 3NEW [15:22:36] PROBLEM - puppet last run on maps-test2004 is CRITICAL Puppet has 1 failures [15:22:47] PROBLEM - puppet last run on maps-test2003 is CRITICAL Puppet has 1 failures [15:23:27] PROBLEM - puppet last run on maps-test2002 is CRITICAL Puppet has 1 failures [15:46:22] 6operations, 10Analytics, 10Traffic: Provide summary of MediaWiki downloads - https://phabricator.wikimedia.org/T104010#1460485 (10Ckoerner) Thanks to @kevinator for providing these statistics of MediaWiki downloads. From Kevin, A couple of caveats around all the data: - I cannot filter out bots downloadin... [15:46:52] 6operations, 10Labs-Vagrant: Backport Vagrant 1.7+ from Debian experimental to our Trusty apt repo - https://phabricator.wikimedia.org/T93153#1460487 (10bd808) 5Resolved>3Open Some ruby error in the backported package: ``` $ vagrant plugin list /usr/lib/ruby/vendor_ruby/vagrant/pre-rubygems.rb:19:in `requ... [15:54:24] 6operations, 6Discovery, 10Maps, 3Discovery-Maps-Sprint: Puppetize Postgres 9.4 + Postgis 2.1 role for Maps Deployment - https://phabricator.wikimedia.org/T105070#1460499 (10akosiaris) So, just to reitarate: *kartotherian: the PNG rendering service user. Should have SELECT only on all tables. Effectively... [16:14:14] 6operations, 6Commons, 10MediaWiki-File-management, 10MediaWiki-Tarball-Backports, and 7 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1460542 (10Tgr) >>! In T102566#1460075, @Tau wrote: > In both php.ini files the ";open_basedir= " (blank). Is it okay? Is thi... 
[16:21:14] 10Ops-Access-Requests, 10Ops-Access-Reviews, 6operations: Provide hoo (Marius Hoch) with Hive access - https://phabricator.wikimedia.org/T106045#1460576 (10Joe) [16:24:55] 6operations, 10Citoid, 6Services: Citoid is blacklisted from ncbi.nlm.nih.gov - https://phabricator.wikimedia.org/T106044#1460583 (10Joe) @mobrovac let's wait to see if someone is able to answer me about the block on the NIH side, but my hopes are not that high at this point. If not, we can hotfix the proble... [16:30:06] RECOVERY - puppet last run on maps-test2002 is OK Puppet is currently enabled, last run 49 seconds ago with 0 failures [16:31:07] RECOVERY - puppet last run on maps-test2004 is OK Puppet is currently enabled, last run 42 seconds ago with 0 failures [16:33:27] RECOVERY - puppet last run on maps-test2003 is OK Puppet is currently enabled, last run 15 seconds ago with 0 failures [16:42:38] 6operations, 10Citoid, 6Services: Citoid is blacklisted from ncbi.nlm.nih.gov - https://phabricator.wikimedia.org/T106044#1460654 (10mobrovac) >>! In T106044#1460583, @Joe wrote: > @mobrovac let's wait to see if someone is able to answer me about the block on the NIH side, but my hopes are not that high at t... [16:44:08] 6operations, 10Citoid, 6Services, 10VisualEditor, 3VisualEditor 2015/16 Q1 blockers: Citoid is blacklisted from ncbi.nlm.nih.gov - https://phabricator.wikimedia.org/T106044#1460659 (10Jdforrester-WMF) [16:44:30] 6operations, 6Services, 5Patch-For-Review, 7Service-Architecture: Set up monitoring automation for services - https://phabricator.wikimedia.org/T94821#1460668 (10mobrovac) [16:45:07] PROBLEM - Restbase root url on restbase1001 is CRITICAL - Socket timeout after 10 seconds [16:45:19] PROBLEM - Restbase endpoints health on restbase1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:49:34] (03PS1) 10Andrew Bogott: Set up a keypair for cold migration. 
[puppet] - 10https://gerrit.wikimedia.org/r/225332 (https://phabricator.wikimedia.org/T106145) [16:49:45] !log restarted restbase on restbase1001 [16:49:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:50:46] RECOVERY - Restbase root url on restbase1001 is OK: HTTP OK: HTTP/1.1 200 - 15149 bytes in 0.018 second response time [16:51:06] RECOVERY - Restbase endpoints health on restbase1001 is OK: All endpoints are healthy [16:53:00] 6operations, 10CirrusSearch, 6Discovery: Update Elasticsearch to 1.6.1 or 1.7. 0 - https://phabricator.wikimedia.org/T106090#1460691 (10Manybubbles) [16:53:39] 6operations, 10CirrusSearch, 6Discovery: Validate Cirrus against Elasticsearch 1.7.0 - https://phabricator.wikimedia.org/T106160#1460693 (10Manybubbles) 3NEW [16:53:45] (03CR) 10Hashar: "I found out a nasty bug in Nodepool. It does not gracefully stop because only some of the threads are not shutdown. So more work is needed" [puppet] - 10https://gerrit.wikimedia.org/r/224102 (https://phabricator.wikimedia.org/T96867) (owner: 10Hashar) [16:53:51] 6operations, 10CirrusSearch, 6Discovery, 3Discovery-Cirrus-Sprint: Validate Cirrus against Elasticsearch 1.7.0 - https://phabricator.wikimedia.org/T106160#1460693 (10Manybubbles) [16:54:18] 6operations, 10CirrusSearch, 6Discovery, 3Discovery-Cirrus-Sprint: Release wikimedia-extra plugin for Elasticsearch 1.6.0 - https://phabricator.wikimedia.org/T106161#1460701 (10Manybubbles) 3NEW [16:54:37] 6operations, 10CirrusSearch, 6Discovery: Release experimental-highlighter for 1.7.0 - https://phabricator.wikimedia.org/T106162#1460707 (10Manybubbles) 3NEW [16:54:38] (03PS2) 10Andrew Bogott: Set up a keypair for cold migration. [puppet] - 10https://gerrit.wikimedia.org/r/225332 (https://phabricator.wikimedia.org/T106145) [16:54:40] (03CR) 10Hashar: [C: 04-1] "The config is screwed up somehow. Nothing is being logged although the process does have open file handles on debug.log etc." 
[puppet] - 10https://gerrit.wikimedia.org/r/224106 (owner: 10Hashar) [16:55:28] 6operations, 10CirrusSearch, 6Discovery: Release swift-repository for 1.7.0 - https://phabricator.wikimedia.org/T106163#1460716 (10Manybubbles) 3NEW [16:56:07] 6operations, 10CirrusSearch, 6Discovery: Upgrade beta to Elasticsearch 1.7.0 - https://phabricator.wikimedia.org/T106164#1460722 (10Manybubbles) 3NEW [16:56:17] 6operations, 10CirrusSearch, 6Discovery: Upgrade production to 1.7.0 - https://phabricator.wikimedia.org/T106165#1460728 (10Manybubbles) 3NEW [16:57:26] 6operations, 10CirrusSearch, 6Discovery: Release swift-repository for 1.7.0 - https://phabricator.wikimedia.org/T106163#1460716 (10Manybubbles) [16:57:28] 6operations, 10CirrusSearch, 6Discovery: Upgrade beta to Elasticsearch 1.7.0 - https://phabricator.wikimedia.org/T106164#1460722 (10Manybubbles) [16:57:34] 6operations, 10CirrusSearch, 6Discovery: Upgrade production to 1.7.0 - https://phabricator.wikimedia.org/T106165#1460728 (10Manybubbles) [16:57:36] 6operations, 10CirrusSearch, 6Discovery: Upgrade beta to Elasticsearch 1.7.0 - https://phabricator.wikimedia.org/T106164#1460722 (10Manybubbles) [16:57:46] 6operations, 10CirrusSearch, 6Discovery, 3Discovery-Cirrus-Sprint: Upgrade beta to Elasticsearch 1.7.0 - https://phabricator.wikimedia.org/T106164#1460722 (10Manybubbles) [16:58:11] 6operations, 10CirrusSearch, 6Discovery: Upgrade production to 1.7.0 - https://phabricator.wikimedia.org/T106165#1460728 (10Manybubbles) [16:58:13] 6operations, 10CirrusSearch, 6Discovery, 3Discovery-Cirrus-Sprint: Upgrade beta to Elasticsearch 1.7.0 - https://phabricator.wikimedia.org/T106164#1460753 (10Manybubbles) [16:58:15] 6operations, 10CirrusSearch, 6Discovery: Update Elasticsearch to 1.6.1 or 1.7. 
0 - https://phabricator.wikimedia.org/T106090#1460751 (10Manybubbles) [16:58:33] 6operations, 10CirrusSearch, 6Discovery, 3Discovery-Cirrus-Sprint: Upgrade production to 1.7.0 - https://phabricator.wikimedia.org/T106165#1460756 (10Manybubbles) [16:58:48] 6operations, 10CirrusSearch, 6Discovery, 3Discovery-Cirrus-Sprint: Release swift-repository for 1.7.0 - https://phabricator.wikimedia.org/T106163#1460758 (10Manybubbles) [16:58:55] 6operations, 10CirrusSearch, 6Discovery, 3Discovery-Cirrus-Sprint: Release experimental-highlighter for 1.7.0 - https://phabricator.wikimedia.org/T106162#1460760 (10Manybubbles) [16:59:20] 6operations, 10CirrusSearch, 6Discovery: [epic] Update Elasticsearch to 1.6.1 or 1.7. 0 - https://phabricator.wikimedia.org/T106090#1460762 (10Manybubbles) [16:59:39] 6operations, 10CirrusSearch, 6Discovery, 3Discovery-Cirrus-Sprint: Release wikimedia-extra plugin for Elasticsearch 1.6.0 - https://phabricator.wikimedia.org/T106161#1460765 (10Manybubbles) a:3Manybubbles [16:59:57] http://candidhosting.com/ WikiPedia Depends on PowerMedium? ha [17:01:30] o_O [17:01:33] candid. [17:01:51] yeah [17:02:19] old pmtpa times come back [17:03:26] Does that violate trademark policy? [17:05:38] I don't know [17:07:46] SPF|Cloud: it still says that? [17:08:04] yeah [17:08:18] which page specifically, cuz i can forward it on to communications [17:08:19] http://gyazo.com/56994234bd57b650514d6b833f9f8926 [17:08:21] they would handle that [17:08:33] oh, in the image.... [17:08:35] Just the index page of http://candidhosting.com [17:08:41] 6operations, 6Commons, 10MediaWiki-File-management, 10MediaWiki-Tarball-Backports, and 7 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1460783 (10Tau) Should the open_basedir be enabled then? What directory I should set to open_basedir? If I change the owner... [17:08:58] SPF|Cloud: huh, good catch, old pmtpa and even then [17:09:05] we dont let folks just list the brand for fun. 
[17:09:31] robh: thanks to Wikimedia Commons :) [17:09:39] and thx for forwarding that [17:11:07] yea i dunno if this is something where mark may know, so i'll be emailing coms and ccing him [17:11:13] coms/legal i suppose [17:11:14] No problem [17:18:22] heh, first tile of every page load [17:18:38] i dunno what it says, but im more annoyed by 'WikiPedia' than by unauthorized use. [17:19:19] :') me too [17:20:33] plus all the statistics on servers are so outdated =P [17:20:38] yeah [17:20:52] So: what do we depend on? ;) [17:20:52] so yea, i emailed our tm team, but they are all at wikimania ;D [17:21:05] i just got all the out of office replies from them, heh. [17:21:16] I laughed at the "39 dedicated servers" [17:21:16] ololol [17:21:28] jynus: yea srsly we have had more than that since i was hired [17:21:32] robh! [17:21:38] when i started we had 8 racks! [17:21:40] 39 dedicated servers [17:21:49] 8 racks and srv instead of mw? [17:22:00] (naming scheme) [17:22:06] well, 8 racks, 5 of which had servers, 1 network, and then 2 racks for wikia and toollabs and stuffs [17:22:23] but yea, i was never around for 39 servers. [17:22:51] Since when are you employed? 2005? [17:22:58] Or earlier [17:23:08] dec 2006 [17:23:13] oh lol [17:23:28] this banner has been running since before i was hired. [17:23:35] its kinda made my day ;D [17:23:59] My question remains: what do we depend on then? [17:24:47] well, when we first started they were our datacenter. [17:25:00] but when we shut down tampa, they stopped being a vendor. 
[17:25:11] * robh saves the flash file for fun [17:26:00] "Jim Wales - Wikipedia Founder" [17:27:37] lol [17:28:33] robh: I apparently either can't download it or play it [17:28:52] can you give me the correct download link thx [17:28:58] oh, i did the firefox 'save page' and then pulled the file out of that [17:29:13] I guess http://candidhosting.com/pm_intro.swf though [17:29:15] as it saves the entire page and pushes all the content into a sub-direcotry next to the html file [17:29:16] yep! [17:30:26] And downloaded (: [17:53:15] 6operations, 10Analytics, 6MediaWiki-Stakeholders-Group, 10Traffic, 10Wikimania-Hackathon-2015: Provide summary of MediaWiki downloads - https://phabricator.wikimedia.org/T104010#1460865 (10Ckoerner) [17:53:33] don't click on random swf files! [17:53:34] 6operations, 10Analytics, 6MediaWiki-Stakeholders-Group, 10Traffic, 10Wikimania-Hackathon-2015: Provide summary of MediaWiki downloads - https://phabricator.wikimedia.org/T104010#1405390 (10Ckoerner) [17:58:20] (03CR) 10Andrew Bogott: [C: 032] Set up a keypair for cold migration. [puppet] - 10https://gerrit.wikimedia.org/r/225332 (https://phabricator.wikimedia.org/T106145) (owner: 10Andrew Bogott) [18:11:29] RECOVERY - citoid endpoints health on sca1001 is OK: All endpoints are healthy [18:17:08] PROBLEM - citoid endpoints health on sca1001 is CRITICAL: /api is CRITICAL: Test bad PMCID returned the unexpected status 200 (expecting: 404) [18:24:59] that's a pretty weird error [18:25:09] also, what's a PMCID? [18:28:50] meh sca1001:/var/log/citoid/main.log has no timestamps [18:29:19] oh i take that back, they're inside the json instead of prepended [18:36:16] 6operations, 10CirrusSearch, 6Discovery, 3Discovery-Cirrus-Sprint: Release experimental-highlighter for 1.7.0 - https://phabricator.wikimedia.org/T106162#1460917 (10Manybubbles) a:3Manybubbles [18:37:47] (03PS1) 10Andrew Bogott: More clearly define the novasync role on compute vs. controller nodes. 
[puppet] - 10https://gerrit.wikimedia.org/r/225351 [18:42:50] (03CR) 10Andrew Bogott: [C: 032] More clearly define the novasync role on compute vs. controller nodes. [puppet] - 10https://gerrit.wikimedia.org/r/225351 (owner: 10Andrew Bogott) [18:44:31] i am looking at this citoid problem but i don't know what would make it return 200 when 404 is expected. anybody have insight? [18:45:05] i ran the nrpe check by hand and ngrepped and looked at the log, but the problem is not apparent [18:45:14] (03PS1) 10Andrew Bogott: novamigrate.pp -> migrate.pp [puppet] - 10https://gerrit.wikimedia.org/r/225354 [18:46:49] (03CR) 10Andrew Bogott: [C: 032] novamigrate.pp -> migrate.pp [puppet] - 10https://gerrit.wikimedia.org/r/225354 (owner: 10Andrew Bogott) [18:49:37] PROBLEM - puppet last run on labcontrol1001 is CRITICAL puppet fail [18:50:26] 6operations, 6Commons, 10MediaWiki-File-management, 10MediaWiki-Tarball-Backports, and 7 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1460950 (10Tgr) If it is not a large production website, I would just do `chmod -R a+rw /var/log/mediawiki`. [18:50:28] PROBLEM - puppet last run on labvirt1009 is CRITICAL puppet fail [18:51:01] (03PS1) 10Andrew Bogott: Apparently a class name can't start with a space. [puppet] - 10https://gerrit.wikimedia.org/r/225356 [18:53:29] (03CR) 10Andrew Bogott: [C: 032] Apparently a class name can't start with a space. [puppet] - 10https://gerrit.wikimedia.org/r/225356 (owner: 10Andrew Bogott) [18:54:58] PROBLEM - puppet last run on labcontrol1002 is CRITICAL puppet fail [18:56:04] (03PS1) 10Andrew Bogott: Don't require nova-compute. [puppet] - 10https://gerrit.wikimedia.org/r/225357 [18:58:27] (03CR) 10Andrew Bogott: [C: 032] Don't require nova-compute. 
[puppet] - 10https://gerrit.wikimedia.org/r/225357 (owner: 10Andrew Bogott) [19:06:48] PROBLEM - puppet last run on cp1047 is CRITICAL Puppet has 1 failures [19:14:55] (03PS1) 10Andrew Bogott: Specify a few more settings for nova::migration on controller [puppet] - 10https://gerrit.wikimedia.org/r/225361 [19:15:31] 6operations, 7network: set up a looking glass for WMF ASes - https://phabricator.wikimedia.org/T106056#1461060 (10faidon) p:5Triage>3Lowest That would be nice indeed, but needs a bit of work to do it properly: - Opening up a web interface to our routers directly like many others do is a bad idea IMO, as it... [19:15:46] (03CR) 10jenkins-bot: [V: 04-1] Specify a few more settings for nova::migration on controller [puppet] - 10https://gerrit.wikimedia.org/r/225361 (owner: 10Andrew Bogott) [19:16:50] (03PS2) 10Andrew Bogott: Specify a few more settings for nova::migration on controller [puppet] - 10https://gerrit.wikimedia.org/r/225361 [19:18:07] RECOVERY - puppet last run on labvirt1009 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [19:19:08] andrewbogott: labnodepool1001 cronspam, sudo errors [19:19:45] paravoid: just now? There was a flood of them yesterday but I haven’t gotten any today. [19:20:00] (And yesterday’s problem, at least, is resolved) [19:20:55] I see a few ~3 hours ago [19:21:04] 6operations, 10Labs-Vagrant: Backport Vagrant 1.7+ from Debian experimental to our Trusty apt repo - https://phabricator.wikimedia.org/T93153#1461076 (10bd808) Trusty has a ruby2.0 package that can be installed and made the default interpreter: ``` sudo apt-get install ruby2.0 ruby2.0-dev sudo update-alternat... [19:21:33] 10Ops-Access-Requests, 6operations: tjones needs access to stat1002 - https://phabricator.wikimedia.org/T106175#1461079 (10TJones) 3NEW [19:21:41] oh yeah, I guess I saw those too. 
I think hasher (and the cronspam) is done for the day though :) [19:22:09] (03CR) 10Andrew Bogott: [C: 032] Specify a few more settings for nova::migration on controller [puppet] - 10https://gerrit.wikimedia.org/r/225361 (owner: 10Andrew Bogott) [19:22:30] ok :) [19:24:58] RECOVERY - puppet last run on labcontrol1001 is OK Puppet is currently enabled, last run 17 seconds ago with 0 failures [19:33:19] (03PS1) 10Hashar: nodepool: point to Zuul DNS service entry [puppet] - 10https://gerrit.wikimedia.org/r/225370 [19:34:14] 6operations, 10Labs-Vagrant: Backport Vagrant 1.7+ from Debian experimental to our Trusty apt repo - https://phabricator.wikimedia.org/T93153#1461109 (10bd808) Just to see what the next hurdle would be I installed the needed gems manually. ``` sudo apt-get build-dep ruby-nokogiri sudo apt-get install zlib1g-de... [19:34:17] RECOVERY - puppet last run on cp1047 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [19:40:59] 10Ops-Access-Requests, 6operations, 10Analytics-Cluster: Sudo permissions for hdfs user on analytics-hadoop - https://phabricator.wikimedia.org/T104020#1461143 (10Ottomata) [19:41:42] 10Ops-Access-Requests, 6operations, 10Analytics-Cluster: Sudo permissions for hdfs user on analytics-hadoop - https://phabricator.wikimedia.org/T104020#1461146 (10Ottomata) @kevinator please approve. This will need to be discussed at an Ops meeting. The relevant group is analytics-admins, this grants sudo... [19:51:49] RECOVERY - puppet last run on labcontrol1002 is OK Puppet is currently enabled, last run 2 seconds ago with 0 failures [20:01:06] akosiaris, ping :) [20:03:24] !log stopping Zuul to get rid of a faulty registered function "build:Global-Dev Dashboard Data". Job is gone already. [20:03:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:05:19] >

Queue only mode: preparing to exit, queue length: 3

[20:10:15] (03PS2) 10Faidon Liambotis: check_ssl: add support for picking the auth algorithm [puppet] - 10https://gerrit.wikimedia.org/r/224860 [20:10:17] (03CR) 10BryanDavis: "The backported vagrant 1.7.2+dfsg-4 package from sid requires Ruby 2.*. Trusty has Ruby 1.9. :((" [puppet] - 10https://gerrit.wikimedia.org/r/193665 (https://phabricator.wikimedia.org/T90892) (owner: 10BryanDavis) [20:10:22] (03CR) 10Faidon Liambotis: [C: 032 V: 032] check_ssl: add support for picking the auth algorithm [puppet] - 10https://gerrit.wikimedia.org/r/224860 (owner: 10Faidon Liambotis) [20:12:40] (03PS1) 10Hashar: Stop all threads on SIGUSR1 [debs/nodepool] (debian) - 10https://gerrit.wikimedia.org/r/225378 [20:18:29] (03PS1) 10Alexandros Kosiaris: maps: populate postgres granting SQL script [puppet] - 10https://gerrit.wikimedia.org/r/225401 [20:18:44] yurik: pong [20:18:58] yurik: https://gerrit.wikimedia.org/r/225401 wanna check those rights are OK ? [20:19:08] (03PS1) 10Alex Monk: Use NewUserMessage on gomwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/225409 (https://phabricator.wikimedia.org/T106169) [20:19:12] hey, we are in adelita, 4th fl, behind the elevators [20:19:47] (03PS1) 10Hashar: Stop all threads on SIGUSR1 [debs/nodepool] (patch-queue/debian) - 10https://gerrit.wikimedia.org/r/225410 [20:19:51] yurik: I think they are what is described in https://phabricator.wikimedia.org/T105070#1460499 [20:19:58] yurik: ok, I will be there in a few [20:20:15] (03Abandoned) 10Hashar: Stop all threads on SIGUSR1 [debs/nodepool] (debian) - 10https://gerrit.wikimedia.org/r/225378 (owner: 10Hashar) [20:21:11] (03CR) 10Hashar: [C: 04-2] "Not meant to be merged, that is applied to the Debian package as a quilt patch." [debs/nodepool] (patch-queue/debian) - 10https://gerrit.wikimedia.org/r/225410 (owner: 10Hashar) [20:21:31] akosiaris, we'll be using password auth even though postgres supports unix auth? 
[20:22:28] (03CR) 10Yurik: [C: 031] maps: populate postgres granting SQL script [puppet] - 10https://gerrit.wikimedia.org/r/225401 (owner: 10Alexandros Kosiaris) [20:26:40] 6operations: asw-b-eqiad:ge-5/0/1(nas1001-a:e0a) port saturation - https://phabricator.wikimedia.org/T106181#1461280 (10RobH) [20:28:55] (03PS2) 10Hashar: Stop all threads on SIGUSR1 [debs/nodepool] (patch-queue/debian) - 10https://gerrit.wikimedia.org/r/225410 [20:29:24] (03PS3) 10Hashar: Stop all threads on SIGUSR1 [debs/nodepool] (patch-queue/debian) - 10https://gerrit.wikimedia.org/r/225410 [20:29:40] MaxSem: yup [20:29:58] akosiaris, does it increase security? [20:30:24] so when we say we're running swift for shared storage - now that I'm looking into it that is just an api specification. [20:30:42] no, it's more flexible and less bothersome. We have to increase unix accounts for unix auth to work and we don't really want that [20:30:42] 6operations: asw-b-eqiad:ge-5/0/1(nas1001-a:e0a) port saturation - https://phabricator.wikimedia.org/T106181#1461315 (10RobH) I realize that half of the listed nas ports are the redundant backplane. My main question is if we can bond the ports on the single backplane. [20:31:04] or am I missing something [20:32:07] looks like I'm missing something [20:32:17] 6operations, 7user-notice: schedule maintenance for IRC server - https://phabricator.wikimedia.org/T105804#1461321 (10gpaumier) As it currently stands, I can't communicate this to users since there is no date and time given here. The tech newsletter will be frozen for translation in a few hours, and unless add... [20:34:13] 6operations: asw-b-eqiad:ge-5/0/1(nas1001-a:e0a) port saturation - https://phabricator.wikimedia.org/T106181#1461332 (10akosiaris) Isn't the plan to deprecate and remove both nas1001-a and nas1001-b ? In fact I am moving some final data away from it and to helium's disk shelf which might be why we see the alarm.... 
[20:35:38] 10Ops-Access-Requests, 6operations: Requesting access to stat1003 and eventlogging for legoktm - https://phabricator.wikimedia.org/T106184#1461337 (10Legoktm) 3NEW [20:38:02] (03PS4) 10Hashar: Support spaces in Gearman functions names [debs/nodepool] (patch-queue/debian) - 10https://gerrit.wikimedia.org/r/205564 [20:38:04] (03PS4) 10Hashar: Stop all threads on SIGUSR1 [debs/nodepool] (patch-queue/debian) - 10https://gerrit.wikimedia.org/r/225410 [20:45:49] (03PS2) 10Alexandros Kosiaris: etherpad: switch to HTTPS-only (redirect, HSTS) [puppet] - 10https://gerrit.wikimedia.org/r/224823 (owner: 10Faidon Liambotis) [20:45:57] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] etherpad: switch to HTTPS-only (redirect, HSTS) [puppet] - 10https://gerrit.wikimedia.org/r/224823 (owner: 10Faidon Liambotis) [20:46:01] (03PS1) 10Hashar: wmf4: patch to stop all threads on SIGUSR1 [debs/nodepool] (debian) - 10https://gerrit.wikimedia.org/r/225475 [20:46:18] and now the hardest part, figure out how to build the .deb package :D [20:49:07] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 23.08% of data above the critical threshold [500.0] [20:49:16] 6operations, 6Reading-Admin, 10Traffic, 7HTTPS, and 2 others: TLS and *.wap/*.mobile multi-level subdomains of wikipedia.org - https://phabricator.wikimedia.org/T104942#1461379 (10Tnegrin) This has been an enlightening thread in many ways! Thanks everybody. I'm not seeing anything in the longer queries th... [20:52:26] (03PS2) 10Alexandros Kosiaris: maps: populate postgres granting SQL script [puppet] - 10https://gerrit.wikimedia.org/r/225401 [20:52:32] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] maps: populate postgres granting SQL script [puppet] - 10https://gerrit.wikimedia.org/r/225401 (owner: 10Alexandros Kosiaris) [20:53:19] !log Manually fixed issue in mediawikiwiki LQT thread table with rename of Ecliptica to Entropy. 
https://phabricator.wikimedia.org/T106122#1461380 [20:53:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:56:00] (03PS1) 10Alexandros Kosiaris: maps: lint the role [puppet] - 10https://gerrit.wikimedia.org/r/225478 [20:56:24] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] maps: lint the role [puppet] - 10https://gerrit.wikimedia.org/r/225478 (owner: 10Alexandros Kosiaris) [20:58:48] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [21:09:48] (03PS1) 10Manybubbles: Upgrade swift repository [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/225483 [21:12:29] 6operations, 10CirrusSearch, 6Discovery, 3Discovery-Cirrus-Sprint: Release swift-repository for 1.7.0 - https://phabricator.wikimedia.org/T106163#1461434 (10Manybubbles) a:3Manybubbles [21:18:46] mutante: around ? [21:21:17] (03PS6) 10Lokal Profil: Add DCAT-AP for Wikibase [puppet] - 10https://gerrit.wikimedia.org/r/219800 (https://phabricator.wikimedia.org/T103087) [21:23:26] (03CR) 10Lokal Profil: "patch 6 is essentially a revert to patch 4 since intuition does not need prefixes per https://github.com/Krinkle/intuition/pull/50" [puppet] - 10https://gerrit.wikimedia.org/r/219800 (https://phabricator.wikimedia.org/T103087) (owner: 10Lokal Profil) [21:27:43] 10Ops-Access-Requests, 6operations: tjones needs access to stat1002 - https://phabricator.wikimedia.org/T106175#1461460 (10Deskana) [21:30:47] 10Ops-Access-Requests, 6operations: tjones needs access to stat1002 - https://phabricator.wikimedia.org/T106175#1461469 (10Wwes) Approved [21:35:07] PROBLEM - HHVM rendering on mw2105 is CRITICAL - Socket timeout after 10 seconds [21:35:17] (03PS1) 10Andrew Bogott: Revert "Set up a keypair for cold migration" and all followups. [puppet] - 10https://gerrit.wikimedia.org/r/225486 [21:35:19] (03PS1) 10Andrew Bogott: New, python-based migrate script. 
[puppet] - 10https://gerrit.wikimedia.org/r/225487 [21:36:57] RECOVERY - HHVM rendering on mw2105 is OK: HTTP OK: HTTP/1.1 200 OK - 72821 bytes in 0.368 second response time [21:37:14] (03CR) 10Andrew Bogott: [C: 032] Revert "Set up a keypair for cold migration" and all followups. [puppet] - 10https://gerrit.wikimedia.org/r/225486 (owner: 10Andrew Bogott) [21:37:25] (03CR) 10Andrew Bogott: [C: 032] New, python-based migrate script. [puppet] - 10https://gerrit.wikimedia.org/r/225487 (owner: 10Andrew Bogott) [21:38:10] (03PS1) 10Alexandros Kosiaris: maps: Set postgres/cassandra data directories [puppet] - 10https://gerrit.wikimedia.org/r/225490 [21:39:50] (03CR) 10Ori.livneh: "No comments?" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/225487 (owner: 10Andrew Bogott) [21:40:12] (03PS2) 10Alexandros Kosiaris: maps: Set postgres/cassandra data directories [puppet] - 10https://gerrit.wikimedia.org/r/225490 [21:45:49] (03PS3) 10Alexandros Kosiaris: maps: Set postgres/cassandra data directories [puppet] - 10https://gerrit.wikimedia.org/r/225490 [21:45:57] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] maps: Set postgres/cassandra data directories [puppet] - 10https://gerrit.wikimedia.org/r/225490 (owner: 10Alexandros Kosiaris) [21:49:18] PROBLEM - puppet last run on maps-test2001 is CRITICAL Puppet has 4 failures [21:49:37] PROBLEM - puppet last run on maps-test2004 is CRITICAL Puppet has 1 failures [21:50:08] akosiaris: ^ are these on ganeti or are these real boxen? [21:50:33] YuviPanda: real boxes. 
poweredge R610 [21:50:36] ignore for now [21:50:39] ah, nice [21:53:12] (03PS4) 10Hashar: nodepool: setup python logger [puppet] - 10https://gerrit.wikimedia.org/r/224106 [21:53:20] (03CR) 10jenkins-bot: [V: 04-1] nodepool: setup python logger [puppet] - 10https://gerrit.wikimedia.org/r/224106 (owner: 10Hashar) [21:53:48] PROBLEM - puppet last run on maps-test2002 is CRITICAL Puppet has 1 failures [21:53:49] PROBLEM - puppet last run on maps-test2003 is CRITICAL Puppet has 1 failures [21:54:45] (03PS1) 10Alexandros Kosiaris: maps: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/225498 [21:54:50] (03CR) 10jenkins-bot: [V: 04-1] maps: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/225498 (owner: 10Alexandros Kosiaris) [21:55:07] (03PS9) 10Madhuvishy: [WIP] wikilabels: Introducing module [puppet] - 10https://gerrit.wikimedia.org/r/225092 [21:55:51] (03PS2) 10Alexandros Kosiaris: maps: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/225498 [21:55:57] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] maps: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/225498 (owner: 10Alexandros Kosiaris) [21:56:34] 6operations: Add tmux to maps (or other) servers - https://phabricator.wikimedia.org/T106191#1461505 (10Yurik) 3NEW a:3akosiaris [21:56:42] (03CR) 10Hashar: "So I have hit a bug in Nodepool which does not properly stops all its threads on graceful stop. I proposed a patch upstream and bumped ou" [puppet] - 10https://gerrit.wikimedia.org/r/224102 (https://phabricator.wikimedia.org/T96867) (owner: 10Hashar) [21:57:07] ottomata: around? [21:57:53] (03CR) 10Hashar: [C: 032 V: 032] wmf4: patch to stop all threads on SIGUSR1 [debs/nodepool] (debian) - 10https://gerrit.wikimedia.org/r/225475 (owner: 10Hashar) [21:57:54] yup [21:57:59] YuviPanda: on 6th floor couches [21:58:05] there is sunlight here! [21:58:06] there's a sixth floor thing?! [21:58:12] terrace [21:58:22] ottomata: madhuvishy just finished her first puppet module! wanna review? 
:) [21:58:22] hax [21:58:25] sure! [21:58:54] ottomata: https://gerrit.wikimedia.org/r/#/c/225092/ minor fixes coming up [21:59:03] (03PS1) 10Andrew Bogott: A few style updates + some docs [puppet] - 10https://gerrit.wikimedia.org/r/225499 [22:00:55] (03CR) 10Yuvipanda: A few style updates + some docs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/225499 (owner: 10Andrew Bogott) [22:01:04] andrewbogott: these are *all* trusty hosts now, right? [22:01:18] ottomata: I'll merge after you've taken a look, etc :) [22:01:19] (03PS2) 10Andrew Bogott: A few style updates + some docs [puppet] - 10https://gerrit.wikimedia.org/r/225499 [22:01:28] YuviPanda: ‘these’? [22:01:37] andrewbogott: I mean, virt* hosts [22:01:39] labvirt* [22:01:43] labvirt, yes [22:01:58] Still waiting for hardware to upgrade labnet though [22:02:07] andrewbogott: hmm, I wonder if we should just make all new scripts py3 - all labstore ones are py3 [22:02:25] YuviPanda: looking in just a few mins... [22:02:29] ottomata: cool! [22:03:26] (03PS5) 10Hashar: nodepool: systemd wrapper [puppet] - 10https://gerrit.wikimedia.org/r/224102 (https://phabricator.wikimedia.org/T96867) [22:03:32] (03CR) 10jenkins-bot: [V: 04-1] nodepool: systemd wrapper [puppet] - 10https://gerrit.wikimedia.org/r/224102 (https://phabricator.wikimedia.org/T96867) (owner: 10Hashar) [22:03:34] (03PS1) 10Alexandros Kosiaris: maps: fix just another typo [puppet] - 10https://gerrit.wikimedia.org/r/225500 [22:03:39] (03CR) 10jenkins-bot: [V: 04-1] maps: fix just another typo [puppet] - 10https://gerrit.wikimedia.org/r/225500 (owner: 10Alexandros Kosiaris) [22:03:44] YuviPanda: where do ‘’’docstrings’’’ go if they are file-wide? Just stick ‘em at the top? [22:03:47] (03PS6) 10Hashar: nodepool: systemd wrapper [puppet] - 10https://gerrit.wikimedia.org/r/224102 (https://phabricator.wikimedia.org/T96867) [22:03:48] Or do they have to belong in an object? 
[22:03:59] andrewbogott: just stick 'em at the top yeah [22:04:12] + I’m only planning to work for another 15 minutes so probably won’t rewrite in 3 today. You’re welcome to if you want though... [22:04:19] mostly I’m just trying to get you some sort of migrate tool before I vanish [22:04:23] since our old system is broken [22:04:26] (03PS2) 10Alexandros Kosiaris: maps: fix just another typo [puppet] - 10https://gerrit.wikimedia.org/r/225500 [22:04:33] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] maps: fix just another typo [puppet] - 10https://gerrit.wikimedia.org/r/225500 (owner: 10Alexandros Kosiaris) [22:04:38] andrewbogott: ok! I'll rewrite it later :) [22:04:50] madhuvishy: pykafka available in wikimedia apt [22:05:10] (03CR) 10Hashar: "Finally good for reviewing. I have installed it manually on labnodepool1001.eqiad.wmnet and disabled puppet." [puppet] - 10https://gerrit.wikimedia.org/r/224102 (https://phabricator.wikimedia.org/T96867) (owner: 10Hashar) [22:05:25] (03PS10) 10Madhuvishy: wikilabels: Puppetizing wikilabels infrastructure [puppet] - 10https://gerrit.wikimedia.org/r/225092 [22:05:32] ottomata: yay [22:05:36] YuviPanda: role::labs::? [22:06:07] ottomata: yeah, that's what I've been doing for all these labs specific things [22:06:21] why ::labs:: before the real class name? [22:06:22] why not [22:06:27] role::wikilabels::labs [22:06:29] ? [22:06:30] (03PS3) 10Andrew Bogott: A few style updates + some docs [puppet] - 10https://gerrit.wikimedia.org/r/225499 [22:06:37] (03PS5) 10Hashar: nodepool: setup python logger [puppet] - 10https://gerrit.wikimedia.org/r/224106 [22:06:42] oh cool! I didn't know there was a host resource! [22:06:45] ottomata: because that is a terrible pattern I killed with a lot of fire at some point :) [22:06:54] because you end up with ::production, ::labs and stuff :) [22:07:07] labs:: is purely stylistic - we can get rid of it if you want [22:07:19] why is ::labs worse than labs::?
[22:07:55] (I agree with you though, I avoid environment based classes if possible) [22:08:05] but, with hiera, you shouldn't need to differentiate, right? [22:08:12] (03CR) 10Hashar: "Finally ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/224106 (owner: 10Hashar) [22:08:29] ottomata: yeah, we can kill it if you want to [22:08:49] Ja lets unless you really really like it and want to argue for it :) [22:08:54] ottomata: no :P [22:09:03] YuviPanda: maybe just add a comment that this is only intended for labs? [22:09:07] (for now?) [22:09:37] ottomata: we can do a if $::realm != 'labs' { BLOW UP EVERYTHING } [22:09:47] haha, sure, i think we do that for wikimetrics [22:10:00] (03PS1) 10Alexandros Kosiaris: maps: Include osm in role class [puppet] - 10https://gerrit.wikimedia.org/r/225503 [22:10:01] another q [22:10:13] hmm, naw, nm [22:10:43] YuviPanda: is 'venv' a convention I am not aware of? [22:11:01] ottomata: this is the virtualenv based setup I've been doing of late [22:11:14] ok, you just like abbreviating? [22:11:14] hehe [22:11:28] 6operations, 10Continuous-Integration-Infrastructure, 5Continuous-Integration-Isolation, 7Nodepool, 5Patch-For-Review: Use systemd for Nodepool - https://phabricator.wikimedia.org/T96867#1461538 (10hashar) Status update ------ I have finished the systemd patch: https://gerrit.wikimedia.org/r/#/c/224102/... [22:11:36] ottomata: oh, I've heard it called venv for a long time [22:11:42] ottomata: that's also the name of the uwsgi config param [22:13:01] ok thats fine then [22:19:50] YuviPanda: madhuvishy, probably fine in labs, because there are few processors there [22:19:55] but processorcount * 4 is a lot, no? [22:20:03] maybe put a min on there?
[22:20:06] ottomata: nah, it's mostly ok - this isn't CPU bound [22:20:17] RECOVERY - puppet last run on maps-test2001 is OK Puppet is currently enabled, last run 11 seconds ago with 0 failures [22:20:43] something like min(8, processorcount*4) would be better, but i dont really care since this is just labs :) [22:20:56] pffft, stop saying 'just labs' :P [22:21:04] haha [22:21:12] since this is awesome labs [22:21:13] why would you want to set a min? [22:21:31] if there were 24 procs, you'd get 96 python processes [22:21:59] hmm, but why would a min help there? [22:22:06] (03PS4) 10Andrew Bogott: cold-migrate: A few style updates + some docs [puppet] - 10https://gerrit.wikimedia.org/r/225499 [22:22:06] if anything that calls for a max :D [22:22:27] who's max? [22:24:05] ori: hoo's max? [22:25:14] I don't think he is [22:26:00] who does! [22:26:18] hoo does? [22:26:25] hoo knows? [22:26:33] yes!~ [22:27:02] {{cn}} [22:27:28] Seen who? [22:27:53] YuviPanda: min(8, 96) == 8 [22:27:57] lol [22:28:00] gj ottomata [22:28:02] ottomata: oh, lol, haha right >_> [22:28:03] optimisation [22:28:16] but YuviPanda, don't worry about that, what you have is fine [22:29:10] (03CR) 10Ottomata: wikilabels: Puppetizing wikilabels infrastructure (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/225092 (owner: 10Madhuvishy) [22:29:14] madhuvishy: ^ [22:29:17] YuviPanda: ^ [22:29:36] (03CR) 10Alexandros Kosiaris: [C: 032] maps: Include osm in role class [puppet] - 10https://gerrit.wikimedia.org/r/225503 (owner: 10Alexandros Kosiaris) [22:29:43] hmmmm [22:30:19] ottomata: thanks :) [22:34:47] PROBLEM - check_puppetrun on heka is CRITICAL Puppet last ran 29046 seconds ago, expected 28800 [22:37:37] akosiaris, dump done at http://ns512621.ip-167-114-156.net/static/cassandra.tar.gz [22:38:43] I can't download it myself because I can't write to /srv [22:38:45] (03PS1) 10Alexandros Kosiaris: maps: Use the correct classes in hiera
[22:39:26] MaxSem: ok, downloading [22:39:38] MaxSem: /tmp or gtfo [22:39:47] PROBLEM - check_puppetrun on heka is CRITICAL Puppet last ran 29346 seconds ago, expected 28800 [22:40:25] (03CR) 10Alexandros Kosiaris: [C: 032] maps: Use the correct classes in hiera [puppet] - 10https://gerrit.wikimedia.org/r/225507 (owner: 10Alexandros Kosiaris) [22:40:31] Reedy, 40 gigs [22:41:48] RECOVERY - puppet last run on maps-test2004 is OK Puppet is currently enabled, last run 20 seconds ago with 0 failures [22:42:08] RECOVERY - puppet last run on maps-test2002 is OK Puppet is currently enabled, last run 39 seconds ago with 0 failures [22:42:08] RECOVERY - puppet last run on maps-test2003 is OK Puppet is currently enabled, last run 42 seconds ago with 0 failures [22:44:47] PROBLEM - check_puppetrun on heka is CRITICAL Puppet last ran 29646 seconds ago, expected 28800 [22:49:47] PROBLEM - check_puppetrun on heka is CRITICAL Puppet last ran 29947 seconds ago, expected 28800 [22:50:44] (03PS1) 10Alex Monk: Add basic contact form for stewards [mediawiki-config] - 10https://gerrit.wikimedia.org/r/225509 (https://phabricator.wikimedia.org/T98625) [22:54:47] PROBLEM - check_puppetrun on heka is CRITICAL Puppet last ran 30247 seconds ago, expected 28800 [22:59:47] PROBLEM - check_puppetrun on heka is CRITICAL Puppet last ran 30546 seconds ago, expected 28800 [23:04:47] PROBLEM - check_puppetrun on heka is CRITICAL Puppet last ran 30846 seconds ago, expected 28800 [23:09:47] PROBLEM - check_puppetrun on heka is CRITICAL Puppet last ran 31146 seconds ago, expected 28800 [23:14:47] PROBLEM - check_puppetrun on heka is CRITICAL Puppet last ran 31446 seconds ago, expected 28800 [23:17:44] (03PS1) 10Alexandros Kosiaris: maps: give login rights to postgres users [puppet] - 10https://gerrit.wikimedia.org/r/225511 [23:19:47] PROBLEM - check_puppetrun on heka is CRITICAL Puppet last ran 31746 seconds ago, expected 28800 [23:24:47] PROBLEM - check_puppetrun on heka is CRITICAL Puppet last ran 
32046 seconds ago, expected 28800 [23:29:47] PROBLEM - check_puppetrun on heka is CRITICAL Puppet last ran 32346 seconds ago, expected 28800 [23:34:47] PROBLEM - check_puppetrun on heka is CRITICAL Puppet last ran 32646 seconds ago, expected 28800 [23:39:48] PROBLEM - check_puppetrun on heka is CRITICAL Puppet last ran 32947 seconds ago, expected 28800 [23:43:05] 6operations, 6Discovery, 10Maps, 3Discovery-Maps-Sprint: Puppetize Postgres 9.4 + Postgis 2.1 role for Maps Deployment - https://phabricator.wikimedia.org/T105070#1461637 (10Yurik) [23:44:48] PROBLEM - check_puppetrun on heka is CRITICAL Puppet last ran 33246 seconds ago, expected 28800 [23:45:33] 6operations, 10Wikimedia-Site-requests, 10Wikimedia-Video: Upload the Wikimania 2014 videos to Commons - https://phabricator.wikimedia.org/T106038#1461646 (10Matanya) [23:46:54] 6operations, 10Wikimedia-Site-requests, 10Wikimedia-Video: Upload the Wikimania 2014 videos to Commons - https://phabricator.wikimedia.org/T106038#1457394 (10Matanya) The issue here is swift can't handle this size of converted videos. I will go a head and split them to small, per talk videos and upload that... [23:48:52] 6operations, 6Discovery, 10Maps, 6Services, 3Discovery-Maps-Sprint: Puppetize Kartotherian for maps deployment - https://phabricator.wikimedia.org/T105074#1461651 (10Yurik) Service: kartotherian Description: a web server to generate vector and raster map tiles Git: maps/kartotherian/deploy Service: tile... [23:49:10] 6operations, 6Discovery, 10Maps, 6Services, 3Discovery-Maps-Sprint: Puppetize Kartotherian & Tilerator for deployment - https://phabricator.wikimedia.org/T105074#1461652 (10Yurik) [23:49:48] PROBLEM - check_puppetrun on heka is CRITICAL Puppet last ran 33546 seconds ago, expected 28800 [23:54:47] PROBLEM - check_puppetrun on heka is CRITICAL Puppet last ran 33846 seconds ago, expected 28800 [23:59:47] PROBLEM - check_puppetrun on heka is CRITICAL Puppet last ran 34146 seconds ago, expected 28800