[00:02:44] (03PS1) 10Ori.livneh: misc::statistics::packages: add libyaml-cpp0.3-dev [puppet] - 10https://gerrit.wikimedia.org/r/175390
[00:02:46] (03PS1) 10Ori.livneh: hhvm: set nofile limit to unlimited [puppet] - 10https://gerrit.wikimedia.org/r/175391
[00:03:54] (03PS2) 10Ori.livneh: misc::statistics::packages: add libyaml-cpp0.3-dev [puppet] - 10https://gerrit.wikimedia.org/r/175390
[00:04:04] (03CR) 10Ori.livneh: [C: 032 V: 032] misc::statistics::packages: add libyaml-cpp0.3-dev [puppet] - 10https://gerrit.wikimedia.org/r/175390 (owner: 10Ori.livneh)
[00:58:34] PROBLEM - HHVM rendering on mw1231 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:59:44] RECOVERY - HHVM rendering on mw1231 is OK: HTTP OK: HTTP/1.1 200 OK - 72018 bytes in 0.334 second response time
[02:10:15] !log l10nupdate Synchronized php-1.25wmf8/cache/l10n: (no message) (duration: 00m 01s)
[02:10:21] !log LocalisationUpdate completed (1.25wmf8) at 2014-11-24 02:10:21+00:00
[02:10:28] Logged the message, Master
[02:10:30] Logged the message, Master
[02:16:59] !log l10nupdate Synchronized php-1.25wmf9/cache/l10n: (no message) (duration: 00m 01s)
[02:17:03] !log LocalisationUpdate completed (1.25wmf9) at 2014-11-24 02:17:02+00:00
[02:17:03] Logged the message, Master
[02:17:06] Logged the message, Master
[03:31:26] !log LocalisationUpdate ResourceLoader cache refresh completed at Mon Nov 24 03:31:25 UTC 2014 (duration 31m 24s)
[03:31:29] Logged the message, Master
[03:32:45] PROBLEM - puppet last run on db2023 is CRITICAL: CRITICAL: Puppet has 3 failures
[03:35:08] (03CR) 1020after4: [C: 031] misc varnish: do not handle bz-attachment URLs [puppet] - 10https://gerrit.wikimedia.org/r/175128 (owner: 10Dzahn)
[03:35:38] (03PS2) 1020after4: bugzilla: delete bugs.wikipedia.org vhost [puppet] - 10https://gerrit.wikimedia.org/r/175136 (owner: 10Dzahn)
[03:35:54] RECOVERY - puppet last run on db2023 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[03:36:03] (03CR) 1020after4: [C: 031] bugzilla: delete bugs.wikipedia.org vhost [puppet] - 10https://gerrit.wikimedia.org/r/175136 (owner: 10Dzahn)
[03:36:18] (03PS2) 1020after4: bugzilla: delete bugzilla.wikiPedia.org [puppet] - 10https://gerrit.wikimedia.org/r/175139 (owner: 10Dzahn)
[03:36:37] (03CR) 1020after4: [C: 031] bugzilla: delete bugzilla.wikiPedia.org [puppet] - 10https://gerrit.wikimedia.org/r/175139 (owner: 10Dzahn)
[05:21:35] PROBLEM - HHVM busy threads on mw1114 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [90.0]
[05:29:45] RECOVERY - HHVM busy threads on mw1114 is OK: OK: Less than 1.00% above the threshold [60.0]
[06:27:55] PROBLEM - puppet last run on labcontrol2001 is CRITICAL: CRITICAL: puppet fail
[06:27:57] PROBLEM - puppet last run on searchidx1001 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:07] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:07] PROBLEM - puppet last run on mw1235 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:55] PROBLEM - puppet last run on ms-fe2001 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:05] PROBLEM - puppet last run on mw1118 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:46:24] RECOVERY - puppet last run on ms-fe2001 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[06:46:34] RECOVERY - puppet last run on mw1118 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[06:46:36] RECOVERY - puppet last run on mw1235 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[06:46:36] RECOVERY - puppet last run on searchidx1001 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[06:46:45] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[06:47:25] RECOVERY - puppet last run on labcontrol2001 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[06:47:56] RECOVERY - puppetmaster https on virt1000 is OK: HTTP OK: Status line output matched 400 - 335 bytes in 0.049 second response time
[06:59:30] (03PS1) 10Ori.livneh: hhvm::debug: provision perf-tools [puppet] - 10https://gerrit.wikimedia.org/r/175399
[07:00:22] (03CR) 10Ori.livneh: [C: 032 V: 032] hhvm::debug: provision perf-tools [puppet] - 10https://gerrit.wikimedia.org/r/175399 (owner: 10Ori.livneh)
[07:43:29] <_joe_> hey ori
[07:43:38] hey _joe_
[07:43:49] good morning
[07:43:57] <_joe_> good night!
[07:43:59] <_joe_> :)
[07:44:12] how was your weekend?
[07:44:33] <_joe_> good, and marvelously uninterrupted by any issue
[07:44:46] <_joe_> so it seems to me tomorrow is going to be our big day
[07:47:08] cool!
[08:16:46] PROBLEM - check if salt-minion is running on mw1114 is CRITICAL: PROCS CRITICAL: 5 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[08:19:16] PROBLEM - SSH on palladium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:19:57] PROBLEM - RAID on palladium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:19:57] PROBLEM - puppet last run on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:20:27] PROBLEM - check configured eth on palladium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:20:57] RECOVERY - puppet last run on mw1114 is OK: OK: Puppet is currently enabled, last run 13 minutes ago with 0 failures
[08:21:16] PROBLEM - puppetmaster backend https on palladium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:21:37] PROBLEM - check if dhclient is running on palladium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
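Editor's note: the `check if salt-minion is running` alerts above come from a Nagios/NRPE `check_procs` probe that counts processes whose argument list matches a regex. A minimal sketch of the thresholding it applies (the `classify_count` helper and the exactly-one-process-is-healthy rule are illustrative assumptions; the real check is configured in puppet):

```shell
# classify_count mirrors the process-count logic behind the alert above:
# a healthy host runs exactly one salt-minion, so the 5-6 matching
# processes reported for mw1114 indicate stale leftover forks.
classify_count() {
  if [ "$1" -eq 1 ]; then
    echo "PROCS OK: $1 process"
  else
    echo "PROCS CRITICAL: $1 processes"
  fi
}

# In a real check the count would come from the process table, e.g.:
#   count=$(pgrep -cf '^/usr/bin/python /usr/bin/salt-minion')
classify_count 1
classify_count 5
```

This is why the recovery later in the log reads "PROCS OK: 1 process": the check clears only once the duplicate minions are killed.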
[08:22:26] PROBLEM - puppetmaster https on palladium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:23:16] RECOVERY - SSH on palladium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0)
[08:23:36] RECOVERY - check configured eth on palladium is OK: NRPE: Unable to read output
[08:23:36] RECOVERY - check if dhclient is running on palladium is OK: PROCS OK: 0 processes with command name dhclient
[08:24:07] PROBLEM - puppet last run on palladium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:24:07] PROBLEM - DPKG on palladium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:24:57] RECOVERY - RAID on palladium is OK: RAID STATUS: OPTIMAL
[08:24:57] RECOVERY - puppet last run on palladium is OK: OK: Puppet is currently enabled, last run 28 minutes ago with 0 failures
[08:24:58] RECOVERY - DPKG on palladium is OK: All packages OK
[08:25:17] RECOVERY - puppetmaster backend https on palladium is OK: HTTP OK: Status line output matched 400 - 335 bytes in 4.279 second response time
[08:28:36] RECOVERY - puppetmaster https on palladium is OK: HTTP OK: Status line output matched 400 - 335 bytes in 2.615 second response time
[08:30:06] PROBLEM - check if salt-minion is running on mw1114 is CRITICAL: PROCS CRITICAL: 6 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[08:32:36] PROBLEM - check if salt-minion is running on palladium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:33:37] RECOVERY - check if salt-minion is running on palladium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[08:33:37] PROBLEM - puppetmaster backend https on palladium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:33:56] PROBLEM - puppetmaster https on palladium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:33:56] PROBLEM - check configured eth on palladium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:33:57] PROBLEM - check if dhclient is running on palladium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:34:27] PROBLEM - RAID on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:34:32] PROBLEM - puppet last run on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:34:52] RECOVERY - check configured eth on palladium is OK: NRPE: Unable to read output
[08:34:52] RECOVERY - check if dhclient is running on palladium is OK: PROCS OK: 0 processes with command name dhclient
[08:35:27] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: puppet fail
[08:36:27] PROBLEM - check if dhclient is running on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:36:37] PROBLEM - HHVM rendering on mw1114 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:36:37] PROBLEM - HHVM processes on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:36:56] PROBLEM - Disk space on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:36:56] PROBLEM - check configured eth on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:36:57] PROBLEM - DPKG on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:37:06] PROBLEM - nutcracker process on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:37:16] PROBLEM - nutcracker port on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:37:17] PROBLEM - SSH on mw1114 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:37:17] PROBLEM - Apache HTTP on mw1114 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:37:47] RECOVERY - puppetmaster backend https on palladium is OK: HTTP OK: Status line output matched 400 - 335 bytes in 7.899 second response time
[08:39:26] PROBLEM - puppet last run on cp3006 is CRITICAL: CRITICAL: Puppet has 1 failures
[08:39:36] RECOVERY - check if dhclient is running on mw1114 is OK: PROCS OK: 0 processes with command name dhclient
[08:39:37] RECOVERY - HHVM rendering on mw1114 is OK: HTTP OK: HTTP/1.1 200 OK - 72018 bytes in 5.832 second response time
[08:39:56] RECOVERY - Disk space on mw1114 is OK: DISK OK
[08:39:56] RECOVERY - check configured eth on mw1114 is OK: NRPE: Unable to read output
[08:39:57] RECOVERY - DPKG on mw1114 is OK: All packages OK
[08:39:57] RECOVERY - puppetmaster https on palladium is OK: HTTP OK: Status line output matched 400 - 335 bytes in 0.946 second response time
[08:39:57] RECOVERY - nutcracker process on mw1114 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker
[08:40:07] RECOVERY - nutcracker port on mw1114 is OK: TCP OK - 0.000 second response time on port 11212
[08:40:10] RECOVERY - SSH on mw1114 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0)
[08:40:16] RECOVERY - Apache HTTP on mw1114 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.055 second response time
[08:40:37] RECOVERY - HHVM processes on mw1114 is OK: PROCS OK: 1 process with command name hhvm
[08:40:37] RECOVERY - RAID on mw1114 is OK: OK: no RAID installed
[08:40:46] RECOVERY - puppet last run on mw1114 is OK: OK: Puppet is currently enabled, last run 33 minutes ago with 0 failures
[08:42:36] PROBLEM - puppet last run on ms-be1012 is CRITICAL: CRITICAL: Puppet has 1 failures
[08:43:36] RECOVERY - puppet last run on cp3006 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[08:43:36] PROBLEM - check if salt-minion is running on mw1114 is CRITICAL: PROCS CRITICAL: 6 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[08:44:26] PROBLEM - puppet last run on mw1092 is CRITICAL: CRITICAL: Puppet has 1 failures
[08:44:36] PROBLEM - puppet last run on mw1181 is CRITICAL: CRITICAL: Puppet has 1 failures
[08:48:26] PROBLEM - puppetmaster https on palladium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:49:57] PROBLEM - RAID on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:50:10] <_joe_> ori: ping
[08:52:01] <_joe_> nevermind
[08:53:16] PROBLEM - puppet last run on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:53:57] PROBLEM - check if dhclient is running on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:53:58] PROBLEM - HHVM processes on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:54:17] PROBLEM - HHVM rendering on mw1114 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:54:26] PROBLEM - DPKG on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:54:28] PROBLEM - Disk space on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:54:29] PROBLEM - check configured eth on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:54:36] PROBLEM - nutcracker process on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:54:46] PROBLEM - nutcracker port on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:54:47] PROBLEM - SSH on mw1114 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:54:47] PROBLEM - Apache HTTP on mw1114 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:54:50] <_joe_> I know what's wrong with this machine
[08:54:53] <_joe_> apergos: ping
[08:54:57] PROBLEM - RAID on palladium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:55:27] _joe_: h
[08:55:35] ah... sigh
[08:55:41] sorry. that would be me, shot it
[08:55:53] <_joe_> apergos: does your audit script grep files in /root?
[08:55:58] PROBLEM - puppet last run on palladium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:56:15] it shouldn't but I forgot to pass the uid cutoff
[08:56:16] PROBLEM - puppetmaster backend https on palladium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:56:30] <_joe_> because I think it's killing palladium and mw1114
[08:56:37] <_joe_> mw1114, not a real issue
[08:56:56] RECOVERY - RAID on palladium is OK: RAID STATUS: OPTIMAL
[08:57:00] well that won't happen again. however, there's something else broken anyways so
[08:57:00] RECOVERY - puppet last run on palladium is OK: OK: Puppet is currently enabled, last run 24 minutes ago with 0 failures
[08:57:06] RECOVERY - check if dhclient is running on mw1114 is OK: PROCS OK: 0 processes with command name dhclient
[08:57:15] <_joe_> but palladium is
[08:57:17] RECOVERY - RAID on mw1114 is OK: OK: no RAID installed
[08:57:17] RECOVERY - Disk space on mw1114 is OK: DISK OK
[08:57:17] RECOVERY - puppetmaster backend https on palladium is OK: HTTP OK: Status line output matched 400 - 335 bytes in 4.446 second response time
[08:57:17] RECOVERY - check configured eth on mw1114 is OK: NRPE: Unable to read output
[08:57:17] RECOVERY - DPKG on mw1114 is OK: All packages OK
[08:57:23] <_joe_> so, please kill int on palladium
[08:57:27] RECOVERY - nutcracker process on mw1114 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker
[08:57:28] <_joe_> *it
[08:57:36] RECOVERY - nutcracker port on mw1114 is OK: TCP OK - 0.000 second response time on port 11212
[08:57:37] RECOVERY - SSH on mw1114 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0)
[08:57:57] RECOVERY - HHVM processes on mw1114 is OK: PROCS OK: 1 process with command name hhvm
[08:58:20] already had
[08:58:36] RECOVERY - puppetmaster https on palladium is OK: HTTP OK: Status line output matched 400 - 335 bytes in 7.211 second response time
[08:58:47] RECOVERY - Apache HTTP on mw1114 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 9.022 second response time
[08:59:17] RECOVERY - HHVM rendering on mw1114 is OK: HTTP OK: HTTP/1.1 200 OK - 72018 bytes in 1.016 second response time
[08:59:35] it should only be looking in one subdirectory for any of these; obviously I broke something in the script, it won't happen again
[08:59:57] RECOVERY - puppet last run on ms-be1012 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[09:00:47] RECOVERY - puppet last run on mw1092 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[09:00:57] RECOVERY - puppet last run on mw1181 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[09:04:17] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[09:05:57] <_joe_> !log restarted apache2 on palladium, apache and hhvm on mw1114; killed audit script leftovers on mw1114
[09:06:06] RECOVERY - check if salt-minion is running on mw1114 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[09:07:26] RECOVERY - puppet last run on mw1114 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[09:14:35] thanks; I restarted saltmaster on palladium and killed the old jobs cache too
[09:17:14] <_joe_> ok thanks
[09:32:26] (03PS4) 10Filippo Giunchedi: txstatsd: gather runtime self metrics under statsd [puppet] - 10https://gerrit.wikimedia.org/r/174675
[09:32:33] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] txstatsd: gather runtime self metrics under statsd [puppet] - 10https://gerrit.wikimedia.org/r/174675 (owner: 10Filippo Giunchedi)
[09:32:49] greetings
[09:37:37] PROBLEM - DPKG on tungsten is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[09:37:57] that's me btw on tungsten
[09:41:47] RECOVERY - DPKG on tungsten is OK: All packages OK
[09:50:11] hello
[09:51:57] <_joe_> good morning
[09:57:30] _joe_: lots of hhvm alerts
[09:57:38] WARNINGs and UNKNOWNs
[09:58:35] <_joe_> paravoid: I think they are from graphite being flaky
[09:58:50] also PROCS WARNING: 2 processes with command name 'hhvm'
[09:59:07] <_joe_> paravoid: yeah that's stupid, I should raise that value
[09:59:20] <_joe_> as well, our code shells out to 'php'
[09:59:58] <_joe_> so it is expected to happen
[10:01:43] _joe_: we can set mediawiki to shell out to some other command (e.g. php5 or hhvm) by setting $wgPhpCli in mw config
[10:02:38] <_joe_> hashar: we want to use hhvm in fact, even if cli performance sucks
[10:03:11] _joe_: I noticed on the Trusty Jenkins slave that 'php' points to the Zend version
[10:03:22] there is a Debian alternative but hhvm has a lower priority
[10:03:32] <_joe_> hashar: mmmh not true on the mediawiki appservers
[10:03:56] <_joe_> and I'm pretty sure we do install the alternative and set it as a default in the packages
[10:04:00] yeah the CI slaves no longer have hhvm enforced by puppet (some manifest has been rewritten meanwhile)
[10:04:07] <_joe_> so... a contint slave issue I think
[10:04:32] yeah
[10:05:12] <_joe_> godog: seems graphite is not responding, but you may know it :)
[10:06:57] _joe_: I found a very small script that takes 200+ ms with hhvm vs 40ms with Zend :-/ So ended up forcing use of php5. https://gerrit.wikimedia.org/r/#/c/172756/3/tools/mwcore-docgen.sh,unified :D
[10:06:59] _joe_: the ui? I'm looking at gdash and it seems to be working
[10:07:17] <_joe_> hashar: did you configure hhvm for the cli properly?
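Editor's note: the `php` vs. hhvm exchange above hinges on Debian's alternatives mechanism: each provider of `/usr/bin/php` registers with a priority, and in automatic mode the highest priority wins, so if hhvm is registered below Zend's `php5` then `php` keeps resolving to Zend, as hashar observed on the Trusty CI slave. A toy sketch of that selection rule (the helper, the paths, and the priority values are illustrative assumptions, not the real puppet-managed configuration):

```shell
# pick_alternative prints the provider with the highest priority,
# mimicking update-alternatives' automatic mode.
# Input lines have the form "path priority".
pick_alternative() {
  sort -k2 -n -r | head -n1 | cut -d' ' -f1
}

# Zend registered at 50, hhvm at 20: automatic mode resolves to Zend,
# which is why 'php' pointed at the Zend binary on the CI slave above.
printf '%s\n' '/usr/bin/php5 50' '/usr/bin/hhvm 20' | pick_alternative
```

On a real host the registered providers and priorities can be inspected with `update-alternatives --query php`.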
[10:07:36] <_joe_> godog: well all icinga checks (or most of them) are unknown right now
[10:07:50] _joe_: it is not configured at all :D
[10:07:52] <_joe_> but they're recovering it seems
[10:08:08] <_joe_> hashar: ok, I don't really have time to look into that now
[10:08:20] ah yeah that's likely txstatsd restarting
[10:08:23] (03PS1) 10Glaisher: Enable "Form Refresh" as a BetaFeature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175406
[10:08:27] _joe_: that was just for info :-) will poke it later on
[10:18:51] (03PS1) 10Glaisher: Enable WikiLove extension on zhwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175407
[10:42:43] !log backfilling old txstatsd metrics from / to statsd/ on tungsten
[10:42:50] Logged the message, Master
[10:45:31] <_joe_> godog: have you thought about how to group stats?
[10:46:03] <_joe_> (the server.* stats that we don't group right now in clusters)
[10:47:10] godog: have you dropped the Graphite hierarchy for /jenkins/ci/ ?
[10:48:38] _joe_: not in any detailed way, no, however carbon-c-relay might be able to aggregate those for example
[10:49:00] hashar: not yet, however txstatsd got restarted so they shouldn't get updated
[10:49:15] <_joe_> godog: ok I will take a look this week and start a more serious discussion
[10:49:43] yep sounds good
[10:51:27] hashar: I take it we're good to have it deleted though?
[10:55:24] godog: yes. I am looking for the bug/Task :)
[10:56:14] godog: https://phabricator.wikimedia.org/T1075#23924 :D
[10:56:15] (03PS1) 10Giuseppe Lavagetto: hhvm: define more jit configurations [puppet] - 10https://gerrit.wikimedia.org/r/175412
[10:56:29] happy days
[10:56:40] to aggregate Zuul metrics that is a feature request for upstream ( https://phabricator.wikimedia.org/T1369 )
[10:58:52] !log moved jenkins.ci under archived.jenkins.ci on tungsten, see T1075
[10:58:56] Logged the message, Master
[10:59:19] hashar: yeah that'd be nice too
[11:01:04] godog: no need to archive, you can get rid of them entirely :]
[11:01:15] godog: none of the collected metrics have any use case
[11:05:44] hashar: haha okay, I'll delete them tomorrow
[11:26:26] (03PS1) 10Filippo Giunchedi: codfw: provision ms-be 2013/2014/2015 [dns] - 10https://gerrit.wikimedia.org/r/175416
[12:05:18] (03CR) 10Faidon Liambotis: [C: 031] "Sounds great :)" [puppet] - 10https://gerrit.wikimedia.org/r/174932 (owner: 10Filippo Giunchedi)
[12:13:47] PROBLEM - puppet last run on cp3009 is CRITICAL: CRITICAL: puppet fail
[12:21:06] (03PS1) 10QChris: Add its-phabricator-from-bugzilla f9fd2db7a62119ab9a6d1adfd3110b6e59b7a872 [gerrit/plugins] - 10https://gerrit.wikimedia.org/r/175421
[12:28:10] (03PS1) 10Faidon Liambotis: reprepro: add update source "cassandra" to trusty [puppet] - 10https://gerrit.wikimedia.org/r/175423
[12:30:27] (03CR) 10Faidon Liambotis: [C: 032] reprepro: add update source "cassandra" to trusty [puppet] - 10https://gerrit.wikimedia.org/r/175423 (owner: 10Faidon Liambotis)
[12:32:17] RECOVERY - puppet last run on cp3009 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[12:37:43] !log Added gerrit plugin its-phabricator-from-bugzilla (f9fd2db7a62119ab9a6d1adfd3110b6e59b7a872)
[12:37:49] Logged the message, Master
[12:38:45] (03PS1) 10Giuseppe Lavagetto: mediawiki: adjust hhvm max threads to number of cpus as well [puppet] - 10https://gerrit.wikimedia.org/r/175424
[12:38:47] (03PS1) 10Giuseppe Lavagetto: mediawiki: allow toggling experimental hhvm settings [puppet] - 10https://gerrit.wikimedia.org/r/175425
[12:39:01] (03PS2) 10QChris: Add its-phabricator-from-bugzilla f9fd2db7a62119ab9a6d1adfd3110b6e59b7a872 [gerrit/plugins] - 10https://gerrit.wikimedia.org/r/175421
[12:39:11] (03PS3) 10QChris: Add its-phabricator-from-bugzilla f9fd2db7a62119ab9a6d1adfd3110b6e59b7a872 [gerrit/plugins] - 10https://gerrit.wikimedia.org/r/175421
[12:39:41] (03CR) 10jenkins-bot: [V: 04-1] mediawiki: allow toggling experimental hhvm settings [puppet] - 10https://gerrit.wikimedia.org/r/175425 (owner: 10Giuseppe Lavagetto)
[12:41:07] qchris: oh, so will it comment on phab now?
[12:42:05] YuviPanda: Yes, if you use "Bug: $SOME_BUG_NR_FROM_BUGZILLA", it will use the magic +2000 trick (see T1327) and comment on the corresponding phabricator task.
[12:42:13] ah, cool
[12:42:26] but if I just use the Task number?
[12:42:28] will it comment too?
[12:42:32] Sure.
[12:42:46] !ping
[12:42:47] That has been implemented for some weeks now.
[12:43:04] Uhhh any ops people, I'm trying to think critically and I could use a number
[12:43:22] I'm wondering if we try to clear out the upload stash at any regular interval that might explain a race condition
[12:43:38] Like "oh, they uploaded that but it's been idle for an hour, screw them"
[12:44:11] marktraceur: we do clean up the upload stash every hour indeed
[12:44:23] er, no, every day
[12:44:28] paravoid: At what time? :)
[12:44:33] misc::maintenance::cleanup_upload_stash
[12:44:38] puppet:manifests/misc/maintenance.pp
[12:44:39] Or is it an "are you 24 hours old?" check?
[12:44:44] hour => 1,
[12:44:44] minute => 0,
[12:44:52] Because if not, that would totally explain what's going on
[12:45:44] what's going on with what?
[12:45:51] $wgUploadStashMaxAge
[12:45:52] Uh
[12:46:11] paravoid: I'm still not totally sure. victorgrigas came into -commons last night and gave me an interesting new error to play with
[12:46:20] But it turns out three totally separate conditions make it happen
[12:46:26] So I'm trying to sort out which one actually did
[12:46:53] (03PS2) 10Giuseppe Lavagetto: mediawiki: allow toggling experimental hhvm settings [puppet] - 10https://gerrit.wikimedia.org/r/175425
[12:46:55] Basically it says "invalid path given to uploadstash", but it could also mean an improperly formatted file key.
[12:47:02] For no adequate reason
[12:47:51] (03Abandoned) 10Giuseppe Lavagetto: varnish: remove fallback to Zend for HHVM. [puppet] - 10https://gerrit.wikimedia.org/r/163555 (owner: 10Giuseppe Lavagetto)
[12:49:47] Maximum age 6 hours... so if someone went to bed and came back the next morning, that could be a serious issue.
[12:50:27] paravoid: Do you foresee any issue with, or would you like more information about, raising that to 24 hours (or even longer)?
[12:51:38] <_joe_> !log restarting mw1230, with hyperthreading enabled
[12:51:41] Logged the message, Master
[12:52:23] we don't have Solr anywhere anymore, do we?
[12:52:27] * YuviPanda is clearing out ops backlog in phab
[12:53:06] <_joe_> we do
[12:53:45] TTM hasn't been replaced yet afaik
[12:53:49] Nikerabbit: ^
[12:54:17] PROBLEM - Host mw1230 is DOWN: PING CRITICAL - Packet loss = 100%
[12:54:27] <_joe_> ach I thought I'd be faster
[12:54:29] <_joe_> sorry
[12:57:36] RECOVERY - Host mw1230 is UP: PING OK - Packet loss = 0%, RTA = 2.30 ms
[12:59:18] (03PS1) 10Steinsplitter: Adding *.commonists.org to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175426
[13:01:26] <_joe_> cmjohnson: hi! I plan on installing quite a few of the remaining servers this afternoon
[13:02:02] _joe_ great... my plan is to get the last 8 in a rack and set up for you
[13:02:19] <_joe_> great!
[13:03:34] paravoid: ah, ok.
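Editor's note on the race being debugged above: files parked in the upload stash are discarded once they exceed `$wgUploadStashMaxAge`, and the `misc::maintenance::cleanup_upload_stash` cron quoted above (hour => 1, minute => 0) sweeps expired entries daily. A hedged sketch of the expiry rule only (the `stash_expired` helper and its hour-based interface are illustrative, not MediaWiki's actual API):

```shell
# With a six-hour $wgUploadStashMaxAge, any upload parked longer than
# that is swept -- exactly the "went to bed, came back the next
# morning" failure described above.
max_age_hours=6   # the limit Gerrit change 175428 proposes raising to 24

stash_expired() {
  # $1 = hours since the file was stashed
  if [ "$1" -ge "$max_age_hours" ]; then echo expired; else echo kept; fi
}

stash_expired 2    # user still mid-upload
stash_expired 10   # overnight pause
```

Raising the cutoff to 24 hours, as proposed, makes the overnight case survive while still bounding stash growth.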
[13:08:00] (03PS1) 10MarkTraceur: Raise stash expiration limit on Commons for UW [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175428
[13:08:22] I'm also going to write a patch that should reduce the lost session issues
[13:08:39] If someone closes the window or navigates away, I'll try to delete the file first
[13:12:46] PROBLEM - NTP on mw1230 is CRITICAL: NTP CRITICAL: Offset unknown
[13:16:47] RECOVERY - NTP on mw1230 is OK: NTP OK: Offset -0.001772403717 secs
[13:16:53] YuviPanda: how's shinken?
[13:17:17] (03CR) 10Jean-Frédéric: [C: 031] Adding *.commonists.org to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175426 (owner: 10Steinsplitter)
[13:17:20] paravoid: not bad at all, actually. the packaging situation is a bit terrible, but outside of that... quite nice.
[13:17:42] paravoid: the community also seems nice, except for... all the french :)
[13:19:04] mark: (mr1-esams) ge-0/0/0 up down Core: << msw-oe12-esams
[13:20:55] (03PS2) 10ArielGlenn: audit ssh key use on production cluster [software] - 10https://gerrit.wikimedia.org/r/174408
[13:21:48] (03CR) 10Gilles: [C: 031] Raise stash expiration limit on Commons for UW [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175428 (owner: 10MarkTraceur)
[13:22:29] (03CR) 10ArielGlenn: "new patchset is a rewrite to use a handful of salt calls across all hosts instead of calls per user which was wasteful; also pep8 and pyli" [software] - 10https://gerrit.wikimedia.org/r/174408 (owner: 10ArielGlenn)
[13:26:54] paravoid: i know
[13:28:03] (03CR) 10Faidon Liambotis: "I don't think the mapping is 1:1; mobile frontend needs to explicitly be enabled per-wiki and in some cases we need to also set up Varnish" [dns] - 10https://gerrit.wikimedia.org/r/171769 (https://bugzilla.wikimedia.org/38799) (owner: 10Dzahn)
[13:39:14] <_joe_> bbl
[13:39:44] paravoid: I still haven't actually scaled it horizontally. one machine is handling our current load.
[13:40:47] cool
[13:40:51] https://phabricator.wikimedia.org/T41789
[13:41:01] wow, wikimedia labs BZ/Phab has some really old tickets :)
[13:42:06] heh
[13:47:06] it's actually resolved only in january 2014 ;)
[13:51:11] paravoid: what do you mean? TTM is now using ElasticSearch
[13:51:25] And a bug about slow TTM was closed
[13:51:48] oh is it
[13:52:06] well okay, since last Tuesday
[13:52:48] I missed last week's SoS
[13:53:26] no one has filed a ticket for getting that machine back to my knowledge though
[13:53:36] I'll take care of it
[13:57:04] (03PS1) 10QChris: Add qchris to analytics contact group [puppet] - 10https://gerrit.wikimedia.org/r/175430
[14:21:55] !log performing test restart of elastic1002 to see what a rolling restart would be like while serving enwiki's searches
[14:22:01] Logged the message, Master
[14:22:35] sounds scary :)
[14:24:20] aude: https://phabricator.wikimedia.org/T75739
[14:24:47] paravoid: shouldn't be - I just think we'll do a rolling restart sometime in the next week and figure I should know beforehand if it hurts. I really shouldn't
[14:25:38] congrats btw, must feel amazing to finally have migrated everything
[14:26:09] paravoid: yeah! and a tiny bit scary.
[14:26:13] but no calls over the weekend
[14:29:37] amazing indeed
[14:30:00] But there's so much more that CirrusSearch is going to give us in the future! Especially interwiki search. :)
[14:32:35] (03CR) 10Ottomata: [C: 032 V: 032] Increase kill timeout on kafka shutdown [debs/kafka] - 10https://gerrit.wikimedia.org/r/175011 (owner: 10Plucas)
[14:35:34] (03PS2) 10Ottomata: Add qchris to analytics contact group [puppet] - 10https://gerrit.wikimedia.org/r/175430 (owner: 10QChris)
[14:37:11] (03CR) 10Ottomata: [C: 032] Add qchris to analytics contact group [puppet] - 10https://gerrit.wikimedia.org/r/175430 (owner: 10QChris)
[14:43:04] manybubbles: still not fixed :(
[14:43:08] ?
[14:43:33] aude: I just saw it this morning
[14:43:45] I tried searching phab for it and didn't see it so I filed a new one
[14:43:49] is it a dupe?
[14:44:05] ok
[14:44:14] it is a dupe and I thought we fixed it
[14:44:24] shall have someone look into it
[14:44:25] (03PS1) 10Giuseppe Lavagetto: varnish: remove redirection to the hhvm pool [puppet] - 10https://gerrit.wikimedia.org/r/175432
[14:44:26] !log restarting the elasticsearch server didn't cause any hiccups. Rolling restart should be totally ok.
[14:44:27] (03PS1) 10Giuseppe Lavagetto: mediawiki: convert hhvm appservers to be part of the common pool [puppet] - 10https://gerrit.wikimedia.org/r/175433
[14:44:29] (03PS1) 10Giuseppe Lavagetto: lvs: remove the HHVM specialized pools [puppet] - 10https://gerrit.wikimedia.org/r/175434
[14:44:30] Logged the message, Master
[14:44:32] aude: thanks!
[14:44:35] ottomata: besides the thing I mailed you about, there are a few other analytics alerts
[14:44:55] thanks for poking :)
[14:46:18] ok thanks paravoid, am going through email now
[14:47:07] interesting.
[14:47:08] k
[14:47:10] aude: of course!
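Editor's note: the rolling-restart test logged above follows the usual pattern of bouncing one Elasticsearch node at a time and waiting for cluster health to return to green before moving on. A sketch with the health-parsing helper factored out so it is runnable standalone (the host names and restart loop are assumptions; `_cluster/health` is the standard Elasticsearch health endpoint):

```shell
# is_green tests an Elasticsearch _cluster/health JSON payload; a
# rolling restart would poll it after bouncing each node, roughly:
#   for host in elastic1001 elastic1002; do
#     ssh "$host" sudo service elasticsearch restart
#     until is_green "$(curl -s http://$host:9200/_cluster/health)"; do
#       sleep 10
#     done
#   done
is_green() {
  printf '%s' "$1" | grep -q '"status" *: *"green"'
}

is_green '{"cluster_name":"production-search","status":"green"}' && echo healthy
is_green '{"status":"yellow"}' || echo "still recovering"
```

Waiting for green (all replicas assigned) rather than yellow is the conservative choice while the cluster is serving live search traffic, as in the enwiki test above.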
[14:50:39] (03CR) 10RobH: [C: 031] codfw: provision ms-be 2013/2014/2015 [dns] - 10https://gerrit.wikimedia.org/r/175416 (owner: 10Filippo Giunchedi)
[14:53:59] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] codfw: provision ms-be 2013/2014/2015 [dns] - 10https://gerrit.wikimedia.org/r/175416 (owner: 10Filippo Giunchedi)
[15:05:46] (03Abandoned) 10Yuvipanda: shinken: Fix notification commands to make email work [puppet] - 10https://gerrit.wikimedia.org/r/174780 (owner: 10Yuvipanda)
[15:06:35] (03CR) 10Yuvipanda: kill facilities.pp, move to nagios_common (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/173999 (owner: 10Dzahn)
[15:10:03] PROBLEM - puppet last run on analytics1027 is CRITICAL: CRITICAL: Puppet last ran 3 days ago
[15:11:03] RECOVERY - puppet last run on analytics1027 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[15:16:38] (03CR) 10Chad: [C: 032 V: 032] Add its-phabricator-from-bugzilla f9fd2db7a62119ab9a6d1adfd3110b6e59b7a872 [gerrit/plugins] - 10https://gerrit.wikimedia.org/r/175421 (owner: 10QChris)
[15:19:09] (03PS1) 10Filippo Giunchedi: install-server add ms-be 2013/14/15 [puppet] - 10https://gerrit.wikimedia.org/r/175437
[15:20:04] (03PS2) 10Filippo Giunchedi: install-server add ms-be 2013/2014/2015 [puppet] - 10https://gerrit.wikimedia.org/r/175437
[15:20:28] (03PS3) 10Filippo Giunchedi: install-server add ms-be 2013/2014/2015 [puppet] - 10https://gerrit.wikimedia.org/r/175437
[15:20:38] (03PS4) 10Filippo Giunchedi: install-server add ms-be 2013/2014/2015 [puppet] - 10https://gerrit.wikimedia.org/r/175437
[15:20:43] RECOVERY - Disk space on analytics1021 is OK: DISK OK
[15:21:06] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] install-server add ms-be 2013/2014/2015 [puppet] - 10https://gerrit.wikimedia.org/r/175437 (owner: 10Filippo Giunchedi)
[15:24:58] (03PS2) 10Filippo Giunchedi: install-server: add swift cache partition [puppet] - 10https://gerrit.wikimedia.org/r/174932
[15:25:12] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] install-server: add swift cache partition [puppet] - 10https://gerrit.wikimedia.org/r/174932 (owner: 10Filippo Giunchedi)
[15:31:13] ACKNOWLEDGEMENT - Disk space on analytics1003 is CRITICAL: Connection refused by host ottomata analytics1003 is not well.
[15:31:14] ACKNOWLEDGEMENT - NTP on analytics1003 is CRITICAL: NTP CRITICAL: No response from NTP server ottomata analytics1003 is not well.
[15:31:14] ACKNOWLEDGEMENT - RAID on analytics1003 is CRITICAL: Connection refused by host ottomata analytics1003 is not well.
[15:31:14] ACKNOWLEDGEMENT - SSH on analytics1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds ottomata analytics1003 is not well.
[15:31:14] ACKNOWLEDGEMENT - check configured eth on analytics1003 is CRITICAL: Connection refused by host ottomata analytics1003 is not well.
[15:31:14] ACKNOWLEDGEMENT - check if dhclient is running on analytics1003 is CRITICAL: Connection refused by host ottomata analytics1003 is not well.
[15:31:14] ACKNOWLEDGEMENT - check if salt-minion is running on analytics1003 is CRITICAL: Connection refused by host ottomata analytics1003 is not well.
[15:31:15] ACKNOWLEDGEMENT - puppet last run on analytics1003 is CRITICAL: Connection refused by host ottomata analytics1003 is not well.
[15:32:02] ACKNOWLEDGEMENT - Disk space on analytics1003 is CRITICAL: Connection refused by host ottomata analytics1003 is not well.
[15:38:06] (03PS1) 10ArielGlenn: first try at jenkins plugin to check prod vs ldap ssh keys [puppet] - 10https://gerrit.wikimedia.org/r/175442 [15:40:22] (03CR) 10ArielGlenn: "I'd prefer to run this only when data.yaml is being updated, since it needs an ldap callout on every run, not sure how to do that" [puppet] - 10https://gerrit.wikimedia.org/r/175442 (owner: 10ArielGlenn) [15:42:46] PROBLEM - mediawiki-installation DSH group on mw1249 is CRITICAL: Host mw1249 is not in mediawiki-installation dsh group [15:42:57] PROBLEM - puppet last run on mw1251 is CRITICAL: CRITICAL: Puppet has 112 failures [15:43:29] PROBLEM - mediawiki-installation DSH group on mw1252 is CRITICAL: Host mw1252 is not in mediawiki-installation dsh group [15:43:29] PROBLEM - mediawiki-installation DSH group on mw1239 is CRITICAL: Host mw1239 is not in mediawiki-installation dsh group [15:43:46] PROBLEM - mediawiki-installation DSH group on mw1250 is CRITICAL: Host mw1250 is not in mediawiki-installation dsh group [15:43:58] PROBLEM - puppet last run on mw1252 is CRITICAL: CRITICAL: Puppet has 112 failures [15:43:58] PROBLEM - puppet last run on mw1239 is CRITICAL: CRITICAL: Puppet has 112 failures [15:44:26] PROBLEM - mediawiki-installation DSH group on mw1253 is CRITICAL: Host mw1253 is not in mediawiki-installation dsh group [15:45:07] PROBLEM - puppet last run on mw1253 is CRITICAL: CRITICAL: Puppet has 112 failures [15:47:29] manybubbles, ^d, marktraceur: Did any of you have a burning desire to SWAT this morning, or should I just do it? [15:47:42] anomie: I can do it! I haven't done it in a while [15:47:48] manybubbles: Ok [15:48:01] I didn't see anything in the list [15:48:06] <^d> I see one config patch [15:48:11] I stand corrected! [15:48:17] I haven't done it in even longer, but go nuts [15:50:00] it's not a config patch, actually. [15:50:10] looks like anomie already reviewed it. [15:50:18] just needs a submodule update [15:50:33] <_joe_> uh, swat this morning? 
[15:50:48] <_joe_> can you give me a heads-up before starting? [15:50:56] <_joe_> I am installing a few appservers right now [15:51:18] _joe_: sure [15:51:43] got an eta on when I should start? plan is in 10 minutes but I can certainly wait. there is only 1 patch this morning [15:52:08] <_joe_> well, I'd say in 20 we should be fine-ish [15:52:36] anomie: I just added a beta config patch to swat [15:52:52] errr manybubbles [15:53:16] bd808: kk [15:53:26] RECOVERY - puppet last run on mw1253 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [15:53:27] RECOVERY - puppet last run on mw1251 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [15:53:32] tonythomas: around for when I SWAT your patch in a few minutes? [15:54:28] manybubbles: I am now ! [15:54:41] tonythomas: cool. Looks like it'll go out in about 20 [15:54:48] is it something you can verify? [15:55:04] great :) it was merged, right ? [15:55:51] manybubbles: anomie thinks this too should go with it - https://gerrit.wikimedia.org/r/#/c/175445/ [15:56:13] Oh! I have a thing for SWAT actually. [15:56:47] tonythomas, manybubbles: In master, for code cleanliness. The backport isn't so necessary. [15:56:56] (03PS1) 10Giuseppe Lavagetto: dsh: add new mediawiki appservers [puppet] - 10https://gerrit.wikimedia.org/r/175446 [15:57:00] anomie: okey in that case [15:57:20] anomie: agree [15:57:23] marktraceur: oh, well then! [15:57:32] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] dsh: add new mediawiki appservers [puppet] - 10https://gerrit.wikimedia.org/r/175446 (owner: 10Giuseppe Lavagetto) [15:57:40] manybubbles: Added for you [15:57:45] thanks [15:58:30] (03CR) 10Manybubbles: [C: 031] "I'm going to assume that blob of config does what you say it does. I'll +2 and SWAT it soon." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/174988 (owner: 10BryanDavis) [15:59:02] There's nothing to test other than "the site is still up", so... 
[15:59:42] manybubbles: heh. I'm sort of assuming it does what it says it does at this point too. Luckily it's beta only and will be easy to revert if it goes boom. [15:59:55] <_joe_> manybubbles: we're almost ready [15:59:56] bd808: k [16:00:03] _joe_: oh cool. great timing [16:00:04] manybubbles, anomie, ^d, marktraceur: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141124T1600). [16:00:06] I'm being slower [16:02:54] PROBLEM - mediawiki-installation DSH group on mw1255 is CRITICAL: Host mw1255 is not in mediawiki-installation dsh group [16:03:34] PROBLEM - puppet last run on mw1255 is CRITICAL: CRITICAL: Puppet has 112 failures [16:04:24] PROBLEM - mediawiki-installation DSH group on mw1254 is CRITICAL: Host mw1254 is not in mediawiki-installation dsh group [16:04:48] marktraceur: I've had trouble with just setting wg variables in initializesettings. [16:04:54] PROBLEM - puppet last run on mw1254 is CRITICAL: CRITICAL: Puppet has 112 failures [16:04:58] I'm not sure what the magic is but it never works properly for me [16:05:19] I always set a wmg and then copy to a wg [16:05:25] manybubbles: Oh, well, the adjacent settings are set that way. I can double-check if you'd like [16:06:51] <_joe_> manybubbles: if you have another few minutes [16:06:57] <_joe_> I have a couple more machines to add [16:06:59] <_joe_> :) [16:07:09] sure sure [16:07:13] PROBLEM - HHVM rendering on mw1255 is CRITICAL: HTTP CRITICAL - No data received from host [16:07:35] marktraceur: yeah - probably a good call. I made the mistake of trusting that a long time ago and it wasn't pretty when I found out it hadn't taken [16:08:09] manybubbles: Check. 
[16:08:25] (03PS2) 10MarkTraceur: Raise stash expiration limit on Commons for UW [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175428 [16:10:33] (03PS1) 10Giuseppe Lavagetto: dsh: add two more servers to the list [puppet] - 10https://gerrit.wikimedia.org/r/175448 [16:11:04] RECOVERY - puppet last run on mw1254 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [16:11:28] (03CR) 10Giuseppe Lavagetto: [C: 032] dsh: add two more servers to the list [puppet] - 10https://gerrit.wikimedia.org/r/175448 (owner: 10Giuseppe Lavagetto) [16:12:38] _joe_: I'm now finally ready to click all the buttons when you are ready for me [16:12:54] <_joe_> 1 sec [16:13:12] <_joe_> manybubbles: waiting for the puppet run to act on tin [16:14:04] <_joe_> manybubbles: go :) [16:14:21] _joe_: ack [16:14:31] tonythomas: ok - you are first [16:14:54] (03CR) 10Manybubbles: [C: 032] Use Monolog provider for beta logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/174988 (owner: 10BryanDavis) [16:15:03] (03Merged) 10jenkins-bot: Use Monolog provider for beta logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/174988 (owner: 10BryanDavis) [16:15:03] bd808: actually, I lied, you are first [16:15:22] FIRST! [16:15:50] * tonythomas is here [16:15:57] !log manybubbles Synchronized wmf-config/logging-labs.php: SWAT update for labs - should be noop in production (duration: 00m 06s) [16:16:00] Logged the message, Master [16:16:19] bd808: ^ [16:16:33] (03CR) 10Manybubbles: [C: 032] Raise stash expiration limit on Commons for UW [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175428 (owner: 10MarkTraceur) [16:16:40] marktraceur: here comes yours [16:16:41] * bd808 watches the beta scap -- https://integration.wikimedia.org/ci/view/Beta/job/beta-scap-eqiad/31092/console [16:16:44] manybubbles: Cool [16:17:28] marktraceur: awe, nuts. I found a problem :( [16:17:35] manybubbles: What's that? 
[16:17:37] I haven't synced it because I think it's missing an isset [16:17:47] (03CR) 10Manybubbles: [C: 04-1] Raise stash expiration limit on Commons for UW (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175428 (owner: 10MarkTraceur) [16:18:00] cool! I caught it before jenkins merged it [16:18:21] I'm no php expert but I think isset is needed there or we get 121234091823412304971234132 warnings per second [16:18:49] Fun. [16:19:00] marktraceur: can you check? [16:19:17] I agree, changing it [16:19:41] (03PS3) 10MarkTraceur: Raise stash expiration limit on Commons for UW [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175428 [16:20:14] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.00664451827243 [16:20:28] slow scap in beta is slow :( and my change is still waiting behind it. [16:20:33] manybubbles: Fixed I think [16:22:00] !log manybubbles Synchronized php-1.25wmf9/extensions/BounceHandler/: SWAT update bounce handler to use right db (duration: 00m 06s) [16:22:03] Logged the message, Master [16:22:19] tonythomas: ^^^^ you are synced [16:22:22] marktraceur: checking [16:22:27] looks good - the change will show up right away? 
[16:22:53] * tonythomas goes to create a bounce [16:23:15] (03PS1) 10Cmjohnson: Adding mw1221-1226 dhcpd [puppet] - 10https://gerrit.wikimedia.org/r/175449 [16:23:18] (03CR) 10Manybubbles: [C: 032] Raise stash expiration limit on Commons for UW [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175428 (owner: 10MarkTraceur) [16:23:34] (03Merged) 10jenkins-bot: Raise stash expiration limit on Commons for UW [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175428 (owner: 10MarkTraceur) [16:24:07] Yay [16:24:14] (03PS2) 10Cmjohnson: Adding mw1221-1226 dhcpd [puppet] - 10https://gerrit.wikimedia.org/r/175449 [16:24:17] !log manybubbles Synchronized wmf-config/: SWAT update config for stash limit in upload wizard (duration: 00m 06s) [16:24:19] Logged the message, Master [16:24:25] marktraceur: ^^^^ [16:24:36] manybubbles: Thanks, the site is still up (which is what's most important of course) [16:24:37] PROBLEM - mediawiki-installation DSH group on mw1257 is CRITICAL: Host mw1257 is not in mediawiki-installation dsh group [16:24:51] marktraceur: and we didn't get 12394872193471324 log messages [16:24:51] finally - we see bounce records in the table !! yay !!! [16:24:58] manybubbles: I have a file in my personal stash that I'll check at 16:30 to see if it's still there [16:25:04] Jeff_Green: ^^ party ! [16:25:16] PROBLEM - puppet last run on mw1257 is CRITICAL: CRITICAL: Puppet has 112 failures [16:25:20] (03CR) 10Cmjohnson: [C: 032] Adding mw1221-1226 dhcpd [puppet] - 10https://gerrit.wikimedia.org/r/175449 (owner: 10Cmjohnson) [16:25:26] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0 [16:27:15] ok! all done! going to go get a coffee now. no logs. if anything freaks out bother another SWATer. We're all around. 
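The isset catch in the SWAT exchange above is the classic guard against flooding the logs: in PHP, reading an unset array key emits a notice on every request. The real patch is PHP in mediawiki-config; this is a minimal Python analog of the same pattern, with illustrative setting names:

```python
# Python analog of PHP's isset() guard for config reads. The dict and
# key names here are illustrative, not the actual wmf-config contents.

def get_setting(settings, key, default=None):
    """Return settings[key] if present, else default -- no warning spam."""
    return settings[key] if key in settings else default

conf = {"wgUploadStashMaxAge": 6 * 3600}

# Guarded read: a missing key falls back quietly, instead of the
# unguarded PHP equivalent emitting a notice per request.
assert get_setting(conf, "wgUploadStashMaxAge") == 21600
assert get_setting(conf, "wgMissingSetting", 0) == 0
```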
[16:28:35] (03CR) 10BryanDavis: "PHP Fatal error: Call to undefined method MWLoggerLegacyLogger::setFormatter() in /mnt/srv/mediawiki-staging/php-master/includes/debug/lo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/174988 (owner: 10BryanDavis) [16:28:46] (03PS1) 10BryanDavis: Revert "Use Monolog provider for beta logging" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175450 [16:29:27] (03PS2) 10BryanDavis: Revert "Use Monolog provider for beta logging" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175450 [16:29:35] (03CR) 10BryanDavis: [C: 032] Revert "Use Monolog provider for beta logging" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175450 (owner: 10BryanDavis) [16:29:43] (03Merged) 10jenkins-bot: Revert "Use Monolog provider for beta logging" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175450 (owner: 10BryanDavis) [16:30:46] !log bd808 Synchronized wmf-config/logging-labs.php: Revert monolog logging config (duration: 00m 05s) [16:30:50] Logged the message, Master [16:38:17] !log moved virt1000* certs out of /etc/ssl to verify that they are no longer used [16:38:22] Logged the message, Master [16:42:47] RECOVERY - mediawiki-installation DSH group on mw1249 is OK: OK [16:43:37] RECOVERY - mediawiki-installation DSH group on mw1252 is OK: OK [16:43:38] RECOVERY - mediawiki-installation DSH group on mw1239 is OK: OK [16:43:39] PROBLEM - mediawiki-installation DSH group on mw1258 is CRITICAL: Host mw1258 is not in mediawiki-installation dsh group [16:43:39] PROBLEM - DPKG on mw1258 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [16:43:47] RECOVERY - mediawiki-installation DSH group on mw1250 is OK: OK [16:44:27] RECOVERY - mediawiki-installation DSH group on mw1253 is OK: OK [16:46:22] PROBLEM - puppet last run on mw1258 is CRITICAL: CRITICAL: Puppet has 112 failures [16:46:47] RECOVERY - DPKG on mw1258 is OK: All packages OK [16:55:58] PROBLEM - HHVM rendering on mw1258 is CRITICAL: Connection refused [16:57:00] 
PROBLEM - Host mw1258 is DOWN: PING CRITICAL - Packet loss = 100% [16:58:09] RECOVERY - Host mw1258 is UP: PING OK - Packet loss = 0%, RTA = 1.47 ms [16:58:19] RECOVERY - HHVM rendering on mw1258 is OK: HTTP OK: HTTP/1.1 200 OK - 72150 bytes in 9.222 second response time [16:59:01] RECOVERY - puppet last run on mw1258 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [17:01:26] (03Abandoned) 10BBlack: SNI nginx ssl on varnish boxes at all sites [puppet] - 10https://gerrit.wikimedia.org/r/161180 (owner: 10BBlack) [17:02:58] RECOVERY - mediawiki-installation DSH group on mw1255 is OK: OK [17:04:27] RECOVERY - mediawiki-installation DSH group on mw1254 is OK: OK [17:04:42] (03PS1) 10BBlack: r::c::localssl: monitor based on $certname [puppet] - 10https://gerrit.wikimedia.org/r/175452 [17:06:44] (03CR) 10BBlack: [C: 032] r::c::localssl: monitor based on $certname [puppet] - 10https://gerrit.wikimedia.org/r/175452 (owner: 10BBlack) [17:15:19] PROBLEM - puppet last run on cp4020 is CRITICAL: CRITICAL: puppet fail [17:16:48] PROBLEM - puppet last run on mw1256 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [17:16:48] PROBLEM - HHVM processes on mw1256 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [17:17:18] PROBLEM - HTTPS_unified on cp4005 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match unified.wikimedia.org) [17:17:29] PROBLEM - RAID on mw1256 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [17:17:59] PROBLEM - check configured eth on mw1256 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [17:18:19] PROBLEM - check if dhclient is running on mw1256 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [17:18:29] PROBLEM - check if salt-minion is running on mw1256 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[17:18:39] PROBLEM - mediawiki-installation DSH group on mw1256 is CRITICAL: Host mw1256 is not in mediawiki-installation dsh group [17:18:39] PROBLEM - DPKG on mw1256 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [17:18:49] PROBLEM - nutcracker port on mw1256 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [17:18:49] PROBLEM - Disk space on mw1256 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [17:19:09] PROBLEM - nutcracker process on mw1256 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [17:19:18] PROBLEM - HTTPS_unified on cp4018 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match unified.wikimedia.org) [17:20:09] (03PS1) 10Andrew Bogott: Removed ldap server role from virt1000. [puppet] - 10https://gerrit.wikimedia.org/r/175454 [17:22:20] (03PS1) 10BBlack: r::c::ssl::misc: switch to r::c::localssl like prod SNI [puppet] - 10https://gerrit.wikimedia.org/r/175455 [17:23:36] PROBLEM - HTTPS_unified on cp4012 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match unified.wikimedia.org) [17:23:36] PROBLEM - HTTPS_unified on cp4013 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match unified.wikimedia.org) [17:23:36] PROBLEM - HTTPS_unified on cp4017 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match unified.wikimedia.org) [17:23:46] PROBLEM - HTTPS_unified on cp4016 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match unified.wikimedia.org) [17:23:46] PROBLEM - HTTPS_unified on cp4009 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match unified.wikimedia.org) [17:23:46] PROBLEM - HTTPS_unified on cp4010 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match unified.wikimedia.org) [17:23:55] PROBLEM - HTTPS_unified on cp4002 is 
CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match unified.wikimedia.org) [17:23:55] PROBLEM - HTTPS_unified on cp4011 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match unified.wikimedia.org) [17:23:56] RECOVERY - check if salt-minion is running on mw1256 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [17:24:07] eh? [17:24:15] RECOVERY - RAID on mw1256 is OK: OK: no RAID installed [17:24:25] RECOVERY - Disk space on mw1256 is OK: DISK OK [17:24:25] RECOVERY - nutcracker port on mw1256 is OK: TCP OK - 0.000 second response time on port 11212 [17:24:25] RECOVERY - HHVM processes on mw1256 is OK: PROCS OK: 1 process with command name hhvm [17:24:26] RECOVERY - check configured eth on mw1256 is OK: NRPE: Unable to read output [17:24:35] RECOVERY - nutcracker process on mw1256 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [17:24:36] RECOVERY - check if dhclient is running on mw1256 is OK: PROCS OK: 0 processes with command name dhclient [17:25:35] greg-g: don't worry, it's just a monitoring issue [17:25:46] PROBLEM - HTTPS_unified on cp4015 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match unified.wikimedia.org) [17:26:16] RECOVERY - DPKG on mw1256 is OK: All packages OK [17:28:08] (03PS2) 10Andrew Bogott: Removed ldap server role from virt1000. 
[puppet] - 10https://gerrit.wikimedia.org/r/175454 [17:28:10] (03PS1) 10Andrew Bogott: Switch keystone over to use the new ldap server name, ldap-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/175456 [17:29:11] (03PS1) 10BBlack: Revert "r::c::localssl: monitor based on $certname" [puppet] - 10https://gerrit.wikimedia.org/r/175457 [17:29:19] (03CR) 10Andrew Bogott: [C: 032] Switch keystone over to use the new ldap server name, ldap-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/175456 (owner: 10Andrew Bogott) [17:29:34] (03PS2) 10BBlack: Revert "r::c::localssl: monitor based on $certname" [puppet] - 10https://gerrit.wikimedia.org/r/175457 [17:29:41] (03CR) 10BBlack: [C: 032 V: 032] Revert "r::c::localssl: monitor based on $certname" [puppet] - 10https://gerrit.wikimedia.org/r/175457 (owner: 10BBlack) [17:30:26] PROBLEM - DPKG on mw1256 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:32:36] RECOVERY - DPKG on mw1256 is OK: All packages OK [17:32:36] PROBLEM - puppet last run on palladium is CRITICAL: CRITICAL: puppet fail [17:32:56] PROBLEM - puppet last run on mw1256 is CRITICAL: CRITICAL: Puppet has 7 failures [17:35:21] (03PS1) 10Giuseppe Lavagetto: mediawiki: enable experimental HHVM features on one host [puppet] - 10https://gerrit.wikimedia.org/r/175460 [17:35:36] RECOVERY - puppet last run on cp4020 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:36:55] RECOVERY - puppet last run on palladium is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [17:38:37] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: puppet fail [17:39:37] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [17:39:38] RECOVERY - HTTPS_unified on cp4015 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 422 days) [17:39:57] RECOVERY - HTTPS_unified on 
cp4012 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 422 days) [17:39:57] RECOVERY - HTTPS_unified on cp4013 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 422 days) [17:39:57] RECOVERY - HTTPS_unified on cp4017 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 422 days) [17:39:58] RECOVERY - HTTPS_unified on cp4005 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 422 days) [17:39:58] RECOVERY - HTTPS_unified on cp4016 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 422 days) [17:39:58] RECOVERY - HTTPS_unified on cp4009 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 422 days) [17:39:58] RECOVERY - HTTPS_unified on cp4010 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 422 days) [17:39:58] RECOVERY - HTTPS_unified on cp4002 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 422 days) [17:39:59] RECOVERY - HTTPS_unified on cp4011 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 422 days) [17:40:38] RECOVERY - HTTPS_unified on cp4018 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 422 days) [17:42:00] (03PS2) 10BBlack: r::c::ssl::misc: switch to 
r::c::localssl like prod SNI [puppet] - 10https://gerrit.wikimedia.org/r/175455 [17:43:44] (03PS1) 10Giuseppe Lavagetto: dsh: add more appservers to the groups [puppet] - 10https://gerrit.wikimedia.org/r/175462 [17:44:16] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] dsh: add more appservers to the groups [puppet] - 10https://gerrit.wikimedia.org/r/175462 (owner: 10Giuseppe Lavagetto) [17:44:47] I'm trying to check in puppet-compiler something unrelated and hitting: "Could not find class passwords::phabricator for cp1043.eqiad.wmnet on node cp1043.eqiad.wmnet" ... [17:44:52] what's that all about? [17:45:19] oh wait, it's probably the facts update thing [17:46:47] <_joe_> bblack: no [17:46:52] it probably uses labs/private [17:46:54] <_joe_> it's the puppet-private class [17:46:55] oh wait _joe_ is here [17:47:01] :) [17:47:02] <_joe_> :) [17:47:33] <_joe_> bblack: I fixed it the other day but something must have been lost [17:47:48] sounds like my sanity [17:47:57] <_joe_> lol [17:48:29] <_joe_> bblack: looking into it now [17:49:28] <_joe_> bblack: what run is that? [17:49:32] <_joe_> the class is there [17:49:44] http://puppet-compiler.wmflabs.org/525/change/175455/html/cp1043.eqiad.wmnet.html [17:52:07] <_joe_> bblack: mmmh I don't understand. the class is there [17:53:36] oh I think I see the problem, maybe [17:53:41] <_joe_> bblack: also, it only happens on production?? [17:54:03] I thought "class admin" used to "include standard"? 
we have some nodes that just include admin but not standard on that assumption [17:54:40] I donno about what I just said, but it's something I'm looking into history on now [17:54:51] <_joe_> bblack: no it's not that [17:55:07] <_joe_> give me 10 minutes to understand what the heck is happening here :) [17:55:41] <_joe_> (the changed manifest is compiled btw) [17:57:42] well maybe it's unrelated, but there's definitely something odd about the relationship between classes base, admin, and standard that I think is new since I last looked at them, and I suspect it means base/standard aren't being applied to cache nodes [17:58:05] (which only "include admin" of the 3) [17:58:07] <_joe_> bblack: uh, seriously? [17:58:19] <_joe_> I thought standard and base are included in the roles [17:58:26] maybe [17:58:39] they're also included directly in the node def in some cases, but not others [17:58:43] <_joe_> http://puppet-compiler.wmflabs.org/525/change/175455/compiled/puppet_catalogs_3_175455/cp1043.eqiad.wmnet.pson [17:58:50] <_joe_> looks like they're there [17:59:21] yeah seems so, via role::cache::foo [17:59:42] admin is the oddball as it has to be defined at node level [17:59:45] the others idk [17:59:46] anyways, I guess my confusion comes from the inconsistency of how this is laid out [17:59:47] <_joe_> so what just happened is plainly weird [18:00:29] "standard" is defined in site.pp, sometimes included in the node in site.pp even if it has a role, sometimes not. and then deep down in other role modules, sometimes they "include standard" and sometimes not? 
[18:00:46] yes standard is a mess that way [18:00:57] (and standard includes base, but sometimes base is explicit as well) [18:01:58] <_joe_> bblack: it's not clear to me how compilation failed on the prod branch, I'm setting it up again [18:02:27] ok [18:02:30] the assignment of standard and base is wildly inconsistent I think, and yeah it's not good [18:02:49] just wanted to chime in that admin is a legit oddball as it needs to be at node level [18:02:49] for stupid puppet reasons [18:03:03] at least for the moment [18:03:15] <_joe_> chasemp: admin can be included, and then we can use hiera to define its parameters [18:03:22] <_joe_> it's already done on the newer appservers [18:03:34] yup, just trying to help :) [18:03:38] all good even [18:03:59] <_joe_> bblack: labs/private being not public is preventing the puppet-compiler from working [18:04:05] <_joe_> I am re-publishing it [18:04:18] <_joe_> and there goes my pre-meeting break :( [18:04:58] I thought labs/private used to be public? 
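The inconsistency bblack and _joe_ are chasing can be sketched schematically. These node blocks are hypothetical (the node/role names are illustrative, not copied from site.pp), but they show the two competing patterns: explicit inclusion in the node definition versus trusting the role class to pull in standard transitively:

```puppet
# Pattern 1: node def pulls everything in explicitly.
node 'mw1099.eqiad.wmnet' {
    include admin      # has to be at node level (or parameterized via hiera)
    include standard   # standard itself includes base
    include role::mediawiki::appserver
}

# Pattern 2: node def only includes admin and the role; standard/base
# arrive only if the role class happens to 'include standard' somewhere
# deep inside -- the assumption the cache nodes were relying on.
node 'cp1043.eqiad.wmnet' {
    include admin
    include role::cache::text
}
```

Either pattern compiles; the problem is that mixing them makes it hard to tell by inspection whether any given node actually gets base/standard applied.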
[18:05:04] <_joe_> it was [18:05:13] we have upgraded to puppet 3 since then so I think the huge jankyness that made that necessary may have subsided [18:05:28] https://github.com/wikimedia/labs-private [18:05:34] admin at node level but it would require a lot of poking to be sure, puppet 2.7 had some undesirableness [18:05:38] <_joe_> bblack: I know [18:05:42] :) [18:05:47] <_joe_> I sent an email to ops@ last week [18:06:01] <_joe_> since noone claimed why this was useful, I'm gonna revert that [18:06:30] ah I see the thread now [18:07:06] (03PS1) 10BBlack: Turn on r::c::ssl::sni locall for varnishes [puppet] - 10https://gerrit.wikimedia.org/r/175464 [18:07:08] (03PS1) 10BBlack: Switch LVS to use localssl at all sites [puppet] - 10https://gerrit.wikimedia.org/r/175465 [18:07:10] (03PS1) 10BBlack: remove legacy protoproxy config [puppet] - 10https://gerrit.wikimedia.org/r/175466 [18:09:40] greg-g: just checking--we should reserve a deployment window to make config changes, true? [18:10:15] awight: T [18:10:23] thanks! [18:10:35] :) [18:12:37] (03CR) 10Ori.livneh: [C: 031] mediawiki: adjust hhvm max threads to number of cpus as well [puppet] - 10https://gerrit.wikimedia.org/r/175424 (owner: 10Giuseppe Lavagetto) [18:15:16] greg-g: ok I've reserved 14:00-15:00 today for a CentralNotice config change. We're pretty confident it won't have side-effects, since it's already deployed and tested on mediawiki.org and betalabs... 
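The HTTPS_unified flaps earlier in the log ([17:17] through [17:40]) are the check comparing the served certificate's CN against the expected hostname, not a bad certificate, which is why paravoid calls it "just a monitoring issue". A deliberately loose sketch of that comparison (real validation follows RFC 6125 and restricts the wildcard to a single left-most label; fnmatch is looser but enough to show why the alert fires):

```python
from fnmatch import fnmatch

def cn_matches(cn: str, hostname: str) -> bool:
    """Simplified wildcard-CN check. Real TLS hostname validation
    (RFC 6125) is stricter; this only illustrates the mismatch that
    produced the HTTPS_unified CRITICALs above."""
    return fnmatch(hostname.lower(), cn.lower())

# *.wikipedia.org covers en.wikipedia.org ...
assert cn_matches("*.wikipedia.org", "en.wikipedia.org")
# ... but not unified.wikimedia.org -- exactly the reported mismatch.
assert not cn_matches("*.wikipedia.org", "unified.wikimedia.org")
```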
[18:16:11] * greg-g nods [18:16:17] awight: cool, godspeed [18:16:47] :) [18:17:14] what's the worst that could happen \o/ [18:18:47] RECOVERY - mediawiki-installation DSH group on mw1256 is OK: OK [18:19:51] (03PS1) 10Rush: phab run user updates once daily [puppet] - 10https://gerrit.wikimedia.org/r/175470 [18:20:02] awight: every time someone says that now, I think of this movie scene: http://www.youtube.com/watch?v=20KJhBX9xtE [18:20:51] (03CR) 10Rush: [C: 031] bugzilla: delete bugzilla.wikiPedia.org [puppet] - 10https://gerrit.wikimedia.org/r/175139 (owner: 10Dzahn) [18:21:09] (03CR) 10Rush: [C: 031] bugzilla: delete bugs.wikipedia.org vhost [puppet] - 10https://gerrit.wikimedia.org/r/175136 (owner: 10Dzahn) [18:22:14] bblack: thanks! I think I'll just sign up for the T-shirt, now. [18:24:47] RECOVERY - mediawiki-installation DSH group on mw1257 is OK: OK [18:26:48] (03CR) 10Rush: [C: 032] phab run user updates once daily [puppet] - 10https://gerrit.wikimedia.org/r/175470 (owner: 10Rush) [18:30:19] (03CR) 10Dzahn: [C: 032] bugzilla: delete bugs.wikipedia.org vhost [puppet] - 10https://gerrit.wikimedia.org/r/175136 (owner: 10Dzahn) [18:42:00] (03Abandoned) 10Alexandros Kosiaris: Allocate codfw Labs networks [dns] - 10https://gerrit.wikimedia.org/r/174732 (owner: 10Alexandros Kosiaris) [18:43:37] RECOVERY - mediawiki-installation DSH group on mw1258 is OK: OK [18:54:20] (03PS1) 10Ottomata: Add marktraceur to researchers grouo [puppet] - 10https://gerrit.wikimedia.org/r/175486 [18:57:55] chasemp: something on iridium (phab?) is hitting virt1000 ldap occasionally. Can you dig in and see if there are refs to virt1000 or wikitech somewhere in that setup? (virt1000's ldap server is deprecated, I'm about to switch it off.) [18:58:15] (03PS3) 10Ori.livneh: Allow host-specific HHVM config overrides via Hiera [puppet] - 10https://gerrit.wikimedia.org/r/175425 (owner: 10Giuseppe Lavagetto) [18:58:33] there are [18:58:37] ^ _joe_ [18:58:39] what should I change it too? 
[18:58:47] andrewbogott: ^ [18:58:56] Hey opsy people [18:59:10] Is there any way to look at what files exist on the Swift backend somehow [18:59:25] Because I have a file that *should* be in the stash, that MediaWiki is telling me doesn't exist. [18:59:40] (03CR) 10Ottomata: [C: 032] Add marktraceur to researchers grouo [puppet] - 10https://gerrit.wikimedia.org/r/175486 (owner: 10Ottomata) [18:59:47] chasemp: primary ldap-eqiad.wikimedia.org, secondary, ldap-codfw.wikimedia.org [18:59:55] andrewbogott: https://gerrit.wikimedia.org/r/#/c/175153/ was for you! :P [19:00:04] marktraceur: the Ops weekly meeting just started, so you picked a bad time to ask :) [19:00:05] k I'll test post call (have another 1 right after ours) [19:00:10] and if cool then cool I'll let you know [19:00:15] KK no problem [19:00:16] ori: I saw that, thank you! Will read soon [19:00:38] someone know where the phabricator api is located on mw? /api/ doesn't exist there [19:00:49] marktraceur: eval.php? :) [19:00:50] (03PS2) 10Ori.livneh: hhvm: set nofile limit to unlimited [puppet] - 10https://gerrit.wikimedia.org/r/175391 [19:01:12] marktraceur: we may have the (python) swift client somewhere (accessible to you) installed, I'm not sure [19:01:26] Hm. [19:01:39] The weird thing is [19:01:45] It shows up in Special:UploadStash [19:01:47] Steinsplitter: http://phabricator.wikimedia.org/conduit/ [19:01:52] But then throws errors when UW calls the API [19:01:54] Very odd. [19:02:08] legoktm: thx :) [19:02:09] Steinsplitter: what are you trying to do? -devtools might be a better channel [19:02:20] andrewbogott: changed I think, seems good! [19:02:31] (figured why not) [19:03:30] legoktm: thanks :), trying to write a script for automatic listing of recent commons-related bugs [19:04:00] Steinsplitter: did you try using phabricator's query thing first? 
[19:04:25] https://phabricator.wikimedia.org/maniphest/ "Edit Query" [19:05:02] legoktm: i can't get a json (or similar) output there [19:05:23] Steinsplitter: you can use https://github.com/legoktm/fab to wrap around the API [19:06:01] Steinsplitter: or if you want a stream of changes, we have one on tool labs in redis [19:06:31] ok [19:06:42] will look at the api thing, thanks [19:06:51] chasemp: looks better so far, I'll keep my eye on the log for a while [19:06:55] k [19:19:16] Hmm [19:19:24] https://old-bugzilla.wikimedia.org/show_bug.cgi?id=56659 is now an SSL error [19:19:30] Because the cert is still for bugzilla.wm.o [19:19:48] known issue, on ops meeting agenda [19:20:12] more than likely to move behind misc-web later today [19:20:24] (we're in the meeting now ;) [19:22:08] (03CR) 10Alexandros Kosiaris: "Hello, yes I assigned the bug to myself and will work on it soon." [debs/librsvg] - 10https://gerrit.wikimedia.org/r/173639 (owner: 10Ebrahim) [19:27:04] Cool [19:27:10] Just making sure it's on the radar :) [19:28:47] RoanKattouw, third item at top of https://www.mediawiki.org/wiki/Phabricator/versus_Bugzilla [19:28:52] "Don't be surprised if your browser warns you about old-bugzilla being insecure or that "your connection is not private" while we get the certificate to work." [19:31:53] (03CR) 10Greg Grossmeier: "What's blocking this from being merged?" 
[puppet] - 10https://gerrit.wikimedia.org/r/158016 (https://bugzilla.wikimedia.org/70181) (owner: 10Dduvall) [19:33:43] (03PS9) 10BryanDavis: beta: varnish backend/director for isolated security audits [puppet] - 10https://gerrit.wikimedia.org/r/158016 (https://bugzilla.wikimedia.org/70181) (owner: 10Dduvall) [19:34:07] (03CR) 10BryanDavis: [C: 031] beta: varnish backend/director for isolated security audits [puppet] - 10https://gerrit.wikimedia.org/r/158016 (https://bugzilla.wikimedia.org/70181) (owner: 10Dduvall) [19:41:32] (03PS2) 10Tim Landscheidt: fwconfigtool: Fix pyflakes warnings [software] - 10https://gerrit.wikimedia.org/r/169252 [19:43:51] (03PS3) 10Tim Landscheidt: fwconfigtool: Fix pyflakes warnings [software] - 10https://gerrit.wikimedia.org/r/169252 [19:48:48] (03PS1) 10Ori.livneh: Add 'apache-status' shell script [puppet] - 10https://gerrit.wikimedia.org/r/175497 [19:48:50] (03PS1) 10Ori.livneh: Update ori dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/175498 [19:50:06] (03CR) 10Ori.livneh: [C: 032] Add 'apache-status' shell script [puppet] - 10https://gerrit.wikimedia.org/r/175497 (owner: 10Ori.livneh) [19:50:15] (03CR) 10Ori.livneh: [C: 032] Update ori dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/175498 (owner: 10Ori.livneh) [19:51:33] i got kicked out! calling the line [19:52:33] in. [19:53:36] (03CR) 10Jforrester: "Does this work well enough yet to deploy? Last I heard there were several big issues so it was deferred." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175406 (owner: 10Glaisher) [19:57:45] jgage: i'm not working tomorrow, so I won't be at our analytics/ops check in [19:59:59] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [20:00:10] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet).
[20:01:05] cmjohnson: are you on duty now? [20:01:11] i am [20:04:43] (03PS1) 10Ori.livneh: Fixes for Varnishkafka gmond module [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/175499 [20:04:46] ottomata: ^ [20:07:01] cmjohnson: topic updaaaated now yer screwed! [20:07:10] kiss your free time goodbye [20:07:13] heh..gee thx [20:07:19] you know, that time you used to sleep. [20:09:51] (03CR) 10Ori.livneh: [C: 032] hhvm: set nofile limit to unlimited [puppet] - 10https://gerrit.wikimedia.org/r/175391 (owner: 10Ori.livneh) [20:10:00] awesome ori, thanks! i'm going for a short walk after my day of meetings, will merge that when I get back. [20:10:18] ottomata: cool [20:10:38] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [20:10:39] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [20:12:57] PROBLEM - Apache HTTP on mw1248 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.011 second response time [20:12:58] PROBLEM - puppet last run on mw1163 is CRITICAL: CRITICAL: Puppet has 1 failures [20:12:58] PROBLEM - HHVM rendering on mw1248 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.014 second response time [20:12:58] PROBLEM - HHVM processes on mw1163 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm [20:13:08] PROBLEM - Apache HTTP on mw1053 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.019 second response time [20:13:17] PROBLEM - puppet last run on mw1023 is CRITICAL: CRITICAL: Puppet has 1 failures [20:13:17] PROBLEM - HHVM rendering on mw1029 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.018 second response time [20:13:18] PROBLEM - HHVM processes on mw1053 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm [20:13:18] PROBLEM - HHVM rendering on mw1053 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350
bytes in 0.027 second response time [20:13:18] PROBLEM - HHVM rendering on mw1163 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.011 second response time [20:13:18] PROBLEM - HHVM rendering on mw1023 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.021 second response time [20:13:18] PROBLEM - HHVM processes on mw1248 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm [20:13:27] PROBLEM - Apache HTTP on mw1023 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.017 second response time [20:13:28] PROBLEM - puppet last run on mw1248 is CRITICAL: CRITICAL: Puppet has 1 failures [20:13:37] PROBLEM - puppet last run on mw1053 is CRITICAL: CRITICAL: Puppet has 1 failures [20:13:39] PROBLEM - Apache HTTP on mw1163 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.006 second response time [20:13:40] PROBLEM - HHVM rendering on mw1239 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.006 second response time [20:13:40] PROBLEM - Apache HTTP on mw1029 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.022 second response time [20:13:47] PROBLEM - HHVM processes on mw1023 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm [20:13:47] PROBLEM - Apache HTTP on mw1239 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.013 second response time [20:13:48] PROBLEM - HHVM processes on mw1029 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm [20:13:57] PROBLEM - HHVM processes on mw1239 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm [20:13:57] PROBLEM - puppet last run on mw1029 is CRITICAL: CRITICAL: Puppet has 1 failures [20:13:57] PROBLEM - HHVM processes on mw1229 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm [20:14:19] wtf [20:14:21] i'll revert [20:14:28] PROBLEM - HHVM rendering on mw1032 is 
CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.012 second response time [20:14:34] (03PS1) 10Ori.livneh: Revert "hhvm: set nofile limit to unlimited" [puppet] - 10https://gerrit.wikimedia.org/r/175500 [20:14:38] PROBLEM - puppet last run on mw1022 is CRITICAL: CRITICAL: Puppet has 1 failures [20:14:38] PROBLEM - puppet last run on mw1239 is CRITICAL: CRITICAL: Puppet has 1 failures [20:14:46] (03CR) 10Ori.livneh: [C: 032 V: 032] Revert "hhvm: set nofile limit to unlimited" [puppet] - 10https://gerrit.wikimedia.org/r/175500 (owner: 10Ori.livneh) [20:14:47] PROBLEM - Apache HTTP on mw1022 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.010 second response time [20:14:48] PROBLEM - HHVM processes on mw1032 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm [20:14:57] PROBLEM - HHVM processes on mw1022 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm [20:14:57] PROBLEM - puppet last run on mw1032 is CRITICAL: CRITICAL: Puppet has 1 failures [20:14:58] PROBLEM - Apache HTTP on mw1032 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.016 second response time [20:14:58] PROBLEM - puppet last run on mw1229 is CRITICAL: CRITICAL: Puppet has 1 failures [20:15:18] PROBLEM - HHVM rendering on mw1022 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.018 second response time [20:15:18] PROBLEM - HHVM rendering on mw1229 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.005 second response time [20:15:19] PROBLEM - Apache HTTP on mw1229 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.008 second response time [20:15:19] PROBLEM - HHVM rendering on mw1024 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.016 second response time [20:15:37] PROBLEM - HHVM processes on mw1024 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm 
[20:15:38] PROBLEM - Apache HTTP on mw1024 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.009 second response time [20:15:50] PROBLEM - puppet last run on mw1024 is CRITICAL: CRITICAL: Puppet has 1 failures [20:16:18] PROBLEM - Apache HTTP on mw1231 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.003 second response time [20:16:18] PROBLEM - HHVM processes on mw1231 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm [20:16:47] PROBLEM - puppet last run on mw1027 is CRITICAL: CRITICAL: Puppet has 1 failures [20:16:48] RECOVERY - Apache HTTP on mw1029 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.069 second response time [20:16:57] RECOVERY - HHVM processes on mw1023 is OK: PROCS OK: 1 process with command name hhvm [20:16:57] PROBLEM - HHVM rendering on mw1231 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.011 second response time [20:16:57] RECOVERY - Apache HTTP on mw1022 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.082 second response time [20:16:57] PROBLEM - puppet last run on mw1231 is CRITICAL: CRITICAL: Puppet has 1 failures [20:16:57] RECOVERY - HHVM processes on mw1029 is OK: PROCS OK: 1 process with command name hhvm [20:16:58] RECOVERY - HHVM processes on mw1032 is OK: PROCS OK: 1 process with command name hhvm [20:16:58] RECOVERY - HHVM processes on mw1022 is OK: PROCS OK: 1 process with command name hhvm [20:17:08] PROBLEM - HHVM rendering on mw1236 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.028 second response time [20:17:09] RECOVERY - Apache HTTP on mw1032 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.092 second response time [20:17:17] RECOVERY - HHVM processes on mw1163 is OK: PROCS OK: 1 process with command name hhvm [20:17:22] _joe_: that was my screw-up (got the upstart config line wrong). fixed. sorry. 
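Editor's note: the outage above came from a bad upstart limit stanza for HHVM's nofile (file-descriptor) limit; the exact broken line isn't in the log. Before rolling out a limit change, it helps to check what a process actually inherits. A sketch in Python:

```python
# Inspect the nofile (RLIMIT_NOFILE) soft/hard limits of the current process.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"nofile soft={soft} hard={hard}")

# upstart's "limit nofile unlimited unlimited" maps to RLIM_INFINITY; a daemon
# that sizes tables from the soft limit can misbehave badly when it sees that.
if soft == resource.RLIM_INFINITY:
    print("soft nofile limit is unlimited")
```

Running it inside the service's own environment (not a login shell, which may apply different limits) shows the value the daemon really got.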
[20:17:23] PROBLEM - Apache HTTP on mw1241 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.013 second response time [20:17:23] RECOVERY - Apache HTTP on mw1053 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.138 second response time [20:17:29] RECOVERY - HHVM rendering on mw1022 is OK: HTTP OK: HTTP/1.1 200 OK - 72305 bytes in 0.220 second response time [20:17:37] PROBLEM - HHVM processes on mw1241 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm [20:17:38] RECOVERY - HHVM rendering on mw1029 is OK: HTTP OK: HTTP/1.1 200 OK - 72305 bytes in 0.257 second response time [20:17:38] RECOVERY - HHVM rendering on mw1024 is OK: HTTP OK: HTTP/1.1 200 OK - 72305 bytes in 0.240 second response time [20:17:38] RECOVERY - HHVM processes on mw1053 is OK: PROCS OK: 1 process with command name hhvm [20:17:38] RECOVERY - HHVM rendering on mw1053 is OK: HTTP OK: HTTP/1.1 200 OK - 72305 bytes in 0.188 second response time [20:17:39] PROBLEM - Apache HTTP on mw1236 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.005 second response time [20:17:39] RECOVERY - HHVM rendering on mw1163 is OK: HTTP OK: HTTP/1.1 200 OK - 72305 bytes in 0.231 second response time [20:17:39] RECOVERY - HHVM rendering on mw1023 is OK: HTTP OK: HTTP/1.1 200 OK - 72305 bytes in 0.200 second response time [20:17:39] PROBLEM - puppet last run on mw1236 is CRITICAL: CRITICAL: Puppet has 1 failures [20:17:40] PROBLEM - HHVM processes on mw1236 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm [20:17:40] RECOVERY - HHVM rendering on mw1032 is OK: HTTP OK: HTTP/1.1 200 OK - 72305 bytes in 0.190 second response time [20:17:47] RECOVERY - Apache HTTP on mw1023 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.055 second response time [20:17:48] PROBLEM - HHVM rendering on mw1241 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.035 second response time [20:17:48] 
RECOVERY - HHVM processes on mw1024 is OK: PROCS OK: 1 process with command name hhvm [20:17:59] RECOVERY - Apache HTTP on mw1163 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.055 second response time [20:17:59] RECOVERY - Apache HTTP on mw1024 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.056 second response time [20:17:59] PROBLEM - puppet last run on mw1241 is CRITICAL: CRITICAL: Puppet has 1 failures [20:18:25] mw1241 is a new install, not related [20:18:47] (03CR) 10BBlack: "I'd like to pipe this through the catalog compiler one more time for a few representative prod caches just to double-check before merge, b" [puppet] - 10https://gerrit.wikimedia.org/r/158016 (https://bugzilla.wikimedia.org/70181) (owner: 10Dduvall) [20:19:03] <_joe_> ori: just arrived, should I do something? [20:19:08] <_joe_> I was having dinner [20:19:43] _joe_: it's fine, i was an idiot but i reverted and fixed it [20:24:46] (03Abandoned) 10Dzahn: bugzilla: install old-bugzilla SSL cert [puppet] - 10https://gerrit.wikimedia.org/r/175162 (owner: 10Dzahn) [20:26:35] (03PS3) 10Dzahn: bugzilla: delete bugzilla.wikiPedia.org [puppet] - 10https://gerrit.wikimedia.org/r/175139 [20:27:16] (03CR) 10Dzahn: [C: 032] "will still work, redirect is in cluster Apache config" [puppet] - 10https://gerrit.wikimedia.org/r/175139 (owner: 10Dzahn) [20:30:58] RECOVERY - puppet last run on mw1029 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [20:30:58] RECOVERY - Apache HTTP on mw1248 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.345 second response time [20:31:07] RECOVERY - HHVM rendering on mw1248 is OK: HTTP OK: HTTP/1.1 200 OK - 72305 bytes in 0.431 second response time [20:31:07] RECOVERY - puppet last run on mw1163 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [20:31:18] RECOVERY - puppet last run on mw1023 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 
failures [20:31:27] RECOVERY - HHVM processes on mw1248 is OK: PROCS OK: 1 process with command name hhvm [20:31:37] RECOVERY - puppet last run on mw1248 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [20:31:47] RECOVERY - puppet last run on mw1053 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [20:31:47] RECOVERY - HHVM rendering on mw1239 is OK: HTTP OK: HTTP/1.1 200 OK - 72305 bytes in 0.659 second response time [20:31:48] RECOVERY - puppet last run on mw1239 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [20:31:48] RECOVERY - Apache HTTP on mw1239 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.056 second response time [20:31:58] RECOVERY - HHVM processes on mw1239 is OK: PROCS OK: 1 process with command name hhvm [20:32:59] RECOVERY - puppet last run on mw1032 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [20:32:59] RECOVERY - HHVM processes on mw1229 is OK: PROCS OK: 1 process with command name hhvm [20:33:07] RECOVERY - puppet last run on mw1229 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [20:33:18] RECOVERY - Apache HTTP on mw1231 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.100 second response time [20:33:18] RECOVERY - HHVM rendering on mw1229 is OK: HTTP OK: HTTP/1.1 200 OK - 72305 bytes in 0.711 second response time [20:33:19] RECOVERY - HHVM processes on mw1231 is OK: PROCS OK: 1 process with command name hhvm [20:33:28] RECOVERY - Apache HTTP on mw1229 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.040 second response time [20:33:48] RECOVERY - puppet last run on mw1022 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:33:48] RECOVERY - HHVM rendering on mw1231 is OK: HTTP OK: HTTP/1.1 200 OK - 72305 bytes in 0.404 second response time [20:33:57] RECOVERY - puppet last run on mw1024 is OK: OK: Puppet is 
currently enabled, last run 42 seconds ago with 0 failures [20:33:58] RECOVERY - puppet last run on mw1231 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [20:34:17] RECOVERY - Apache HTTP on mw1241 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.357 second response time [20:34:28] RECOVERY - HHVM processes on mw1241 is OK: PROCS OK: 1 process with command name hhvm [20:34:48] RECOVERY - HHVM rendering on mw1241 is OK: HTTP OK: HTTP/1.1 200 OK - 72305 bytes in 0.411 second response time [20:35:08] RECOVERY - HHVM rendering on mw1236 is OK: HTTP OK: HTTP/1.1 200 OK - 72305 bytes in 0.158 second response time [20:35:08] RECOVERY - puppet last run on mw1241 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:35:28] RECOVERY - Apache HTTP on mw1236 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.096 second response time [20:35:38] RECOVERY - puppet last run on mw1236 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [20:35:38] RECOVERY - HHVM processes on mw1236 is OK: PROCS OK: 1 process with command name hhvm [20:35:57] RECOVERY - puppet last run on mw1027 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [20:47:39] (03CR) 10Jaredzimmerman: [C: 04-1] "We're waiting on the remaining controls, radio, combo, and drop down, Prateek finished most of them last week, he can give an ETA for the " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175406 (owner: 10Glaisher) [20:57:29] (03CR) 10Hashar: "We already have a Jenkins job that execute a script in modules/admin/data, so I would just add your command to it." [puppet] - 10https://gerrit.wikimedia.org/r/175442 (owner: 10ArielGlenn) [20:58:33] (03CR) 10Dzahn: "great idea to make this thing a jenkins plugin" [puppet] - 10https://gerrit.wikimedia.org/r/175442 (owner: 10ArielGlenn) [21:00:04] gwicke, cscott, arlolra, subbu: Dear anthropoid, the time has come. 
Please deploy Parsoid/OCG (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141124T2100). [21:17:48] is git.wikimedia.org broken again? [21:21:53] ori, yt? this looks great. tested and ready to merge? [21:22:00] !log disabled ldap replication on virt1000 [21:22:04] Logged the message, Master [21:22:34] (03CR) 10Andrew Bogott: [C: 032] Removed ldap server role from virt1000. [puppet] - 10https://gerrit.wikimedia.org/r/175454 (owner: 10Andrew Bogott) [21:23:41] !log stopping opendj service on virt1000 [21:23:44] Logged the message, Master [21:24:17] it is jackmcbarn [21:24:49] let me give it a kick [21:26:39] !log restarting opendj on labcontrol2001 and neptunium [21:26:41] Logged the message, Master [21:26:47] !log restarting pdns on virt1000 and labcontrol2001 [21:26:48] PROBLEM - LDAP on virt1000 is CRITICAL: Connection refused [21:26:51] Logged the message, Master [21:27:07] PROBLEM - LDAPS on virt1000 is CRITICAL: Connection refused [21:27:43] ottomata: yep [21:27:56] ottomata: well, you should review it [21:28:47] ori, yeah, i just read through it [21:28:53] looks good to me, but i haven't tested it [21:28:59] PROBLEM - git.wikimedia.org on antimony is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 516 bytes in 0.401 second response time [21:29:19] ACKNOWLEDGEMENT - LDAP on virt1000 is CRITICAL: Connection refused andrew bogott virt1000 is no longer an ldap server. These checks will be removed shortly. [21:29:19] ACKNOWLEDGEMENT - LDAPS on virt1000 is CRITICAL: Connection refused andrew bogott virt1000 is no longer an ldap server. These checks will be removed shortly. [21:29:40] ottomata: i tested it, see pm [21:31:17] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 58828 bytes in 0.409 second response time [21:32:40] !log restarted gitblit on antimony [21:32:43] Logged the message, Master [21:32:45] jackmcbarn: is back up. [21:32:51] thanks [21:33:53] jackmcbarn: yw! 
[21:37:41] * matanya is looking for a Reedy [21:37:55] It's a TRAP [21:38:06] (03PS1) 10RobH: setting up a temp scs-c8-codfw setting up a temp scs-c8-codfw [dns] - 10https://gerrit.wikimedia.org/r/175547 [21:38:55] (03CR) 10Matanya: "why is the commit message duplicate?" [dns] - 10https://gerrit.wikimedia.org/r/175547 (owner: 10RobH) [21:39:00] robh: grrrit-wm loves making your commits sound weird :p [21:39:21] matanya: because i dont care. [21:39:26] +1 [21:39:27] honesty. [21:39:42] (03CR) 10RobH: [C: 032] setting up a temp scs-c8-codfw setting up a temp scs-c8-codfw [dns] - 10https://gerrit.wikimedia.org/r/175547 (owner: 10RobH) [21:39:44] heh [21:39:49] (03PS2) 10Ori.livneh: Fixes for Varnishkafka gmond module [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/175499 [21:39:51] its a temp change that'll go away [21:40:10] awwww frick [21:40:20] i should have cared more, i have a typo. [21:40:24] Broke stuff already? :p [21:40:25] luckily, i no merge. [21:40:44] yup, it bites you in the ass if you don't care :P [21:41:15] (03PS1) 10RobH: setting scs-c8-codfw [dns] - 10https://gerrit.wikimedia.org/r/175549 [21:42:37] (03CR) 10BBlack: [C: 031] varnish: remove redirection to the hhvm pool [puppet] - 10https://gerrit.wikimedia.org/r/175432 (owner: 10Giuseppe Lavagetto) [21:43:17] (03PS3) 10Ori.livneh: Fixes for Varnishkafka gmond module [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/175499 [21:43:46] (03PS1) 10Dr0ptp4kt: Make mdot webroot redirects agree with W0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175550 [21:44:52] bblack, yurikR, MaxSem , mind reviewing ^ ? yurikR, dfoy said he was gonna give you a heads up about this. this is that thing with that operator getting redirected to english instead of french. [21:45:17] dr0ptp4kt, yep, dfoy emailed me [21:46:06] yurikR: thx. i was thinking to do it extension side, but that won't work. mobilelanding.php will override whatever the extension sends.
and i don't know that we want to, at least not yet, delegate all cache-control responsibilities to zb [21:46:12] (for the mdot webroot, anyway) [21:49:32] dr0ptp4kt / yurikR : while I'm thinking about related things, re: our earlier conversations about expanding X-CS and future directions beyond that: does any of this imply having public https hostnames of the form *.zero.$project.org (where currently we only do that for $project=wikipedia)? or would it all be within *.m.$project.org? [21:50:21] bblack, we won't have any *.zero.* except wikipedia [21:50:40] we are trying to get rid of [21:50:41] it [21:50:41] (03PS4) 10Ori.livneh: Fixes for Varnishkafka gmond module [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/175499 [21:50:54] I know we don't *now*, but the question is whether any of the future scope being currently considered would add them. [21:51:17] (because that would have implications for future certificate purchases) [21:51:33] we don't want to either [21:51:53] we are trying to get rid of it, but we can't due to contracts [21:52:42] (03CR) 10Ottomata: [C: 032] Fixes for Varnishkafka gmond module [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/175499 (owner: 10Ori.livneh) [21:52:51] (03CR) 10Ottomata: [V: 032] Fixes for Varnishkafka gmond module [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/175499 (owner: 10Ori.livneh) [21:53:08] bblack: no need for zerodot on those other projects. i agree with yurikR and dfoy just said no need. w00t [21:53:20] ok, thanks [21:53:45] (03PS1) 10Ori.livneh: Update Varnishkafka module to 3cdb6d5 [puppet] - 10https://gerrit.wikimedia.org/r/175553 [21:53:46] ottomata: ^ [21:53:50] ha you are faster than me!
[21:54:04] (03PS2) 10Ottomata: Update Varnishkafka module to 3cdb6d5 [puppet] - 10https://gerrit.wikimedia.org/r/175553 (owner: 10Ori.livneh) [21:54:39] (03CR) 10Ottomata: [C: 032 V: 032] Update Varnishkafka module to 3cdb6d5 [puppet] - 10https://gerrit.wikimedia.org/r/175553 (owner: 10Ori.livneh) [21:55:51] (03PS1) 10Awight: Enable CentralNotice client banner choice everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175556 [21:56:20] (03CR) 10BBlack: [C: 031] Make mdot webroot redirects agree with W0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175550 (owner: 10Dr0ptp4kt) [21:59:17] (03PS1) 10Ottomata: Fix for duplicate exec definition [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/175558 [21:59:54] (03CR) 10Ottomata: [C: 032 V: 032] Fix for duplicate exec definition [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/175558 (owner: 10Ottomata) [22:00:04] awight, AndyRussG, ejegg: Dear anthropoid, the time has come. Please deploy CentralNotice (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141124T2200). 
[22:00:41] PROBLEM - puppet last run on cp1056 is CRITICAL: CRITICAL: puppet fail [22:00:45] (03PS1) 10Ottomata: Update varnishkafka module with fix for duplicate exec [puppet] - 10https://gerrit.wikimedia.org/r/175559 [22:00:47] ^ that's us [22:00:53] (03CR) 10jenkins-bot: [V: 04-1] Update varnishkafka module with fix for duplicate exec [puppet] - 10https://gerrit.wikimedia.org/r/175559 (owner: 10Ottomata) [22:00:56] psh [22:02:11] PROBLEM - puppet last run on cp1069 is CRITICAL: CRITICAL: puppet fail [22:02:17] (03PS2) 10Ottomata: Update varnishkafka module with fix for duplicate exec [puppet] - 10https://gerrit.wikimedia.org/r/175559 [22:02:41] PROBLEM - puppet last run on cp3022 is CRITICAL: CRITICAL: puppet fail [22:02:49] (03CR) 10Ori.livneh: [C: 032 V: 032] Update varnishkafka module with fix for duplicate exec [puppet] - 10https://gerrit.wikimedia.org/r/175559 (owner: 10Ottomata) [22:03:00] ha, i was waiting for jenkins :) [22:03:22] PROBLEM - puppet last run on cp1057 is CRITICAL: CRITICAL: puppet fail [22:03:52] PROBLEM - puppet last run on cp3020 is CRITICAL: CRITICAL: puppet fail [22:04:30] RECOVERY - puppet last run on cp1057 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [22:04:53] RECOVERY - puppet last run on cp1056 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [22:05:12] PROBLEM - puppet last run on cp1070 is CRITICAL: CRITICAL: puppet fail [22:06:55] Sorry if this has been rehashed a hundred times--but I notice that Zuul is doing both the test and gate-and-submit jobs simultaneously, for #175560 and 175561. [22:07:16] Those are redundant... Once I CR+2, the test jobs should be canceled, dontcha think?
[22:07:22] RECOVERY - puppet last run on cp1070 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [22:07:42] (03PS1) 10Ori.livneh: Don't write empty file [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/175562 [22:07:54] (03CR) 10MaxSem: [C: 031] "Makes sense to me. must-revalidate doesn't seem to be needed because it's in case the resource gets purged which never happens on the land" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175550 (owner: 10Dr0ptp4kt) [22:09:15] (03PS2) 10BBlack: Turn on r::c::ssl::sni locally for varnishes [puppet] - 10https://gerrit.wikimedia.org/r/175464 [22:09:17] (03PS2) 10BBlack: Switch LVS to use localssl at all sites [puppet] - 10https://gerrit.wikimedia.org/r/175465 [22:09:19] (03PS2) 10BBlack: remove legacy protoproxy config [puppet] - 10https://gerrit.wikimedia.org/r/175466 [22:10:03] (03PS1) 10EBernhardson: Disable LQT on office wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175563 [22:10:46] !log awight Synchronized php-1.25wmf8/extensions/CentralNotice: push CentralNotice updates (duration: 00m 06s) [22:10:52] Logged the message, Master [22:11:03] !log awight Synchronized php-1.25wmf9/extensions/CentralNotice: push CentralNotice updates (duration: 00m 06s) [22:11:06] Logged the message, Master [22:11:09] done [22:11:24] (03PS1) 10EBernhardson: Enable wgContentHandlerUseDB on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175565 [22:11:24] AndyRussG: ejegg: ok patches are pushed. about to deploy the config switch... [22:11:30] (03PS2) 10Ori.livneh: Don't write empty file [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/175562 [22:11:44] Wee.... 
[22:11:50] (03CR) 10Ottomata: [C: 032 V: 032] Don't write empty file [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/175562 (owner: 10Ori.livneh) [22:12:29] (03CR) 10Awight: [C: 032] Enable CentralNotice client banner choice everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175556 (owner: 10Awight) [22:12:37] (03Merged) 10jenkins-bot: Enable CentralNotice client banner choice everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175556 (owner: 10Awight) [22:12:53] (03PS1) 10Ottomata: Update varnishkafka module with fix for empty stats file [puppet] - 10https://gerrit.wikimedia.org/r/175566 [22:12:54] yes I win ^ ori :p [22:13:08] heh [22:13:09] (03CR) 10Ottomata: [C: 032 V: 032] Update varnishkafka module with fix for empty stats file [puppet] - 10https://gerrit.wikimedia.org/r/175566 (owner: 10Ottomata) [22:13:39] !log awight Synchronized wmf-config: Enable CentralNotice 2.5.0 client banner choice, everywhere (duration: 00m 05s) [22:13:42] Logged the message, Master [22:14:55] bblack: , MaxSem, thx for reviews. yurikR, your turn https://gerrit.wikimedia.org/r/175550 [22:15:04] (if you're not already on it) [22:15:09] dr0ptp4kt, yep ) [22:17:47] So, if iptables says "ACCEPT all -- neon.wikimedia.org anywhere" [22:18:07] shouldn't I get something if I 'telnet 11000'? [22:18:09] (03PS1) 10Yuvipanda: shinken: Re-notify about dead hosts/services less frequently [puppet] - 10https://gerrit.wikimedia.org/r/175569 [22:18:15] That works as localhost on the system with that firewall rule [22:18:25] (03PS2) 10Yuvipanda: shinken: Re-notify about dead hosts/services less frequently [puppet] - 10https://gerrit.wikimedia.org/r/175569 [22:18:30] andrewbogott: ipv4 vs ipv6? there's separate tables, but cmdline dns lookup could be either [22:19:07] bblack: does iptables --list only show the v4 rules? 
[22:19:09] (03CR) 10Yurik: [C: 032] Make mdot webroot redirects agree with W0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175550 (owner: 10Dr0ptp4kt) [22:19:18] (03Merged) 10jenkins-bot: Make mdot webroot redirects agree with W0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175550 (owner: 10Dr0ptp4kt) [22:19:20] RECOVERY - puppet last run on cp1069 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [22:19:37] try "iptables -vnL" for v4 and "ip6tables -vnL" for v6 [22:20:01] yurikR: thanks. you got the deployment of the change covered, right? or is that one of the auto-deploying type of changes nowadays? [22:20:26] (03CR) 10Yuvipanda: [C: 032] shinken: Re-notify about dead hosts/services less frequently [puppet] - 10https://gerrit.wikimedia.org/r/175569 (owner: 10Yuvipanda) [22:20:59] (or alternatively, test this theory by doing your telnet with an explicit ipv4 addr instead of a hostname) [22:21:32] bblack: Connection refused for 208.80.154.18 as well [22:21:52] do you have some context on where this is from? [22:22:01] RECOVERY - puppet last run on cp3022 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:22:30] (or to? 
neon doesn't listen on 11000) [22:23:11] RECOVERY - puppet last run on cp3020 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [22:24:04] bblack: icinga has recently started complaining that memcached is down on virt1000 [22:24:07] (03PS1) 10Faidon Liambotis: geoip: kill geoliteupdate in favor of geoipupdate [puppet] - 10https://gerrit.wikimedia.org/r/175571 [22:24:09] bblack: ^ [22:24:13] I can see that it's up, but the port is blocked from neon [22:24:33] So, I presume that the failure message is due to neon not being able to reach port 11000 on virt1000 [22:24:37] not (just) the change itself, but that hidden userid/licensekey [22:24:40] you might find it useful :) [22:24:41] andrewbogott: this most likely involves firewall rules on the network hardware then if it's labs-related [22:24:59] possible. I don't know why that would've changed last week though [22:25:03] it broke on the 18th [22:26:34] paravoid: wow, awesome. but still can't use it for testsuite data, because then my tests get broken by new data :/ [22:26:53] sure, maxmind uses a submodule as well [22:27:06] but it's handy [22:27:11] yeah in case you didn't see already, after trying out several other alternatives, I gave up and did the submodule thing [22:27:21] yup, I saw [22:28:03] the other acceptable alternative was to push the new files to S3 like the old ones and download from there, but the submodules are cleaner and I don't want to pay for increasing S3 costs in the future :) [22:29:51] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There are 2 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/). [22:30:11] yurikR, yours? ^^^ [22:30:16] andrewbogott: this isn't an iptables / firewall issue. memcached on virt1000 is only listening on 127.0.0.1 for whatever reason [22:30:29] oh, that'd do it.
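The dead end above (iptables ACCEPTs the traffic, yet telnet to port 11000 gets "Connection refused") is resolved a few messages later: memcached on virt1000 was bound to 127.0.0.1 only, so no firewall rule could have helped. A minimal sketch of that distinction, in Python rather than the actual icinga/NRPE check; the `probe` helper is hypothetical, standing in for a plain TCP check:

```python
import socket

def probe(host, port, timeout=2.0):
    """TCP probe, roughly what the icinga TCP check and the telnet test do:
    True if something accepts the connection, False on refusal/timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# A daemon bound to loopback only (like memcached on virt1000 here).
# Port 0 asks the kernel for any free port.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
port = srv.getsockname()[1]

print(probe("127.0.0.1", port))   # True: reachable on loopback

# A remote host (neon) connecting to the box's routable address would get
# ECONNREFUSED, because the socket simply isn't bound there -- the same
# result as probing a port nothing listens on:
srv.close()
print(probe("127.0.0.1", port))   # False: nothing listening -> refused
```

The point being debugged above: "connection refused" means the kernel answered and nothing was listening on that address, whereas a firewall DROP would typically show up as a timeout instead.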
[22:30:35] MaxSem, i haven't depl anything [22:30:40] Hm, I wonder if maybe the issue is just that a test was added incorrectly [22:30:41] only +2ed the caching change [22:30:57] didn't deploy it [22:31:19] yurikR, and it's complaining you didn't deploy it [22:31:31] MaxSem, thx, i'm an idiot [22:31:37] didn't look which repo it's in ( [22:31:43] can i depl it now [22:31:44] ? [22:31:48] csteipp, ? [22:31:51] greg-g, ^ [22:32:24] (03CR) 10Faidon Liambotis: [C: 04-1] realm.pp - remove pmtpa (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/173476 (owner: 10Dzahn) [22:32:57] greg-g, would that be ok to depl https://gerrit.wikimedia.org/r/#/c/175550/ now? [22:33:12] (03CR) 10BBlack: [C: 031] geoip: kill geoliteupdate in favor of geoipupdate [puppet] - 10https://gerrit.wikimedia.org/r/175571 (owner: 10Faidon Liambotis) [22:34:08] awight, ejegg, AndyRussG, are you deploying? [22:34:31] yurikR: I believe it's done! [22:34:54] AndyRussG, ok, i'll deploy https://gerrit.wikimedia.org/r/#/c/175550/ [22:34:55] do you need some deploy slot? [22:35:02] just a minor change [22:35:05] to the config [22:35:18] Lemme check w/ awight [22:35:28] yurikR: all yours [22:35:42] here i go [22:35:54] we were just lingering in case we needed rollback, but the coast looks clear. [22:38:11] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [22:38:46] (03PS1) 10Ori.livneh: bits varnish: serve 204s for /statsv [puppet] - 10https://gerrit.wikimedia.org/r/175575 [22:38:53] ^ ottomata [22:39:14] bblack, paravoid: if you have a sec, there's a teeny-tiny VCL change in that patchset too ^ [22:39:32] what's /statsv?
[22:39:41] endpoint for performance stats [22:40:03] the change itself is fine, the commit message could explain a bit more what that is and why it's needed [22:40:06] as I have no clue :) [22:40:42] paravoid: https://gerrit.wikimedia.org/r/#/c/89359/ , see mark's response (and the date) :) [22:40:44] i'll amend the commit [22:40:46] plus a comment in the VCL as well? we didn't have one for EL either and not everyone knows what event.gif is used for [22:40:53] ohh [22:40:54] _that_ [22:40:57] I remember that :) [22:41:31] RECOVERY - Memcached on virt1000 is OK: TCP OK - 0.008 second response time on port 11000 [22:42:29] what kind of stats are you planning to collect? [22:42:39] that's not pertinent to the VCL +2 of course, I'm just curious [22:43:10] (03PS1) 10Andrew Bogott: Have virt1000 memcached listen to external IPs. [puppet] - 10https://gerrit.wikimedia.org/r/175577 [22:44:04] (03CR) 10Faidon Liambotis: [C: 04-2] "We used to have a memcache nrpe check. That's preferrable to this, for security reasons." [puppet] - 10https://gerrit.wikimedia.org/r/175577 (owner: 10Andrew Bogott) [22:44:39] andrewbogott: the blog manifest was like that, I'm not sure if we still have that manifest around or you have to go git hunting [22:44:50] !log yurik Synchronized mobilelanding.php: https://gerrit.wikimedia.org/r/#/c/175550/ (duration: 00m 05s) [22:44:56] Logged the message, Master [22:45:10] dr0ptp4kt, MaxSem ^ [22:45:15] paravoid: ok… any idea what happened last week that caused icinga to start complaining? I see some patches in that area but no smoking gun. [22:45:18] yurikR: thx [22:45:39] no, no clue [22:47:01] ori: the mwdeploy key added to labs/private recently, that's not secret right? [22:47:27] (03PS2) 10Ori.livneh: bits varnish: serve 204s for /statsv [puppet] - 10https://gerrit.wikimedia.org/r/175575 [22:47:32] bblack: different key [22:47:37] paravoid: the production memcached servers listen on 0.0.0.0, and virt1000 uses the same module.
Setting up different monitoring for virt1000 may involve a lot of extra code :( [22:47:48] ori: it's a yes or no question :) [22:47:52] no [22:48:00] I guess this is one more reason to integrate wikitech with the other prod wikis [22:48:05] andrewbogott: yes [22:48:10] paravoid: amended [22:48:19] andrewbogott: not much extra code, no [22:48:21] just making sure, because labs has been privater than it used to be for a couple of commits, and I want to make sure nobody's relying on that before I open it back up [22:48:33] andrewbogott: I made the bind address configurable, specifically for this use case [22:48:43] ori: ditto the passwords::phab commit? Ie5677702115b8347d6c867e0696242a35b720ef8 ? [22:49:12] andrewbogott: there's an "ip" parameter to the module [22:49:23] bblack: yep [22:49:41] paravoid: you mean the one that my patch changes? [22:49:59] !log opening up access to labs/private repo in gerrit perms [22:50:03] Logged the message, Master [22:50:26] andrewbogott: well.. yes :) [22:50:39] and if you look at the module [22:50:48] it actually checks for 127.0.0.1 and sets up an nrpe check instead [22:50:53] hm, so it does [22:50:58] So, something else broke /that/ I guess. 
[22:51:02] * andrewbogott digs deeper [22:51:08] well, yes [22:51:09] (03PS1) 10Legoktm: Update README [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175579 [22:51:10] if $memcached_ip == '127.0.0.1' { [22:51:28] someone renamed the parameter [22:51:33] 65f6274c85c9f42a9023df6cc161da808a3f2413 [22:51:34] ori :) [22:51:36] (03PS1) 10Ottomata: Add logrotate confs for per instance varnishkafka stats files [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/175580 [22:52:10] ori ^ [22:52:20] andrewbogott: that's the culprit [22:52:26] merged Nov 17th [22:52:38] you just need to change that line above to "if $ip == '127.0.0.1'" [22:52:43] (03PS1) 10Andrew Bogott: Replace memcached_ip with ip in one last place [puppet] - 10https://gerrit.wikimedia.org/r/175583 [22:53:12] (03CR) 10Ori.livneh: [C: 031] Replace memcached_ip with ip in one last place [puppet] - 10https://gerrit.wikimedia.org/r/175583 (owner: 10Andrew Bogott) [22:53:14] (03PS2) 10Ottomata: Add logrotate confs for per instance varnishkafka stats files [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/175580 [22:53:23] (03Abandoned) 10Andrew Bogott: Have virt1000 memcached listen to external IPs. [puppet] - 10https://gerrit.wikimedia.org/r/175577 (owner: 10Andrew Bogott) [22:53:35] the module is much nicer now though [22:53:44] that memcached_ parameter prefix was crap [22:53:54] paravoid: that's clearly it; thank you [22:54:21] (03CR) 10Andrew Bogott: [C: 032] Replace memcached_ip with ip in one last place [puppet] - 10https://gerrit.wikimedia.org/r/175583 (owner: 10Andrew Bogott) [22:54:27] paravoid: :) [22:54:28] thanks [22:54:36] (03CR) 10Faidon Liambotis: [C: 032] "That's awesome, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/175575 (owner: 10Ori.livneh) [22:54:50] (03CR) 10Ori.livneh: [C: 031] Add logrotate confs for per instance varnishkafka stats files [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/175580 (owner: 10Ottomata) [22:56:02] ori, don't mind, but do you want all reqs to have to end in the slash ? [22:56:06] no: [22:56:12] /statsv?key=val [22:56:12] ? [22:56:47] yeah, i was going to use the path too [22:57:05] (03CR) 10Legoktm: [C: 032] Update README [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175579 (owner: 10Legoktm) [22:57:21] (03Merged) 10jenkins-bot: Update README [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175579 (owner: 10Legoktm) [22:57:31] 00:42 < paravoid> what kind of stats are you planning to collect? [22:57:31] (03CR) 10Ottomata: [C: 032 V: 032] Add logrotate confs for per instance varnishkafka stats files [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/175580 (owner: 10Ottomata) [22:57:35] ori: ^ [22:57:38] just curious :) [22:57:54] (03CR) 10CSteipp: [C: 04-1] "Let's wait until after Wed and we can discuss" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175565 (owner: 10EBernhardson) [22:58:00] paravoid: timing data for ajax and js functions [23:00:08] (03PS1) 10Ottomata: Update varnishkafka module with logrotate changes [puppet] - 10https://gerrit.wikimedia.org/r/175590 [23:01:18] (03CR) 10Ottomata: [C: 032 V: 032] Update varnishkafka module with logrotate changes [puppet] - 10https://gerrit.wikimedia.org/r/175590 (owner: 10Ottomata) [23:01:26] !log legoktm Synchronized README: Updating README https://gerrit.wikimedia.org/r/175579 (duration: 00m 05s) [23:01:28] Logged the message, Master [23:01:54] logmsgbot: did you just deploy a README fix? :) [23:02:01] greg-g: his first ever deploy! [23:02:04] i'm showing him the ropes [23:02:07] be very afraid [23:02:12] oh, awesome! [23:02:27] and uh, tab complete fail [23:02:32] legoktm: congrats! 
:) [23:02:42] legoktm: Congratulations. :-) [23:02:56] (03CR) 10Dzahn: [C: 032] "bug-attachment.wikimedia.org is an alias for zirconium.wikimedia.org." [puppet] - 10https://gerrit.wikimedia.org/r/175128 (owner: 10Dzahn) [23:03:12] Today readme, tomorrow broken wikis :p [23:04:54] legoktm: Congrats! [23:04:56] paravoid: we might want to push counter metrics through statsv one day...we will see [23:05:06] * jamesofur notes another name on his list of people to bother for deploys [23:05:07] like, banner impressions +1, or something [23:05:31] did someone say banner impressions tats [23:05:34] stats [23:05:36] that can totally be fabricated by anyone though, no? [23:05:39] !log legoktm Synchronized php-1.25wmf9/extensions/NavigationTiming: Update NavigationTiming for https://gerrit.wikimedia.org/r/#/c/175584/ (duration: 00m 05s) [23:05:41] Logged the message, Master [23:05:45] paravoid, maybe so, maybe so [23:06:00] and why would we do that from clients? [23:06:09] that has not had a lot of thought about it, which is why ori didn't mention it, but it's something we might want to have one day [23:06:23] why would we do that? [23:06:24] just an example, app stats from clients is useful i think, no? [23:06:27] paravoid: there's a plan for that, but i can only explain in a bit [23:06:32] ok [23:06:40] otherwise we have to parse through lots of request logs just to figure stuff out [23:09:33] What is this? https://integration.wikimedia.org/ci/job/mwext-DonationInterface-testextension/256/console [23:09:39] 23:08:35 IOError: Lock at '/srv/ssd/jenkins-slave/workspace/mwext-DonationInterface-testextension/src/vendor/.git/HEAD.lock' could not be obtained [23:09:59] awight: was probably a transient thing? [23:10:06] I rechecked... [23:10:46] laters all, i'm out for thanksgiving holiday, see yaaas!
[23:11:18] !log legoktm Synchronized php-1.25wmf8/extensions/NavigationTiming/README.md: Update NavigationTiming https://gerrit.wikimedia.org/r/175585 (duration: 00m 05s) [23:11:20] Logged the message, Master [23:12:13] bblack: to disable caching for one varnish backend, "return (pass);" is reasonable, right [23:13:13] mutante: I'm really not sure, it depends a lot on the context of all that. (sorry, there are no easy answers in varnishland!) [23:13:46] is this for bugz? [23:14:20] bblack: yea, for old-bz, and i see it being used on people.wikimedia.org to disable caching of public_html dirs [23:14:50] yes, return pass during vcl_recv should do [23:15:22] and appears in actually http://www.mediawiki.org/wiki/Manual:Varnish_caching [23:15:32] ok, thanks [23:16:26] it's just sometimes with the other (non-misc) instances, things get complicated with front-vs-back, etc [23:17:32] ah,ok, just touching misc, yep [23:21:44] paravoid: the banner impression stats are a tale of much woe and sadness. right now it's a request to the app servers. instead of a 204, there's a dummy, blank special page called Special:RecordImpression. now that /a/common* has been cleaned up, Special:RecordImpression is my leading contender for "worst thing in WMF universe" [23:22:16] (03PS1) 10Dzahn: move old-bugzilla behind misc varnish [puppet] - 10https://gerrit.wikimedia.org/r/175595 [23:22:26] I've dealt with an outage that surrounded Special:RecordImpression [23:22:39] don't remember the details, I just remember that this page exists [23:27:58] uncacheable things embedded in cacheable content are all my leading contenders for "worst things in the WMF universe" [23:30:38] (03PS2) 10Dzahn: move old-bugzilla behind misc varnish [puppet] - 10https://gerrit.wikimedia.org/r/175595 [23:31:45] ori: yep. I want Varnish to respond "" without hitting PHP. Or do you have a paradigm fix? 
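For reference, the two VCL idioms under discussion, sketched in Varnish 3 style (hypothetical host and URL matches; the real changes are the linked gerrit patches, 175575 and 175595): `return (pass)` in `vcl_recv` makes Varnish skip the cache lookup and always fetch from the backend, and a synthetic `error 204` answers a beacon URL without ever contacting the PHP app servers, which is what awight is asking for with Special:RecordImpression:

```vcl
sub vcl_recv {
    // old-bz style: never cache, always hand the request to the backend
    // (hypothetical host match for illustration)
    if (req.http.Host == "old-bugzilla.wikimedia.org") {
        return (pass);
    }
    // statsv-style beacon: the request is only useful for its log entry,
    // so answer it in Varnish itself with an empty 204
    if (req.url ~ "^/statsv") {
        error 204;
    }
}

sub vcl_error {
    // deliver the synthetic 204 as-is; a 204 carries no body
    if (obj.status == 204) {
        return (deliver);
    }
}
```

This is a simplified sketch only; as bblack notes above, on the non-misc clusters the frontend/backend VCL split makes the real placement of such rules more involved.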
[23:32:46] This is really becoming an issue for us: 23:31:59 IOError: Lock at '/srv/ssd/jenkins-slave/workspace/mwext-DonationInterface-testextension/src/vendor/.git/HEAD.lock' could not be obtained [23:33:06] anyone have clues what's causing the fail? [23:33:30] a git process was killed before it could clean up. someone needs to rm it. just override jenkins for now? [23:33:40] (03CR) 10John F. Lewis: [C: 031] move old-bugzilla behind misc varnish [puppet] - 10https://gerrit.wikimedia.org/r/175595 (owner: 10Dzahn) [23:34:05] awight: I never understood why we should have a Special:RecordImpression anyway. If you trust the request logs, isn't it enough to have the request logs for the resource that you're interested in? And if you don't trust the request logs, wouldn't you need a Special:RecordRecordImpressionImpression, ad infinitum? [23:34:28] ori: I'd rather ask someone to rm, cos we heavily rely on the php unit tests for that extension. [23:35:10] ori: re Special:RecordImpression, the issue is that we need a record of *not* displaying a banner, as well. [23:35:57] Can someone kill rm -rf /srv/ssd/jenkins-slave/workspace/mwext-DonationInterface-testextension/ for us? [23:36:02] s/kill// [23:36:02] !log gallium: rm -f'd /srv/ssd/jenkins-slave/workspace/mwext-DonationInterface-testextension/src/vendor/.git/HEAD.lock [23:36:05] Logged the message, Master [23:36:10] ori: while you have my brain on these things, can you explain to me what's happening with e.g.: curl -v 'http://en.wikipedia.org/w/index.php?title=MediaWiki:Wikiminiatlas.js&action=raw&ctype=text/javascript&smaxage=21600&maxage=86400' ? [23:36:11] awight: why -r? :P [23:36:11] ori: that's awesome, thank you. [23:36:53] (it's among our top frontend hits, and returns 200 OK with no content and uncacheable to the user?) 
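The `HEAD.lock` error above is the usual symptom of a git process that died before releasing its lock; the fix applied on gallium was a plain `rm -f` of the lock file. A slightly more guarded sketch of that cleanup (hypothetical helper, not what Jenkins actually runs; it assumes a lock untouched for a few minutes is stale):

```python
import os
import time

def clear_stale_git_lock(lock_path, max_age_s=300):
    """Remove a leftover git lock file, but only if it is old enough
    that no live git process is plausibly still holding it."""
    try:
        age = time.time() - os.path.getmtime(lock_path)
    except FileNotFoundError:
        return False          # nothing to do
    if age < max_age_s:
        return False          # might still be in use; leave it alone
    os.remove(lock_path)
    return True
```

Used on the path from the Jenkins console output, this would have cleared the lock without a manual shell on the slave, while refusing to touch a lock a running `git` just created.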
[23:37:50] (03PS1) 10Dzahn: old-bugzilla: switch over to misc-web [dns] - 10https://gerrit.wikimedia.org/r/175601 [23:37:50] bblack: the 200 OK with no content is because someone blanked the article [23:38:05] bblack: see https://en.wikipedia.org/w/index.php?title=MediaWiki:Wikiminiatlas.js [23:38:39] they did? [23:39:04] it's a blank page [23:39:13] bblack: I know... please help! This month might be a bad time to replace the existing functionality, but building a parallel way of recording the same data, that we could enable in January, would be a great service. [23:39:14] the reason it's being requested is another matter; it's probably in some Common.js [23:39:15] was it ever a non-blank page? [23:39:37] bblack: oh, sorry I see you're talking about stupid wikiminiatlas. that's another matter. [23:39:48] ori: there are several index.php?title=something.js like that which top the charts in our frontend request hits. if they're really empty, then yeah it would be awesome to remove references to them so they stop getting fetched. 
I'm more worried about those that aren't empty [23:40:09] :) [23:40:13] not that you're wrong [23:40:14] well sure that too :) [23:40:21] but there's just some really horrible shit in there [23:40:23] these 5 top the list in my usual samples: [23:40:24] 311.74 RxURL /w/index.php?title=MediaWiki:RefToolbar.js&action=raw&ctype=text/javascript [23:40:27] 304.30 RxURL /w/index.php?title=MediaWiki:Gadget-refToolbarBase.js&action=raw&ctype=text/javascript [23:40:30] yeah I know [23:40:30] paravoid: there's no history, but i think it once existed and was deleted [23:40:31] 295.41 RxURL /w/index.php?title=MediaWiki:Wikibugs.js&action=raw&ctype=text/javascript [23:40:33] 294.41 RxURL /w/index.php?title=MediaWiki:Wikiminiatlas.js&action=raw&ctype=text/javascript&smaxage=21600&maxage=86400 [23:40:36] 242.52 RxURL /w/index.php?title=MediaWiki:OSM.js&action=raw&ctype=text/javascript&smaxage=21600&maxage=86400 [23:40:45] there's a BZ that I've listed those somewhere [23:40:51] (top the list of all inbound requests by unique url) [23:40:56] https://en.wikipedia.org/wiki/Special:WhatLinksHere/MediaWiki:Wikiminiatlas.js [23:40:59] hmm, someone loading gadgets from a remote wiki? [23:41:31] where is that Gadgets 2.0? :P [23:41:43] bblack, paravoid: as for tracking down usage, you can run 'mwgrep' from tin, which uses ElasticSearch to query the MediaWiki: and User: namespaces of all wikis for a particular string [23:42:10] I've used that before too :) [23:42:19] it's very handy [23:42:39] bd808 also gave you credit for it recently on wmfall [23:42:45] (03CR) 10Dzahn: [C: 032] move old-bugzilla behind misc varnish [puppet] - 10https://gerrit.wikimedia.org/r/175595 (owner: 10Dzahn) [23:43:17] I also think you could legitimately claim that it's not an ops problem and escalate it to another team via a phabricator task or something.
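To make the cacheability point above concrete: the action=raw URLs carry `smaxage`/`maxage` query hints, but what a shared cache like Varnish may actually store is decided by the `Cache-Control` header on the response itself (e.g. `private, s-maxage=0, max-age=0, must-revalidate` forbids shared caching entirely, while an honored `smaxage=21600` would allow six hours). A simplified sketch of how a shared cache reads such a header; the helper is hypothetical, loosely following RFC 7234 and not Varnish's exact builtin logic:

```python
def shared_cache_ttl(cache_control):
    """Effective freshness lifetime for a *shared* cache, simplified:
    private/no-store forbid storing; s-maxage wins over max-age.
    Responses with no freshness info are treated as uncacheable here
    (real caches may apply heuristic freshness instead)."""
    directives = {}
    for part in cache_control.split(","):
        part = part.strip().lower()
        if "=" in part:
            key, _, value = part.partition("=")
            directives[key.strip()] = value.strip()
        elif part:
            directives[part] = None

    if "private" in directives or "no-store" in directives:
        return 0
    for key in ("s-maxage", "max-age"):
        if key in directives:
            try:
                return int(directives[key])
            except ValueError:
                return 0
    return 0

print(shared_cache_ttl("private, s-maxage=0, max-age=0, must-revalidate"))  # 0
print(shared_cache_ttl("public, s-maxage=21600, max-age=86400"))            # 21600
```

This is why the hot `MediaWiki:*.js` URLs above show up so heavily in the frontend request samples: every hit that answers with a shared-cache TTL of 0 goes all the way to the app servers.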
But it'll probably go to MediaWiki Core :/ [23:43:19] the RefToolBar.js one sends 30KB of javascript and has < Cache-Control: private, s-maxage=0, max-age=0, must-revalidate [23:43:30] We have so many cool tools it's hard to keep track of them. [23:44:50] bblack: that one is a gadget, by the looks of it: see [23:44:55] refToolbar, adds a "cite" button to the editing toolbar for quick and easy addition of commonly used citation templates. (View source | Export) [23:44:55] Uses: Gadget-refToolbar.js [23:44:55] Enabled for everyone by default. [23:45:21] as I was saying most recently at the zurich hackathon [23:45:30] we should set fire to most of that stuff [23:45:31] gadgets enabled for everyone by default, especially on popular wikis, should really die [23:45:38] yeah [23:45:55] the attack vector, of all things [23:47:03] paravoid: Gadgets enabled by default aren't that bad if they use ResourceLoader [23:47:15] the thing is: Gadgets support ResourceLoader, people just don't always take advantage of it [23:47:17] and get CR ;) [23:47:20] Unfortunately, ResourceLoader usage is still opt-in for Gadgets because... well legacy [23:47:29] like loading using these nice URLs from a different wiki [23:47:54] (03PS1) 10BryanDavis: Use Monolog provider for beta logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175604 [23:47:58] RoanKattouw, the main issue is still cross-wiki usage. where's Gadgets 2.0? [23:48:04] Ask legoktm :P [23:48:10] what AaronSchulz said [23:48:13] (03CR) 10BryanDavis: [C: 04-1] "Needs fixing for a bad class name" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175604 (owner: 10BryanDavis) [23:48:17] i uh.... [23:48:19] (03PS2) 10BryanDavis: Use Monolog provider for beta logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175604 [23:48:33] random people writing javascript that no one reviews and pushing it to millions of users [23:48:36] what could possibly go wrong [23:49:01] (03CR) 10John F.
Lewis: [C: 031] old-bugzilla: switch over to misc-web [dns] - 10https://gerrit.wikimedia.org/r/175601 (owner: 10Dzahn) [23:49:23] holy motherfucking fuck [23:49:25] https://en.wikipedia.org/wiki/MediaWiki:Gadget-refToolbar.js [23:49:30] paravoid: until we stop letting those random people add raw HTML onto every page view, something that gadgets can't even do, I don't see gadgets as that bad :P [23:50:02] "okay, so we have RL. screw that, load manually" [23:51:30] (03PS3) 10BryanDavis: Use Monolog provider for beta logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175604 [23:52:07] bblack, I'm not sure I have sufficient mad skillz to fix RefToolbar, but I'll post to VPT [23:52:41] MaxSem: it should not be enabled by default -- can you please ask that it be made optional? [23:52:59] ori, MaxSem: What's the issue with RefToolbar? [23:53:01] filed https://phabricator.wikimedia.org/T75810 for sunsetting non-RL gadgets [23:53:04] they would just object to it [23:53:19] it should just be fixed to use RL properly [23:53:34] woot progress :) [23:53:35] kaldari|2, importScript() [23:54:08] ori, will not help unless you will also kill importScript ;) [23:54:39] kaldari|2: it uses importScript() to load additional resources, which requires a full, additional round-trip to the server to fetch, all the while the loading and rendering of content is blocked. [23:54:56] ...and the servers are hurt [23:55:38] ori: I haven't had a chance to fix up https://gerrit.wikimedia.org/r/#/c/166363/ yet [23:56:03] ori, MaxSem: ah. RefToolbar was originally in Common.js and only loaded in the editing interface. Not sure why that was changed. [23:57:13] ori, MaxSem: actually looks like it's still only loaded in editing interface, so not the worst thing [23:57:43] but would be good to fix :) [23:57:49] well, because of that site is still up. 
but that doesn't mean that it's not bad:P [23:58:08] this is probably a question for #-qa, but before i join and make a fool out of myself [23:58:21] didn't we have an account with some service that had VMs with various platforms? [23:58:38] MaxSem, ori: RefToolbar needs to be moved into core really. [23:58:49] or an extension [23:58:52] for Satan's sake, no [23:58:59] I want a single URL to be fetched from as many browsers/OSes as possible [23:59:16] extension, yes - but even then, people keep adjusting it for local wiki's needs [23:59:16] MaxSem: why not? [23:59:23] true [23:59:35] so, like 700 extensions? :P [23:59:39] paravoid: yea, there is a browserstack.com login somewhere i believe [23:59:51] paravoid: I use Sauce Labs, which is free for open source projects