[00:52:48] (PS1) Ori.livneh: base_jit_size: 100 Mb -> 200 Mb [puppet] - https://gerrit.wikimedia.org/r/209995
[00:54:38] (PS2) Ori.livneh: base_jit_size: 100 Mb -> 200 Mb [puppet] - https://gerrit.wikimedia.org/r/209995
[02:18:40] !log l10nupdate Synchronized php-1.26wmf4/cache/l10n: (no message) (duration: 06m 19s)
[02:18:55] Logged the message, Master
[02:23:28] !log LocalisationUpdate completed (1.26wmf4) at 2015-05-11 02:22:25+00:00
[02:23:34] Logged the message, Master
[02:39:25] !log l10nupdate Synchronized php-1.26wmf5/cache/l10n: (no message) (duration: 05m 37s)
[02:39:31] Logged the message, Master
[02:43:46] !log LocalisationUpdate completed (1.26wmf5) at 2015-05-11 02:42:42+00:00
[02:43:55] Logged the message, Master
[03:10:27] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0]
[03:20:16] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[03:29:17] PROBLEM - High load average on labstore1001 is CRITICAL 50.00% of data above the critical threshold [24.0]
[03:32:27] PROBLEM - High load average on labstore1001 is CRITICAL 55.56% of data above the critical threshold [24.0]
[03:41:15] (PS1) Ori.livneh: Add vlogdump to varnish module [puppet] - https://gerrit.wikimedia.org/r/209999
[03:45:26] PROBLEM - High load average on labstore1001 is CRITICAL 50.00% of data above the critical threshold [24.0]
[03:48:36] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0]
[04:00:36] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0]
[04:10:07] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[04:15:43] !log restarted hhvm on mw1020. lots of fatal noise about sudo killall update-notifier
[04:15:46] sudo mv /usr/bin/update-notifier /usr/bin/update-notifier.real
[04:15:49] echo -e '#!/bin/bash\nwhile :; do /bin/sleep 86400; done' | sudo tee /usr/bin/update-notifier
[04:15:50] Logged the message, Master
[04:15:52] grr
[04:17:43] !log restarted hhvm on mw1020. lots of fatal noise about N4HPHP13DataBlockFullE
[04:17:46] Logged the message, Master
[04:26:18] (PS1) Tim Landscheidt: Tools: Puppetize database aliases as host resources [puppet] - https://gerrit.wikimedia.org/r/210000 (https://phabricator.wikimedia.org/T63897)
[04:27:00] (CR) jenkins-bot: [V: -1] Tools: Puppetize database aliases as host resources [puppet] - https://gerrit.wikimedia.org/r/210000 (https://phabricator.wikimedia.org/T63897) (owner: Tim Landscheidt)
[04:32:49] (PS2) Tim Landscheidt: Tools: Puppetize database aliases as host resources [puppet] - https://gerrit.wikimedia.org/r/210000 (https://phabricator.wikimedia.org/T63897)
[04:33:37] (CR) jenkins-bot: [V: -1] Tools: Puppetize database aliases as host resources [puppet] - https://gerrit.wikimedia.org/r/210000 (https://phabricator.wikimedia.org/T63897) (owner: Tim Landscheidt)
[04:35:17] (CR) Tim Landscheidt: "That looks like a Jenkins failure; need to "recheck" some time." [puppet] - https://gerrit.wikimedia.org/r/210000 (https://phabricator.wikimedia.org/T63897) (owner: Tim Landscheidt)
[04:49:26] springle: doh, I know what that's about -- https://gerrit.wikimedia.org/r/#/c/209995/ should fix it
[04:50:08] ah
[04:50:15] i'll make it go away by downgrading the package for now
[04:55:01] !log LocalisationUpdate ResourceLoader cache refresh completed at Mon May 11 04:53:58 UTC 2015 (duration 53m 57s)
[04:55:05] Logged the message, Master
[04:56:54] (PS1) Ori.livneh: Canary app servers: Set HHVM's base TC cache size to 183.5 Mb, matching labs [puppet] - https://gerrit.wikimedia.org/r/210002
[05:03:51] (PS2) Ori.livneh: Canary app servers: Set HHVM's base TC cache size to 183.5 Mb, matching labs [puppet] - https://gerrit.wikimedia.org/r/210002
[05:05:17] (CR) Ori.livneh: [C: 2] Canary app servers: Set HHVM's base TC cache size to 183.5 Mb, matching labs [puppet] - https://gerrit.wikimedia.org/r/210002 (owner: Ori.livneh)
[05:10:47] !log upgrading canary appservers to 3.6.1+dfsg1-1+wm2
[05:10:52] Logged the message, Master
[05:15:26] PROBLEM - DPKG on mw1019 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[05:15:47] PROBLEM - DPKG on mw1022 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[05:15:57] PROBLEM - DPKG on mw1024 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[05:15:57] PROBLEM - DPKG on mw1025 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[05:15:57] PROBLEM - DPKG on mw1018 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[05:16:27] PROBLEM - DPKG on mw1023 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[05:19:47] RECOVERY - DPKG on mw1023 is OK: All packages OK
[05:20:17] RECOVERY - DPKG on mw1019 is OK: All packages OK
[05:20:46] RECOVERY - DPKG on mw1022 is OK: All packages OK
[05:20:57] RECOVERY - DPKG on mw1024 is OK: All packages OK
[05:20:57] RECOVERY - DPKG on mw1025 is OK: All packages OK
[05:20:57] RECOVERY - DPKG on mw1018 is OK: All packages OK
[05:24:37] PROBLEM - DPKG on mw1115 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[05:24:47] PROBLEM - DPKG on mw1119 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[05:25:17] PROBLEM - DPKG on mw1117 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[05:25:26] PROBLEM - DPKG on mw1118 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[05:25:46] PROBLEM - DPKG on mw1114 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[05:25:46] PROBLEM - DPKG on mw1116 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[05:26:58] RECOVERY - DPKG on mw1118 is OK: All packages OK
[05:27:17] RECOVERY - DPKG on mw1116 is OK: All packages OK
[05:27:48] RECOVERY - DPKG on mw1115 is OK: All packages OK
[05:27:57] RECOVERY - DPKG on mw1119 is OK: All packages OK
[05:28:27] RECOVERY - DPKG on mw1117 is OK: All packages OK
[05:30:27] RECOVERY - DPKG on mw1114 is OK: All packages OK
[05:56:42] (PS1) Springle: whitespace consistency [mediawiki-config] - https://gerrit.wikimedia.org/r/210003
[05:58:42] (CR) Springle: [C: 2] whitespace consistency [mediawiki-config] - https://gerrit.wikimedia.org/r/210003 (owner: Springle)
[05:58:48] (Merged) jenkins-bot: whitespace consistency [mediawiki-config] - https://gerrit.wikimedia.org/r/210003 (owner: Springle)
[06:00:13] (PS1) Springle: expand list of CODFW slaves ready for traffic [mediawiki-config] - https://gerrit.wikimedia.org/r/210004
[06:00:19] (CR) jenkins-bot: [V: -1] expand list of CODFW slaves ready for traffic [mediawiki-config] - https://gerrit.wikimedia.org/r/210004 (owner: Springle)
[06:04:57] (CR) Hoo man: "Hardware specs as comments would be awesome (like we have for eqiad)." [mediawiki-config] - https://gerrit.wikimedia.org/r/210004 (owner: Springle)
[06:05:11] (PS2) Springle: expand list of CODFW slaves ready for traffic [mediawiki-config] - https://gerrit.wikimedia.org/r/210004
[06:08:27] PROBLEM - High load average on labstore1001 is CRITICAL 50.00% of data above the critical threshold [24.0]
[06:10:07] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0]
[06:29:37] * springle welcomes jynus
[06:30:10] _joe_: paravoid godog akosiaris ^ \o/
[06:30:47] PROBLEM - puppet last run on cp3042 is CRITICAL Puppet has 1 failures
[06:31:27] PROBLEM - puppet last run on mw2206 is CRITICAL Puppet has 1 failures
[06:31:36] PROBLEM - puppet last run on mw2079 is CRITICAL Puppet has 1 failures
[06:31:36] PROBLEM - puppet last run on mw2097 is CRITICAL Puppet has 1 failures
[06:31:36] PROBLEM - puppet last run on mw2113 is CRITICAL Puppet has 1 failures
[06:31:37] PROBLEM - puppet last run on mw2050 is CRITICAL Puppet has 1 failures
[06:31:55] springle: \o/ \o/
[06:32:02] jynus: welcome!
[06:32:14] godog, thank you!
[06:32:57] PROBLEM - puppet last run on mw1052 is CRITICAL Puppet has 1 failures
[06:33:07] PROBLEM - puppet last run on mw2127 is CRITICAL Puppet has 1 failures
[06:33:16] PROBLEM - puppet last run on mw2017 is CRITICAL Puppet has 1 failures
[06:33:28] PROBLEM - puppet last run on mw1123 is CRITICAL Puppet has 2 failures
[06:33:39] <_joe_> jynus: hola!
[06:33:45] jynus: don't be too scared by bot activity, they are usually quieter :)
[06:33:47] PROBLEM - puppet last run on mw1129 is CRITICAL Puppet has 1 failures
[06:34:01] _joe_, hola, indeed
[06:34:04] <_joe_> this is mod_passenger e' clock
[06:34:08] <_joe_> *o
[06:34:23] <_joe_> when the puppetmaster rotates logs, and mod_passenger chokes
[06:34:32] godog, it is the public channel that scares me
[06:34:43] <_joe_> jynus: public and _logged_
[06:34:47] PROBLEM - puppet last run on mw2123 is CRITICAL Puppet has 2 failures
[06:34:48] PROBLEM - puppet last run on mw2003 is CRITICAL Puppet has 1 failures
[06:34:48] PROBLEM - puppet last run on mw2022 is CRITICAL Puppet has 1 failures
[06:34:48] <_joe_> you'll get used to it
[06:34:50] but I have yet no access to the non-public ones
[06:35:01] jynus: welcome, Jaime!
[06:35:06] moritzm, thank you
[06:38:43] (CR) Springle: "@Hoo, ack. I'll do a follow up for that." [mediawiki-config] - https://gerrit.wikimedia.org/r/210004 (owner: Springle)
[06:46:58] RECOVERY - puppet last run on cp3042 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:47:37] RECOVERY - puppet last run on mw1052 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:47:47] RECOVERY - puppet last run on mw2206 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:47:47] RECOVERY - puppet last run on mw2127 is OK Puppet is currently enabled, last run 50 seconds ago with 0 failures
[06:47:47] RECOVERY - puppet last run on mw2079 is OK Puppet is currently enabled, last run 55 seconds ago with 0 failures
[06:47:48] RECOVERY - puppet last run on mw2123 is OK Puppet is currently enabled, last run 48 seconds ago with 0 failures
[06:47:48] RECOVERY - puppet last run on mw2097 is OK Puppet is currently enabled, last run 47 seconds ago with 0 failures
[06:47:48] RECOVERY - puppet last run on mw2113 is OK Puppet is currently enabled, last run 48 seconds ago with 0 failures
[06:48:01] RECOVERY - puppet last run on mw2003 is OK Puppet is currently enabled, last run 38 seconds ago with 0 failures
[06:48:01] RECOVERY - puppet last run on mw2022 is OK Puppet is currently enabled, last run 24 seconds ago with 0 failures
[06:48:01] RECOVERY - puppet last run on mw2050 is OK Puppet is currently enabled, last run 51 seconds ago with 0 failures
[06:48:01] RECOVERY - puppet last run on mw2017 is OK Puppet is currently enabled, last run 50 seconds ago with 0 failures
[06:48:16] RECOVERY - puppet last run on mw1123 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:48:27] RECOVERY - puppet last run on mw1129 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:07:24] operations, ops-codfw, Patch-For-Review: Set up missing PDUs in codfw and eqiad - https://phabricator.wikimedia.org/T84416#1275493 (fgiunchedi) thanks @papaul! so all rows have at least network equipment connected to the CDUs and should be showing current figures, anything different in config for those?
[07:10:00] operations, ops-eqiad, Patch-For-Review: humidity sensors in eqiad row c/d showing alarms - https://phabricator.wikimedia.org/T98721#1275501 (fgiunchedi) NEW a:Cmjohnson
[07:26:23] operations: SSL cert for svn.wikimedia.org has expired, should move behind misc-web - https://phabricator.wikimedia.org/T98723#1275517 (hashar) NEW
[07:34:16] PROBLEM - High load average on labstore1001 is CRITICAL 50.00% of data above the critical threshold [24.0]
[07:52:39] (CR) Filippo Giunchedi: etcd: create puppet module (9 comments) [puppet] - https://gerrit.wikimedia.org/r/208928 (https://phabricator.wikimedia.org/T97973) (owner: Giuseppe Lavagetto)
[07:55:17] PROBLEM - High load average on labstore1001 is CRITICAL 55.56% of data above the critical threshold [24.0]
[07:58:36] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0]
[08:11:25] operations, WMF-NDA-Requests: Need access to WMF-NDA group - https://phabricator.wikimedia.org/T98722#1275543 (Qgil)
[08:42:21] operations, ops-codfw: ms-be2007.codfw.wmnet: slot=4 dev=sde failed - https://phabricator.wikimedia.org/T98726#1275579 (fgiunchedi) NEW
[08:43:28] ACKNOWLEDGEMENT - RAID on ms-be2007 is CRITICAL 1 failed LD(s) (Offline) Filippo Giunchedi T98726
[08:43:37] ACKNOWLEDGEMENT - puppet last run on ms-be2007 is CRITICAL Puppet has 1 failures Filippo Giunchedi T98726
[08:53:38] operations, Project-Creators: please create the audits-data-retention tag - https://phabricator.wikimedia.org/T97127#1275588 (Aklapper)
[09:03:57] PROBLEM - High load average on labstore1001 is CRITICAL 55.56% of data above the critical threshold [24.0]
[09:12:44] operations: Need access to WMF-NDA group and operations project (?) - https://phabricator.wikimedia.org/T98727#1275593 (jcrespo) NEW
[09:13:46] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0]
[09:20:27] PROBLEM - High load average on labstore1001 is CRITICAL 50.00% of data above the critical threshold [24.0]
[09:23:40] operations: Access for jcrespo to WMF-NDA group and operations project (?) - https://phabricator.wikimedia.org/T98727#1275617 (Aklapper)
[09:23:57] operations: Miscellaneous servers to track in eqiad for possible inclusion in codfw misc virt cluster - https://phabricator.wikimedia.org/T88761#1275619 (akosiaris) >>! In T88761#1274162, @Dzahn wrote: > I'll take planet. So i open a new ticket to request a VM for it and then link it here? Great. I am still...
[09:23:57] jynus: hey, welcome :)
[09:24:13] hello, paravoid, thank you
[09:26:07] jynus: hello and welcome to the team :-)
[09:26:21] hello, akosiaris, thank you very much
[09:26:44] trying hard to keep with all the accounts and permissions and nicks :-)
[09:27:06] <_joe_> jynus: don't worry, after a couple of months you'll get used to it
[09:27:10] (I'm Faidon)
[09:27:19] <_joe_> hell, nowadays I feel comfortable in gerrit
[09:27:23] * _joe_ giuseppe
[09:27:53] hehe indeed, it'll take a bit of /whois at the beginning (I'm filippo)
[09:28:43] well, I do not even have a cloak yet :-(
[09:29:44] * akosiaris is alex
[09:29:46] <_joe_> jynus: it will take time I guess
[09:30:00] <_joe_> akosiaris: I guess he figured that out already
[09:30:02] <_joe_> :P
[09:30:11] * paravoid off, taking another flight
[09:30:16] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 612
[09:30:33] yes, I understand. Just bear my ignorance of many things for a few days! :-)
[09:30:40] _joe_: I was just trying to bombard him with even moaar information to see if he has a breaking point :P
[09:30:45] <_joe_> ahahahahahah
[09:30:52] ;-)
[09:30:53] :-) akosiaris
[09:31:37] well, I am accostumed actually, so it is easier here- I only have one "client" here
[09:32:31] as opposed to 20 at the same time
[09:35:16] RECOVERY - check_mysql on db1008 is OK: Uptime: 2148575 Threads: 1 Questions: 6635932 Slow queries: 14183 Opens: 36528 Flush tables: 2 Open tables: 64 Queries per second avg: 3.088 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[09:38:25] I have been told that there may be an ops hangout later- as I do not yet have access to the ops lists, please do not forget about me!
[09:40:16] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0]
[09:44:39] operations, Beta-Cluster, Labs, Labs-Infrastructure: beta: Get SSL certificates for *.{projects}.beta.wmflabs.org - https://phabricator.wikimedia.org/T50501#1275659 (hashar)
[09:45:01] Hi jynus
[09:45:11] Hello, Nemo_bis
[09:45:21] Jaime Crespo here
[09:45:39] Federico, another Italian (volunteer)
[09:45:47] Your clients are in the millions! Make sure not to be noticed for a while. :P
[09:46:01] pleased to meet you
[09:46:05] right :-)
[10:09:57] PROBLEM - High load average on labstore1001 is CRITICAL 55.56% of data above the critical threshold [24.0]
[10:11:37] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0]
[10:22:23] ori: u there?
[10:31:22] operations, Commons, Wikimedia-log-errors: internal_api_error_DBQueryError: Database query error - https://phabricator.wikimedia.org/T98706#1275739 (Steinsplitter) p:Normal>High API request failed (internal_api_error_DBQueryError): [e6308580] Database query error API request failed (internal_api_e...
[10:33:06] (CR) Alexandros Kosiaris: [C: -1] "Minor inline comments on my part" (4 comments) [puppet] - https://gerrit.wikimedia.org/r/208928 (https://phabricator.wikimedia.org/T97973) (owner: Giuseppe Lavagetto)
[10:36:44] (CR) Alexandros Kosiaris: [C: -1] "Minor comment, LGTM otherwise" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/209739 (owner: KartikMistry)
[10:54:22] (CR) Hashar: [C: 2] Switch beta udp2log host to deployment-fluorine [tools/scap] - https://gerrit.wikimedia.org/r/209830 (https://phabricator.wikimedia.org/T98289) (owner: BryanDavis)
[10:54:42] (Merged) jenkins-bot: Switch beta udp2log host to deployment-fluorine [tools/scap] - https://gerrit.wikimedia.org/r/209830 (https://phabricator.wikimedia.org/T98289) (owner: BryanDavis)
[10:56:47] PROBLEM - puppet last run on mw2069 is CRITICAL puppet fail
[11:02:07] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds
[11:03:38] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60609 bytes in 0.652 second response time
[11:15:06] RECOVERY - puppet last run on mw2069 is OK Puppet is currently enabled, last run 5 seconds ago with 0 failures
[11:24:36] PROBLEM - High load average on labstore1001 is CRITICAL 87.50% of data above the critical threshold [24.0]
[11:27:48] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0]
[11:35:16] (PS1) Dereckson: Content namespaces configuration on he.wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/210021 (https://phabricator.wikimedia.org/T98709)
[11:38:30] (PS1) Hashar: labs_vmbuilder: copy all apt configuration [puppet] - https://gerrit.wikimedia.org/r/210024
[11:38:32] (PS1) Hashar: labs_vmbuilder: sort files/postinst.copy [puppet] - https://gerrit.wikimedia.org/r/210025
[11:40:31] (CR) Hashar: "We might want to do the same in labs_bootstrapvz but I haven't looked whether it uses recursive copy." [puppet] - https://gerrit.wikimedia.org/r/210024 (owner: Hashar)
[12:07:02] (PS4) KartikMistry: Install new Apertium packages for ContentTranslation [puppet] - https://gerrit.wikimedia.org/r/209739
[12:07:33] akosiaris: I could use more of your time with with this Nova routing issue. Other users seem to have it working properly which makes me think that something interesting is happening on our end.
[12:07:34] (CR) KartikMistry: Install new Apertium packages for ContentTranslation (1 comment) [puppet] - https://gerrit.wikimedia.org/r/209739 (owner: KartikMistry)
[12:07:43] (PS5) KartikMistry: Install new Apertium packages for ContentTranslation [puppet] - https://gerrit.wikimedia.org/r/209739
[12:07:58] Do you have time to work on this with me today or tomorrow? (I could potentially stay up late tonight to overlap your morning if that’s better)
[12:08:18] andrewbogott: I have started looking at it already
[12:08:33] oh, great!
[12:08:38] there is something interesting about those rules Antonio Messina is pointing out
[12:08:45] we do have them, but they are never reached
[12:09:11] there is not a single packet matching those rules
[12:09:18] be warned that I added one of his suggested rules by hand for a single IP.
[12:09:26] I noticed
[12:09:37] OK, just as long as that didn’t confuse you :)
[12:09:42] operations, Datasets-General-or-Unknown: snaphot1004 running dumps very slowly, investigate - https://phabricator.wikimedia.org/T98585#1275861 (ArielGlenn) The grief during deploys was due to swap due to the aforementioned memory leak. Except for April 23, don't know what that was. In the meantime: pleas...
[12:10:03] no, no worries... what does confuse me is there are two rules that match always. Lemme finish updating the ticket
[12:10:09] thanks
[12:10:49] I’m happy to dig in the source and track what’s happening there, if you’re far enough along to have questions about that.
[12:11:04] (I mean, I’m sure you can do that as well but you may be less patient with python :) )
[12:11:18] me? less patient with python ?
[12:11:25] it's my language of choice these days
[12:11:32] more like these years ...
[12:12:01] ok, great. I can never remember who’s pro- and who’s anti-
[12:12:07] Although most of the team seems to lean pro-
[12:12:42] it's perl, ruby and javascript the ones you are talking about :P
[12:14:21] (CR) Alexandros Kosiaris: [C: 2] Install new Apertium packages for ContentTranslation [puppet] - https://gerrit.wikimedia.org/r/209739 (owner: KartikMistry)
[12:21:06] kart_: apertium auto-reloaded ^
[12:21:21] kart_: wanna check that everything's fine ?
[12:26:39] akosiaris: waiting to update puppet in beta :)
[12:26:55] akosiaris: I can see language pairs in sca1001, that's good!
[12:27:13] kart_: cool!
[12:27:36] andrewbogott: so, I got something working but it's kind of weird
[12:28:08] akosiaris: you added rules by hand?
[12:28:32] andrewbogott: yes
[12:28:39] so what I got working for a while was
[12:28:59] bastion-restricted1 pinging util-abogott
[12:29:15] and util-abogott replying correctly
[12:29:35] meh, will reply on ticket
[12:29:35] I 'll anyway have to do it
[12:30:07] So the issue is that nova is added too many rules? Like, it’s blocking the traffic as well as enabling it?
[12:30:16] Maybe there’s a dmz setting we can add someplace.
[12:30:28] more like accepting way too early
[12:31:06] oh… that’s maybe harder to fix.
[12:31:19] Well, at least we can follow up on that email thread if it’s clear what’s happening.
[12:35:02] Does someone know where the nginx stats code lives? https://phabricator.wikimedia.org/T45647#1179843
[12:36:15] andrewbogott: https://phabricator.wikimedia.org/T96924#1275904
[12:36:50] andrewbogott: do not the -I POSTROUTING 1 which means my rule is the very very first rule in the POSTROUTING chain being evaluated
[12:37:51] Yeah, the ‘everyone needs a floating IP’ thing is annoying but might be OK.
[12:38:41] also, apart from the ordering thing, your rule was doomed because it specified the destination VM
[12:39:11] which i figured out after reading the commit message better
[12:39:32] and that was a pure Greek to English translation
[12:39:40] s/better/once more/
[12:39:59] (PS1) Hashar: labs: support injecting tenant in firstboot.sh [puppet] - https://gerrit.wikimedia.org/r/210032
[12:41:59] (PS1) KartikMistry: Beta: CX: Add missing 'tt' language [puppet] - https://gerrit.wikimedia.org/r/210034
[12:42:48] (CR) Hashar: "I am not a fan of having firstboot.env stick between runs. That is certainly error prone :/" [puppet] - https://gerrit.wikimedia.org/r/210032 (owner: Hashar)
[12:43:12] akosiaris: ^^
[12:44:26] (CR) Alexandros Kosiaris: [C: 2] "Let's see if this break anything. I 've failed to find a single reference to uuid-generator in multiple repos though." [puppet] - https://gerrit.wikimedia.org/r/209258 (owner: Alexandros Kosiaris)
[12:44:49] (CR) Alexandros Kosiaris: [C: 2] Beta: CX: Add missing 'tt' language [puppet] - https://gerrit.wikimedia.org/r/210034 (owner: KartikMistry)
[12:46:47] akosiaris: one more thing, beta still don't have new apertium packages.
[12:46:54] akosiaris: deployment-apertium01
[12:47:12] kart_: wait for puppet
[12:47:29] akosiaris: ok :)
[12:52:56] PROBLEM - High load average on labstore1001 is CRITICAL 55.56% of data above the critical threshold [24.0]
[12:54:28] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0]
[12:56:37] PROBLEM - puppet last run on virt1000 is CRITICAL Puppet has 1 failures
[12:58:47] PROBLEM - puppet last run on strontium is CRITICAL Puppet has 1 failures
[13:00:04] aude: Dear anthropoid, the time has come. Please deploy Wikidata (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150511T1300).
[13:00:27] PROBLEM - puppet last run on palladium is CRITICAL Puppet has 1 failures
[13:01:16] PROBLEM - High load average on labstore1001 is CRITICAL 50.00% of data above the critical threshold [24.0]
[13:01:35] ori: do you know if the lasbs project ‘quality-assurance’ is still in use?
[13:03:56] (PS1) Alexandros Kosiaris: puppetmaster: uuid-generator ensured absent [puppet] - https://gerrit.wikimedia.org/r/210038
[13:04:27] PROBLEM - High load average on labstore1001 is CRITICAL 55.56% of data above the critical threshold [24.0]
[13:05:42] (PS1) Dereckson: Add www.jacar.go.jp to wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/210039 (https://phabricator.wikimedia.org/T98733)
[13:05:44] (CR) Alexandros Kosiaris: [C: 2] puppetmaster: uuid-generator ensured absent [puppet] - https://gerrit.wikimedia.org/r/210038 (owner: Alexandros Kosiaris)
[13:09:37] http://accidentallyquadratic.tumblr.com/post/118629661042/puppet-apply
[13:10:44] (PS1) Aude: Enable arbitrary access to Wikibase items on nlwiki and frwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/210040 (https://phabricator.wikimedia.org/T98238)
[13:10:47] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0]
[13:11:47] RECOVERY - puppet last run on palladium is OK Puppet is currently enabled, last run 7 seconds ago with 0 failures
[13:12:48] RECOVERY - puppet last run on virt1000 is OK Puppet is currently enabled, last run 5 seconds ago with 0 failures
[13:14:57] RECOVERY - puppet last run on strontium is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:17:29] PROBLEM - puppet last run on oxygen is CRITICAL Puppet has 1 failures
[13:20:47] RECOVERY - puppet last run on oxygen is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:29:27] yurik: Re SWAT, it's generally better to cherry-pick to the extension's wmf branch than to update it to master.
[13:31:45] even after banging my head on git for a while now sometimes it still feels like http://git-man-page-generator.lokaltog.net
[13:33:19] anomie, agree, i will remove the graph's ext - please deploy your core patches
[13:33:29] anomie, i will do the depl separatelly
[13:33:35] to make sure everything works ok
[13:34:23] anomie, the reason for "to master" is because there was a number of formatting cleanup patches, thus no cherrypicking
[13:35:33] Lots of "formatting cleanup" getting in the way of a clean cherry-pick is unfortunate.
[13:36:55] true that :)
[13:37:12] anomie, updated the schedule and added myself for an hour after
[13:41:04] yurik: Also, please prepare the backports of the other changes you're requesting for SWAT.
[13:43:45] (CR) Paladox: Adding task support instead of using Bug: which was for bugzilla (1 comment) [puppet] - https://gerrit.wikimedia.org/r/209741 (owner: Paladox)
[13:43:51] (PS17) Paladox: Adding task support instead of using Bug: which was for bugzilla [puppet] - https://gerrit.wikimedia.org/r/209741
[13:44:27] godog: That page is... uncanny. The difference between it and actual git documentation is low enough that the top banner is really needed. :-)
[13:45:43] Coren: haha yes! a case of "it's funny because it's true" at its best
[13:46:35] !log aude Synchronized php-1.26wmf5/extensions/Wikidata: Fix interaction with AbuseFilter (duration: 00m 19s)
[13:46:41] Logged the message, Master
[13:49:11] !log aude Synchronized php-1.26wmf4/extensions/Wikidata: Fix interaction with AbuseFilter (duration: 00m 20s)
[13:49:14] Logged the message, Master
[13:50:01] operations, ops-eqiad, Patch-For-Review: humidity sensors in eqiad row c/d showing alarms - https://phabricator.wikimedia.org/T98721#1276032 (Cmjohnson) I am emailing equinix to get their environmental information. We may have our settings set to high. The temperatures are getting warmer and the humid...
[13:50:48] (CR) Paladox: Adding task support instead of using Bug: which was for bugzilla (1 comment) [puppet] - https://gerrit.wikimedia.org/r/209741 (owner: Paladox)
[13:51:00] (CR) Paladox: Adding task support instead of using Bug: which was for bugzilla [puppet] - https://gerrit.wikimedia.org/r/209741 (owner: Paladox)
[13:51:29] (CR) Aude: [C: 2] Enable arbitrary access to Wikibase items on nlwiki and frwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/210040 (https://phabricator.wikimedia.org/T98238) (owner: Aude)
[13:55:39] akosiaris: it should be updated now, right?
[13:55:45] akosiaris: puppet on beta.
[13:59:48] (Merged) jenkins-bot: Enable arbitrary access to Wikibase items on nlwiki and frwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/210040 (https://phabricator.wikimedia.org/T98238) (owner: Aude)
[14:01:21] !log aude Synchronized wmf-config/InitialiseSettings.php: Enable arbitrary Wikibase access for nlwiki and frwikisource (duration: 00m 16s)
[14:01:27] Logged the message, Master
[14:09:34] kart_: probably
[14:11:18] (PS2) Filippo Giunchedi: graphite: mirror traffic to codfw [puppet] - https://gerrit.wikimedia.org/r/208626 (https://phabricator.wikimedia.org/T85908)
[14:12:05] akosiaris: I can't see new packages installed.
[14:12:12] akosiaris: while puppet is updated.
[14:12:27] On beta.
[14:12:31] (CR) Filippo Giunchedi: [C: 2 V: 2] graphite: mirror traffic to codfw [puppet] - https://gerrit.wikimedia.org/r/208626 (https://phabricator.wikimedia.org/T85908) (owner: Filippo Giunchedi)
[14:14:15] kart_: puppet was updated on the puppetmaster but had not run on deployment-apertium01
[14:14:19] it now has,
[14:14:30] eventual consistency ftw
[14:18:15] Isn't it strange that Production is updated before Beta? :)
[14:19:25] (PS1) Aude: Add wb_changes_subscription and wbc_entity_usage tables [software] - https://gerrit.wikimedia.org/r/210057
[14:19:39] operations, Labs, hardware-requests: labnet1002 - https://phabricator.wikimedia.org/T98740#1276111 (Andrew) NEW
[14:19:44] kart_: you activated my spidey sense
[14:19:55] :)
[14:20:11] operations, Labs, hardware-requests: labnet1002 - https://phabricator.wikimedia.org/T98740#1276121 (Andrew)
[14:20:55] kart_: if it doesn't work on enwiki, it won't work anywhere else!
[14:20:55] duhh
[14:21:46] might as well test there first!
[14:23:05] and besides, production works all of the time 99% of the time
[14:23:05] ;D
[14:23:26] kart_: yes it is. it is explained by this https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/modules/puppetmaster/manifests/gitsync.pp;4f7617ede029172f27ef1bee069a24ca767e5b61$25
[14:23:32] greg-g: ^
[14:23:43] I am thinking we can lower this to around 10
[14:23:51] 1 hour is too much IMHO
[14:24:15] (CR) Paladox: "Hi how would I add the Task prefix like bugzilla Bug: prefix." [puppet] - https://gerrit.wikimedia.org/r/209741 (owner: Paladox)
[14:24:53] akosiaris: /me nods
[14:24:56] we can probably go way lower, like 1 but small steps
[14:24:57] (PS18) Paladox: Adding task support instead of using Bug: which was for bugzilla [puppet] - https://gerrit.wikimedia.org/r/209741
[14:24:59] (PS1) Aude: Enable Graph extension on beta wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/210058
[14:25:34] (PS2) Aude: Enable Graph extension on beta wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/210058 (https://bugzilla.wikimedia.org/97993)
[14:28:06] (CR) Aude: "graph is enabled now on test.wikidata (default enabled everywhere except wikidatawiki -- production and labs)" [mediawiki-config] - https://gerrit.wikimedia.org/r/208654 (https://phabricator.wikimedia.org/T97993) (owner: JanZerebecki)
[14:30:13] (PS1) Alexandros Kosiaris: beta: Update the puppetmaster more often [puppet] - https://gerrit.wikimedia.org/r/210059
[14:37:17] PROBLEM - puppet last run on oxygen is CRITICAL Puppet has 1 failures
[14:41:41] (PS3) Aude: Enable Graph extension on beta wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/210058 (https://phabricator.wikimedia.org/T97993)
[14:44:14] (PS1) Filippo Giunchedi: gdash: move graphite eqiad to its own directory [puppet] - https://gerrit.wikimedia.org/r/210061
[14:44:16] (PS1) Filippo Giunchedi: gdash: add graphite codfw [puppet] - https://gerrit.wikimedia.org/r/210062
[14:45:07] (CR) jenkins-bot: [V: -1] gdash: add graphite codfw [puppet] - https://gerrit.wikimedia.org/r/210062 (owner: Filippo Giunchedi)
[14:48:32] (Abandoned) JanZerebecki: Enable Graph extension on test.wikidata.org [mediawiki-config] - https://gerrit.wikimedia.org/r/208654 (https://phabricator.wikimedia.org/T97993) (owner: JanZerebecki)
[14:49:42] manybubbles, ^demon|busy, thcipriani, marktraceur: Who wants to SWAT this morning?
[14:49:53] <^demon|busy> so so so not it
[14:50:09] anomie: I don't see why I can't do it
[14:50:14] reading the calendar
[14:50:20] manybubbles: ok, it's yours!
[14:51:07] bd808, yurik, and aude: ping to make sure you are around for swat. please reply in the next 10 minutes or I'll complain
[14:51:16] silently, to myself.
[14:51:33] * aude here
[14:51:43] aude: thanks!
[14:51:55] it's a trivial patch
[14:52:06] bd808: this one depends on an outdated change. I'm going to try to rebase it on master
[14:52:10] aude: great!
[14:52:27] (PS2) Manybubbles: Send MediaWiki events for all wikis to Logstash [mediawiki-config] - https://gerrit.wikimedia.org/r/209172 (https://phabricator.wikimedia.org/T88732) (owner: BryanDavis)
[14:52:44] bd808: and that worked. yay
[14:52:51] :)
[14:53:35] bd808: you going to be able to verify no crashiest after you patch?
[14:53:56] yurik has bounced out.....
[14:54:04] if bad things happen they will show up in the hhvm.log on fluroine
[14:55:10] how do I not have a terminal open....
[14:55:28] <^demon|hellabusy> slacker!
[14:55:54] what is wrong with me
[15:00:05] manybubbles, anomie, ^d, thcipriani, bd808, yurik, aude, Dereckson: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150511T1500). Please do the needful.
[15:02:16] aude: first [15:02:18] (03CR) 10Manybubbles: [C: 032] Enable Graph extension on beta wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210058 (https://phabricator.wikimedia.org/T97993) (owner: 10Aude) [15:02:24] (03Merged) 10jenkins-bot: Enable Graph extension on beta wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210058 (https://phabricator.wikimedia.org/T97993) (owner: 10Aude) [15:02:29] (03PS1) 10devunt: Deploy Josa extension to production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210069 (https://phabricator.wikimedia.org/T15712) [15:03:04] manybubbles: thanks [15:03:13] i'll have to wait until it propagates to beta [15:03:34] !log manybubbles Synchronized wmf-config/InitialiseSettings-labs.php: SWAT: enable graph extension in beta. this should be a noop (duration: 00m 13s) [15:03:38] Logged the message, Master [15:04:37] bd808: your turn [15:04:45] (03CR) 10Manybubbles: [C: 032] Send MediaWiki events for all wikis to Logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209172 (https://phabricator.wikimedia.org/T88732) (owner: 10BryanDavis) [15:04:52] (03Merged) 10jenkins-bot: Send MediaWiki events for all wikis to Logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209172 (https://phabricator.wikimedia.org/T88732) (owner: 10BryanDavis) [15:05:17] RECOVERY - puppet last run on oxygen is OK Puppet is currently enabled, last run 15 seconds ago with 0 failures [15:05:52] !log manybubbles Synchronized wmf-config/InitialiseSettings.php: SWAT: send all mediawiki events from all wikis to logstash (duration: 00m 12s) [15:05:58] Logged the message, Master [15:06:17] bd808: I don't see any more logs than usualy in fluorine [15:06:20] log volume just shot up on the kibana dashboard \o/ [15:06:33] I imagine [15:06:36] so success [15:06:41] looks like it [15:06:46] I'll watch for a bit [15:06:55] yurik: I've cherry picked your patches to wmf5 and wmf4 [15:07:44] next time it'd be nice if you did that, 
merged them, and had the submodule update for the next swat. you know, you did all the hard work. It'll take a few minutes for jenkins to chew threw them [15:08:15] manybubbles, sorry, yes, i was planning on doing it, but than electricians showed up :( [15:08:24] ah! [15:08:31] no internet for half an hour!!! [15:08:33] well its moving along [15:08:35] or even longer : [15:08:35] my god! [15:08:36] :) [15:08:40] i know! [15:08:42] are you back now? [15:08:44] horrible [15:08:45] or just going [15:08:45] yep [15:08:48] back [15:10:25] 6operations, 6Labs, 10Labs-Infrastructure: Migrate Labs NFS storage from RAID6 to RAID10 - https://phabricator.wikimedia.org/T96063#1276242 (10coren) The plan is to gradually evacuate the pv on individual raid arrays with pvmove, reconfigure the freed raid arrays with raid 10, and recreate pv on the new arra... [15:12:12] Hi. [15:13:56] yurik: I'm merging all the cherry picks now. once wmf5 is merged I'll ping you again and deploy it. if that goes well I'll deploy wmf4. If not I'll revert it. [15:14:11] manybubbles, thanks! [15:14:39] andrewbogott: I'm still getting that stupid session data error on wikitech :/ [15:14:42] (03PS1) 10Yurik: LABS: Enable wgGraphImgServiceAlways, cleanup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210073 [15:14:43] manybubbles, could you also push this ^ -- its for my labs testing [15:14:54] andrewbogott: I just brute force my saves (keep hitting save) and it works, but, just FYI [15:15:09] manybubbles, i will add it to the deploy manifest page [15:15:11] yurik: if you stick it on wikitech [15:15:12] andrewbogott: same issue than greg-g [15:15:13] thanks [15:16:10] aude: do you know about https://gerrit.wikimedia.org/r/#/c/210073 from yurik? I just merged a patch for you with graphs, right? 
[15:16:14] (03CR) 10Manybubbles: [C: 032] LABS: Enable wgGraphImgServiceAlways, cleanup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210073 (owner: 10Yurik) [15:17:27] manybubbles: i don't know [15:17:28] 6operations, 10Incident-20150205-SiteOutage, 10MediaWiki-Debug-Logging, 6Reading-Infrastructure-Team, and 2 others: Decouple logging infrastructure failures from MediaWiki logging - https://phabricator.wikimedia.org/T88732#1276254 (10bd808) 5stalled>3Resolved All wikis are logging to Logstash again usi... [15:17:29] manybubbles, added * {{gerrit|210073}} LABS: enables wgGraphImgServiceAlways [15:17:51] aude and yurik, talk about latest patch? [15:18:04] manybubbles, it shouldn't affect wikidata - its off there [15:18:15] k [15:19:23] yurik: http://wikidata.beta.wmflabs.org/wiki/Wikidata:Sandbox :) [15:19:28] seems to work ok [15:19:38] aude, yei! [15:19:51] aude, you want me to enable it in the next hour? I have a window [15:19:55] up to you :) [15:19:57] yurik: on wednesday [15:20:00] oki :) [15:20:42] aude, http://graphoid-beta.wmflabs.org/wikidata.beta.wmflabs.org/v1/png/Wikidata%3ASandbox/135237/1533aaad45c733dcc7e07614b54cbae4119a6747.png [15:20:53] that's your graph rendered by the backend service [15:20:56] PROBLEM - puppet last run on mw1135 is CRITICAL Puppet has 1 failures [15:21:54] yurik: I just realized that I should scap with all those new message parameters [15:21:59] all the i18n [15:22:08] oh well ) [15:22:45] * yurik doesn't like scap... scarry [15:23:15] yurik: meh. scap's never failed me. 
[15:23:29] anyway, I'll just wait until wmf5 and wmf4 are merged for your changes and let scap take them [15:23:29] * yurik thinks there is a first time for everything [15:23:36] indeed [15:23:45] manybubbles, sure [15:24:02] if it fails, we can always revert the code while leaving msgs intact [15:24:16] (03CR) 10Manybubbles: [V: 032] LABS: Enable wgGraphImgServiceAlways, cleanup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210073 (owner: 10Yurik) [15:24:30] manybubbles, could you push the settings change first? [15:24:37] its a noop for prod [15:24:39] yes - now [15:24:42] thx [15:25:31] !log manybubbles Synchronized wmf-config/InitialiseSettings-labs.php: SWAT cleanup wgGraphImgServiceAlways 1/3 (duration: 00m 12s) [15:25:34] Logged the message, Master [15:25:51] !log manybubbles Synchronized wmf-config/InitialiseSettings.php: SWAT cleanup wgGraphImgServiceAlways 2/3 (duration: 00m 12s) [15:26:02] Logged the message, Master [15:26:14] !log manybubbles Synchronized wmf-config/CommonSettings.php: SWAT cleanup wgGraphImgServiceAlways 3/3 (duration: 00m 12s) [15:26:17] Logged the message, Master [15:26:29] yurik: and your config change is now live - please verify [15:26:43] *nod* [15:27:36] manybubbles: would it be possible you do mine after yurik ones? I need to join friends into town for eating in 30 minutes. [15:28:06] Dereckson: oh! you added those after I loaded the page. [15:28:25] sorry for the last minutes patches [15:28:30] reading [15:28:49] ori? [15:28:56] Steinsplitter: hey [15:29:32] ori: hi, you forgot to notify the user. i crated the page and deleted it to have a logentry. just fyi ;) [15:30:02] manybubbles, i wonder how long it takes for the labs to pick up a setting [15:30:17] Steinsplitter: thank you [15:31:08] yurik: 10 minutes I think is how frequent the deploys are but I could be wrong [15:32:09] ok! 
yurik and Dereckson it looks like I'm going to have to stop swating for now - see security channel for more [15:33:35] !log stopping SWAT due to some incident that just picked up. Right now Ib990f00ebe974008cea4dccbaa212ec20c846674 and Ida3fd5f8808202892001f66c4a534c1725e769a6 are merged awaiting a scap. [15:33:40] Logged the message, Master [15:36:57] RECOVERY - puppet last run on mw1135 is OK Puppet is currently enabled, last run 38 seconds ago with 0 failures [15:38:50] AaronSchulz: ping! [15:39:55] AaronSchulz: https://gerrit.wikimedia.org/r/#/q/If892d77077607ffcaba0510355175a1e4d780ae9,n,z looks like it's waiting to merge into wmf4/5, and we're seeing the djvu load issues recurring at present [15:40:04] manybubbles, config change looks great on labs, thx [15:40:05] AaronSchulz: is this stuff good to go? can we merge it? will it help? [15:40:36] bblack, yes, it's fine [15:41:32] AaronSchulz: I see a related https://gerrit.wikimedia.org/r/#/c/209983/1 as well, with a -1 from jenkins which looks trivial (the output changed on things that that patch probably intentionally changes the output on) [15:41:54] bblack and AaronSchulz: I'm going to merge https://gerrit.wikimedia.org/r/#/q/If892d77077607ffcaba0510355175a1e4d780ae9,n,z and swat it then [15:42:05] manybubbles: sounds good to me, thanks! [15:42:49] Dereckson: it looks like I'm not going ot have time for your patches in this window [15:42:52] sorry! 
[15:42:57] on 209983, of course I have no idea, maybe the jenkins fail is real :) [15:43:49] np [15:44:23] bblack, probably real [15:44:47] doh [15:46:22] (03CR) 10Alex Monk: Enable Graph extension on beta wikidata (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210058 (https://phabricator.wikimedia.org/T97993) (owner: 10Aude) [15:47:33] 6operations, 6Labs, 10Labs-Infrastructure: Migrate Labs NFS storage from RAID6 to RAID10 - https://phabricator.wikimedia.org/T96063#1276362 (10coren) An alternative plan, based on input from @mark, that front loads the thin pool move to give performance improvement earlier. With a bit of extra juggling (bec... [15:48:32] 6operations, 7Graphite, 5Patch-For-Review: revisit what percentiles are calculated by statsite - https://phabricator.wikimedia.org/T88662#1276365 (10fgiunchedi) [15:48:36] 6operations, 7Graphite, 5Patch-For-Review: test sending varnishkafka and swift statsd traffic directly - https://phabricator.wikimedia.org/T95687#1276363 (10fgiunchedi) 5Open>3Resolved change merged, resolving [15:48:45] (03CR) 10Hoo man: Enable Graph extension on beta wikidata (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210058 (https://phabricator.wikimedia.org/T97993) (owner: 10Aude) [15:52:45] !log manybubbles Synchronized php-1.26wmf5/includes/media/DjVu.php: SWAT: 10 mb djvu files are expensive to thumbnail (wmf5) (duration: 00m 11s) [15:52:49] Logged the message, Master [15:53:17] 6operations, 10MediaWiki-DjVu, 10MediaWiki-General-or-Unknown, 6Multimedia, and 3 others: img_metadata queries for Djvu files regularly saturate s4 slaves - https://phabricator.wikimedia.org/T96360#1276379 (10Anomie) >>! In T96360#1236196, @GWicke wrote: > It is not clear to me why the xml is loaded at all... 
[15:53:51] !log manybubbles Synchronized php-1.26wmf4/includes/media/DjVu.php: SWAT: 10 mb djvu files are expensive to thumbnail (wmf4) (duration: 00m 13s) [15:53:52] bblack and AaronSchulz: ^^^^^^ [15:53:55] Logged the message, Master [15:54:09] manybubbles: many thanks [15:54:22] someone know ooui? [15:54:29] seeing some annoying exceptions from it scrolling by [15:57:04] np. [15:57:21] sooooo bblack do we consider this resolved enough for us to SWAT yurik's swat changes? [15:57:26] PROBLEM - puppet last run on oxygen is CRITICAL Puppet has 1 failures [15:57:36] also, yurik your window is coming and I had to not do the scap [15:57:57] so either I roll back or we scap or consider ourselves in a funky state [15:57:57] manybubbles: too early to say, need another few minutes [15:58:05] ori: k. [15:58:08] manybubbles, go ahead and scap before me [15:58:16] manybubbles, actually, hmm [15:58:23] * ori is watching http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&c=MySQL+eqiad&m=cpu_report&s=by+name&mc=2&g=network_report [15:58:25] you want to push my graph master? [15:58:52] nah, manybubbles, too complicated, go ahead and scap without the Graph ext master, only with the core changes [15:59:10] i will push my changes sep [15:59:37] !log waiting a few minutes after that last set of patches before we're sure that the load is down and then, hopefully, we'll scap to get the core changes that are already merged and sitting on tin that we had to ignore while we handled the trafic spike. [15:59:42] Logged the message, Master [15:59:43] (03CR) 10BBlack: [C: 031] Add vlogdump to varnish module [puppet] - 10https://gerrit.wikimedia.org/r/209999 (owner: 10Ori.livneh) [15:59:59] yurik: ^ see note from ori re waiting a few minutes [16:00:05] yurik: Respected human, time to deploy Graph extension (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150511T1600). Please do the needful. 
[16:00:13] greg-g, i'm not deploing until manybubbles is done scaping [16:00:17] and even then [16:00:27] (03CR) 10Ori.livneh: [C: 032 V: 032] Add vlogdump to varnish module [puppet] - 10https://gerrit.wikimedia.org/r/209999 (owner: 10Ori.livneh) [16:00:36] yurik: gotcha, /me is only partially reading backscroll [16:00:39] ori: would fewer djvu huge djvu thumbnails help the mysql network stuff [16:01:11] greg-g: yeah - we had a thing. tin is still in a funky state. its ready for a scap that we didn't do [16:01:27] manybubbles: yep [16:01:30] stupid things [16:02:11] manybubbles: we think the djvu issue is the primary cause of the current mysql load issues, yes [16:02:13] manybubbles: the query storm is things like Query SELECT /* ForeignDBFile::loadExtraFromDB 66.249.69.221 */ img_metadata FROM `image` WHERE img_name = 'Revue_des_Deux_Mondes_-_1885_-_tome_69.djvu' AND img_timestamp = '20100802065150' LIMIT 1 [16:02:20] just don't know if that patch is enough to fix it all [16:02:43] Revue_des_Deux_Mondes is one of the classic djvu + googlebot things that has come up repeatedly for months now at times [16:02:54] ori: ah. and its a problem because building the thumbnails is so slow. if it was fast it wouldn't storm. [16:03:04] bblack: nah i just grepped my irc log for ForeignDBFile and found this particular query that you pasted [16:03:13] oh heh [16:03:48] I wouldn't call that graph "going down" [16:03:58] marktraceur: someone know ooui? seeing some annoying exceptions from it scrolling by [16:04:11] if you tilt your head a little, about 180 degrees, it kinda looks downish [16:04:53] manybubbles: What exceptions? [16:05:16] marktraceur: exception 'OOUI\Exception' with message 'Potentially unsafe 'href' attribute value. Scheme: ''; value: '/wiki/%D9%85%D9%84%D8%AD%D9%82:1675'.' in /srv/mediawiki/php-1.26wmf4/vendor/oojs/oojs-ui/php/Tag.php:317 [16:05:30] Hm. [16:05:49] That looks safe-ish to me? 
[16:06:08] I think matmarex told me there was a patch in progress for that error last week [16:06:10] Maybe OOUI is angry about the Unicode stuff [16:06:12] Oh, good. [16:06:16] marktraceur: dunno what its complaining about for sure. [16:06:33] https://phabricator.wikimedia.org/T94900 [16:06:49] ori: I'm seeing User::loadFromDatabase come up in the slow queries [16:07:07] 6operations, 10ops-codfw: ms-be2007.codfw.wmnet: slot=4 dev=sde failed - https://phabricator.wikimedia.org/T98726#1276403 (10Papaul) I will have the drive on site by tomorrow. R# 910990684 / DPS# 311633667 / Failed HDD slot 4 [16:07:12] bd808 and marktraceur: yeah - that one [16:09:04] (03PS1) 10Aude: Add dblist for wikidatausagetracking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210080 [16:09:58] (03CR) 10Manybubbles: [C: 031] Add dblist for wikidatausagetracking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210080 (owner: 10Aude) [16:10:05] cache misses on djvu thumbs don't seem to be coming in at all that high a rate, but maybe they don't have to if they're so expensive... [16:10:25] some are real browsers, some are bingbot/msnbot/googlebot [16:10:25] (03PS1) 10Aude: Add wbc_entity_usage table to xml dumps [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/210081 (https://phabricator.wikimedia.org/T98743) [16:10:41] e.g. [16:10:41] 19 RxURL c /wikipedia/commons/thumb/e/e9/Recenseamento_do_Brazil_(1920)_-_02.djvu/page372-508px-Recenseamento_do_Brazil_(1920)_-_02.djvu.jpg [16:10:44] 19 RxHeader c User-Agent: Googlebot-Image/1.0 [16:10:46] :) [16:12:44] (03CR) 10Aude: "untested but think this will work." 
[dumps] (ariel) - 10https://gerrit.wikimedia.org/r/210081 (https://phabricator.wikimedia.org/T98743) (owner: 10Aude) [16:13:10] (03CR) 10Aude: [C: 04-1] "-1 until https://gerrit.wikimedia.org/r/#/c/210080/ is merged" [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/210081 (https://phabricator.wikimedia.org/T98743) (owner: 10Aude) [16:15:37] manybubbles, I just fixed https://gerrit.wikimedia.org/r/#/c/209983/ [16:15:58] * AaronSchulz was testing with Alice in Wonderland.djvu in vagrant [16:16:19] heh, also it was missing the cache set() too ;) [16:17:35] gilles, I guess we will want https://gerrit.wikimedia.org/r/#/c/209982/ merged before enwiki btw [16:17:55] AaronSchulz: yes I'm looking at that one right now [16:18:21] (03PS2) 10Aude: Add wb_changes_subscription and wbc_entity_usage tables [software] - 10https://gerrit.wikimedia.org/r/210057 (https://phabricator.wikimedia.org/T98748) [16:19:45] gilles, did Cenarium email you too? [16:20:06] AaronSchulz: nope [16:20:28] I saw his comment ont he changeset [16:20:31] * AaronSchulz forwards [16:20:36] the previous one, that is [16:21:09] oh! [16:21:17] * manybubbles thinks http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&c=MySQL+eqiad&m=cpu_report&s=by+name&mc=2&g=network_report is better now [16:21:31] ori: ^^ [16:21:32] ! [16:22:02] its not back to where it was two days ago but its better [16:22:26] oh, nice graph moves :) [16:22:32] hmmm what happened? [16:22:37] indeed [16:22:44] is that just delayed reaction to the merge earlier? [16:23:13] (would we expect such a dramatic change from isExpensiveToThumbnail?) 
[16:23:52] AaronSchulz: thanks, his comment on the previous changeset had more or less the same content [16:24:52] the sync was ~15:53, the graph dropoff hit the stats at ~16:14 [16:25:12] 20 minutes to have effect down at that layer might be reasonable, I have no idea [16:28:00] bblack: I dunno - I expect its more likely that whatever was causing the spike at the time is dropping off [16:28:11] I'd love to be able to do yurik's scap soon though [16:28:30] if I can't I'm just going to revert the staged changes so I don't have to be in "deploy" mode for two hours [16:29:52] well the previous spike started out at the same time-of-day and with a similar pattern/magnitude [16:30:04] but didn't come to a sharp drop off until ~19:00 [16:30:14] manybubbles: I'd say go ahead [16:30:46] <_joe_> hey, I was completely immersed in my work, but did anyone check for memcached errors? [16:30:57] <_joe_> that's a likely cause of increased mysql activity [16:32:10] _joe_: will look [16:32:21] ok then, yurik I'm going to run your scap [16:33:12] !log manybubbles Started scap: SWAT js config vargs changes [16:33:15] Logged the message, Master [16:34:05] (03CR) 10coren: "For (2), it's not immediately clear to me whether it's reasonable or not to support applying tc rules to more than one interface in a mani" [puppet] - 10https://gerrit.wikimedia.org/r/209558 (owner: 10coren) [16:34:46] RECOVERY - puppet last run on oxygen is OK Puppet is currently enabled, last run 55 seconds ago with 0 failures [16:34:51] is there a way to filter the ganglia graph for specific databases? (i assume not) [16:35:08] _joe_: 28891 memcached errors so far today. 
not many, really [16:35:18] <_joe_> nope [16:35:28] <_joe_> manybubbles: worth checking anyways, I think [16:35:34] yeah, thanks [16:36:23] i'm sure our stuff is totally unrelated, but want to keep an eye on what impact / if any our [16:36:33] usage tracking and arbitrary access features have [16:36:50] suppose i should send mail to ops about it [16:42:37] 6operations, 10ops-codfw, 5Patch-For-Review: Set up missing PDUs in codfw and eqiad - https://phabricator.wikimedia.org/T84416#1276504 (10Papaul) @ Faidon yes you are right all the rows have at least network equipment connected to the PDU's. But for the configuration Chris is the right person to answer that... [16:46:58] (03CR) 10Giuseppe Lavagetto: etcd: create puppet module (0313 comments) [puppet] - 10https://gerrit.wikimedia.org/r/208928 (https://phabricator.wikimedia.org/T97973) (owner: 10Giuseppe Lavagetto) [16:47:29] (03PS15) 10Giuseppe Lavagetto: etcd: create puppet module [puppet] - 10https://gerrit.wikimedia.org/r/208928 (https://phabricator.wikimedia.org/T97973) [16:51:27] (03PS1) 10BryanDavis: logstash: Exclude jobrunner debug messages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210086 (https://phabricator.wikimedia.org/T87521) [16:52:15] yurik: scap is _still_ going [16:52:27] manybubbles: is it stuck anywhere? [16:52:31] what's the status line? [16:52:36] greg-g: we just started late - still going [16:52:40] sync-common [16:52:50] just 1 left or? [16:53:09] there's a full batch at least. 80 ssh procs open on tin [16:53:16] cool [16:53:29] nvm then (you're safe for now, snapshot1004.......) [16:53:50] yeah - its just taking a while because I started late [16:53:54] so it feels like a long time [16:56:40] manybubbles: if it gets down to 1 or 2 and seems stuck yell out. 
We've been having battles with snapshot1004 getting stuck [16:57:11] it looks really healthy right now though [16:59:06] cool [16:59:25] greg-g and yurik: I suspect yurik will want an increase on his window because we ran over from the djvu issues [17:00:07] yurik: take a bit longer, things are open until 1pm pacific [17:00:18] greg-g, i was'nt doing anything [17:00:26] I mean, feel free to go over [17:00:26] will have to reschedule [17:00:29] to later today [17:00:30] ok [17:00:54] besides, I found another issue on betalabs that thedj was looking at [17:01:10] !log manybubbles scap aborted: SWAT js config vargs changes (duration: 27m 58s) [17:01:15] Logged the message, Master [17:01:23] wait, what? [17:01:30] !log manybubbles Started scap: SWAT js config vargs changes [17:01:37] damn it - I was _copying_ [17:01:56] starting over.... [17:02:00] you forgot the shift key [17:02:33] such a pain [17:02:37] well, it'll go [17:02:42] probably be faster this time [17:06:24] <_joe_> manybubbles: can I help you? with scap issues? [17:06:42] _joe_: I just ctrl-c-ed it because I'm stupid. I dunno about beta [17:06:44] wmflabs beta is dead [17:06:44] http://en.wikipedia.beta.wmflabs.org/wiki/Main_Page [17:07:02] how? [17:07:10] <_joe_> yurik: it's not [17:07:16] oh [17:07:16] yurik: its so alive [17:07:18] well it's disfigured [17:07:18] try purging [17:07:36] <_joe_> manybubbles: I don't care about beta, is prod allright? [17:07:36] * aude goes to look [17:07:36] it looks fine to me [17:07:46] oh, it's dead after purge heh [17:07:47] <_joe_> manybubbles: which hosts failed scap? [17:07:48] _joe_: yes. so far as I know. I'm checking [17:07:51] Request: GET http://en.wikipedia.beta.wmflabs.org/wiki/Main_Page, from 127.0.0.1 via deployment-cache-text02 deployment-cache-text02 ([127.0.0.1]:3128), Varnish XID 1174402571 [17:07:55] .... 
[17:08:27] aude: there it goes - just died for me [17:08:29] prod is fine [17:08:43] * aude hopes it's unrelated to my patch [17:08:47] probably unrelated [17:09:04] ok - beta is flapping [17:09:06] logs time! [17:09:15] <_joe_> manybubbles: I'll take a look [17:09:45] re: beta, greg-g / thcipriani already looked i think [17:09:53] eh, not re that [17:10:02] maybe? [17:10:03] https://phabricator.wikimedia.org/T98754 [17:10:22] <_joe_> #012Fatal error: Object does not implement ArrayAccess in /srv/mediawiki/php-master/includes/filerepo/file/LocalFile.php on line 258 [17:10:27] <_joe_> this is what I see in beta [17:10:36] oh [17:10:48] _joe_: yeah, I pinged AaronSchulz about it [17:10:49] <_joe_> and yes, that phab ticket [17:11:02] <_joe_> ok [17:14:04] ok - so beta looks pretty not crashy now [17:15:02] well it's intermittent [17:15:15] I'm seeing at the varnish level, the backend fetch sometimes ending up with: [17:15:18] 13 FetchError c Junk after gzip data [17:15:39] which results in a 503 [17:16:22] bd808: stuck scap maybe? [17:16:25] oh, nope [17:16:26] !log manybubbles Finished scap: SWAT js config vargs changes (duration: 14m 55s) [17:16:26] unstuck [17:16:32] all done! [17:16:33] Logged the message, Master [17:16:40] yurik: verify your SWAT patches..... [17:16:52] manybubbles, scap is done? [17:16:58] yes, finally [17:17:07] varnishlog from 503 I induced by cachebusting with http://en.wikipedia.beta.wmflabs.org/wiki/Main_Page?x=y : https://phabricator.wikimedia.org/P633 [17:17:22] 6operations, 7HHVM: Custom session handler corrupted by session_destroy, "Failed to initialize storage module" - https://phabricator.wikimedia.org/T97675#1276672 (10bd808) @joe is this upstream patch by any chance in the latest HHVM builds you have been testing? 
[17:18:52] ugh, beta varnishes are out of date on package version [17:19:08] 6operations, 7HHVM: Custom session handler corrupted by session_destroy, "Failed to initialize storage module" - https://phabricator.wikimedia.org/T97675#1276684 (10Joe) Nope of course, but I guess we can add it to our next build. How serious is this? [17:19:08] that's why. we deployed some gzip changes a while back, and we're not running the version with related gzip fixes [17:19:41] <_joe_> bd808: how serious is that? where did you see that error? [17:19:55] which is at least partly because they're still on precise, and we don't build varnish packages for precise anymore... [17:20:25] _joe_: anomie ran into it testing new auth system changes. [17:20:44] _joe_: "17:19 < AaronSchu> greg-g, fix incoming" [17:21:05] greg-g, what was the bug # again ? [17:21:06] _joe_: so it's not a problem today but is a problem with the AuthManager rewrite [17:21:19] <_joe_> bd808: to be more clear - I don't have any time to build a new package right now, so if it's a blocker, please put an end date on this :P [17:21:32] greg-g, nvm [17:22:23] thanks to thcipriani for noticing it first :) [17:23:37] _joe_: anomie and I will try to come up with a reasonable timeline. Not sooner than the end of the quarter pretty much for sure [17:23:48] <_joe_> ohhh ok [17:23:53] <_joe_> that may work [17:23:56] 6operations, 10Beta-Cluster, 10Deployment-Systems, 10Traffic: Upgrade beta-cluster caches to jessie - https://phabricator.wikimedia.org/T98758#1276698 (10BBlack) 3NEW [17:24:05] <_joe_> bd808: did you spot it locally with mediawiki-vagrant? [17:24:15] <_joe_> which version of hhvm are you running btw? 
[17:24:27] <_joe_> ah nevermind, I'll ask on the ticket [17:24:35] perfect [17:24:44] _joe_: It happens on mw1017 with /home/anomie/test2.php [17:24:58] 6operations, 7HHVM: Custom session handler corrupted by session_destroy, "Failed to initialize storage module" - https://phabricator.wikimedia.org/T97675#1276708 (10Joe) @bd808 @anomie where did you see this error? which version of hhvm? [17:25:12] <_joe_> anomie: would mind to try it on mw1050? [17:25:27] In a few minutes, sure [17:25:30] <_joe_> (different HHVMs today) [17:25:47] <_joe_> mw1017 has 3.6, mw1050 has 3.3 [17:26:15] greg-g, if noone else is deploying now, i could go now [17:27:06] yurik: sure [17:27:15] greg-g, ok, reclaiming the next depl spot [17:27:47] * manybubbles has logged off of tin. walking away from the keyboard for a bit. [17:27:55] tin is too exciting [17:28:16] it's a great MUD, isn't it? [17:28:53] * yurik is deploying graph ext... fun times ahead [17:33:02] greg-g: lol. I so loved MUDs [17:33:16] greg-g: and its got the pig! [17:33:32] 6operations, 10Beta-Cluster, 10Deployment-Systems, 10Traffic: Upgrade beta-cluster caches to jessie - https://phabricator.wikimedia.org/T98758#1276736 (10BBlack) Paste of induced 503s related to gzip: https://phabricator.wikimedia.org/P633 , where the fetching fails with: ``` 13 FetchError c Junk aft... [17:36:15] _joe_: https://gerrit.wikimedia.org/r/#/c/209995/ [17:37:38] <_joe_> ori: I would wait tomorrow to release to the whole cluster, btw [17:38:16] _joe_: let's wait for tomorrow for the package upgrade, but we can do the jit cache size thing now -- doubling it on 3.3 should not cause any harm, and it will make the package upgrade easier to do. [17:38:34] <_joe_> yeah I was about to say that [17:38:51] (03CR) 10Giuseppe Lavagetto: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/209995 (owner: 10Ori.livneh) [17:39:12] cool. 
i'll actually amend the patch to remove the override in hieradata for the canaries, since it is made redundant by this patch [17:39:44] <_joe_> yep [17:41:41] git gurus: if extension's wmf/nnn has all sorts cherrypicked patches, and it would be too compbersome to bring it in line with master, what's the easiest way to set wmf/nnn to the current master of that extension? [17:41:50] (03PS3) 10Ori.livneh: base_jit_size: 100 Mb -> 200 Mb [puppet] - 10https://gerrit.wikimedia.org/r/209995 [17:42:15] yurik: delete the branch in gerrit and recreate it [17:43:03] (03CR) 10Ori.livneh: [C: 032] base_jit_size: 100 Mb -> 200 Mb [puppet] - 10https://gerrit.wikimedia.org/r/209995 (owner: 10Ori.livneh) [17:44:21] thx ori [17:49:31] (03PS3) 10Gage: add deployer admin groups to codfw deploy server [puppet] - 10https://gerrit.wikimedia.org/r/209843 (https://phabricator.wikimedia.org/T95436) (owner: 10Dzahn) [17:54:29] (03CR) 10Gage: [C: 032] add deployer admin groups to codfw deploy server [puppet] - 10https://gerrit.wikimedia.org/r/209843 (https://phabricator.wikimedia.org/T95436) (owner: 10Dzahn) [17:55:19] /awa/away [17:56:47] legoktm, https://gerrit.wikimedia.org/r/#/c/209852/3 [17:57:46] AaronSchulz: it looks sane, but I don't feel comfortable merging that :/ [18:02:24] _joe_: Issue is present on mw1050 as well. [18:02:34] <_joe_> anomie: ok thanks [18:04:32] (03CR) 10Phuedx: "Removing the -2 now." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/209242 (https://phabricator.wikimedia.org/T95446) (owner: 10Phuedx) [18:13:56] PROBLEM - Disk space on stat1002 is CRITICAL: DISK CRITICAL - free space: /tmp 1914 MB (1% inode=97%) [18:14:36] !log yurik Synchronized php-1.26wmf5/extensions/Graph: Bump Graph to master (duration: 00m 14s) [18:14:40] Logged the message, Master [18:15:40] !log yurik Synchronized php-1.26wmf4/extensions/Graph: Bump Graph to master (duration: 00m 11s) [18:15:44] Logged the message, Master [18:17:07] PROBLEM - puppet last run on oxygen is CRITICAL Puppet has 1 failures [18:19:10] (03PS1) 10Merlijn van Deen: tools: make sure pyflakes is only included once [puppet] - 10https://gerrit.wikimedia.org/r/210107 [18:21:30] yuvipanda / Coren ^ [18:26:41] (03CR) 10coren: [C: 032] tools: make sure pyflakes is only included once [puppet] - 10https://gerrit.wikimedia.org/r/210107 (owner: 10Merlijn van Deen) [18:26:41] RECOVERY - Disk space on stat1002 is OK: DISK OK [18:26:44] (03PS1) 10Ori.livneh: add 'dist' to my (=ori) dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/210108 [18:26:44] (03CR) 10Ori.livneh: [C: 032 V: 032] add 'dist' to my (=ori) dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/210108 (owner: 10Ori.livneh) [18:29:38] (03PS1) 10Ori.livneh: Update location of vlogdump and add comment [puppet] - 10https://gerrit.wikimedia.org/r/210110 [18:29:52] (03CR) 10Ori.livneh: [C: 032 V: 032] Update location of vlogdump and add comment [puppet] - 10https://gerrit.wikimedia.org/r/210110 (owner: 10Ori.livneh) [18:30:36] (03PS1) 10Yurik: Cleaned Graph, enabled wmgGraphImgServiceAlways [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210111 [18:31:18] (03PS2) 10Ori.livneh: Revert "Set dedicated SUL rename runner loop" [puppet] - 10https://gerrit.wikimedia.org/r/207982 (owner: 10Aaron Schulz) [18:31:27] (03CR) 10Ori.livneh: [C: 032 V: 032] Revert "Set dedicated SUL rename runner loop" [puppet] - 10https://gerrit.wikimedia.org/r/207982 
(owner: 10Aaron Schulz) [18:31:34] need a second pair of eyes for https://gerrit.wikimedia.org/r/210111 [18:33:32] ori, could you take a look pls https://gerrit.wikimedia.org/r/#/c/210111 [18:34:20] 6operations, 7HHVM: Custom session handler corrupted by session_destroy, "Failed to initialize storage module" - https://phabricator.wikimedia.org/T97675#1276977 (10Anomie) Already discussed on IRC, but for posterity: Error occurs on both mw1017 (3.6.1) and mw1050 (3.3.1) using either the test script in the up... [18:35:07] RECOVERY - puppet last run on oxygen is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [18:35:58] (03CR) 10Yurik: [C: 032] Cleaned Graph, enabled wmgGraphImgServiceAlways [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210111 (owner: 10Yurik) [18:36:04] (03Merged) 10jenkins-bot: Cleaned Graph, enabled wmgGraphImgServiceAlways [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210111 (owner: 10Yurik) [18:36:35] yurik: sorry, don't have context [18:36:49] ori, just overall if you see anything funky [18:36:56] it should be fairly simple change [18:37:16] ori, moving a setting from commonsettings->initialisesetting [18:38:05] (03CR) 10Aaron Schulz: [C: 031] Increase jobrunner::runners_basic [puppet] - 10https://gerrit.wikimedia.org/r/209719 (https://phabricator.wikimedia.org/T98621) (owner: 10Nemo bis) [18:38:16] RECOVERY - Disk space on stat1002 is OK: DISK OK [18:38:21] ori, someone should look at https://gerrit.wikimedia.org/r/#/c/209719/ [18:38:41] ori, its ok, i will go ahead with it, nothing major there i hope [18:39:30] AaronSchulz: LGTM. If I merge, can you keep an eye on the job runners in Ganglia? 
[18:40:14] yes [18:40:17] thanks [18:40:26] (03PS6) 10Ori.livneh: Increase jobrunner::runners_basic [puppet] - 10https://gerrit.wikimedia.org/r/209719 (https://phabricator.wikimedia.org/T98621) (owner: 10Nemo bis) [18:40:35] (03PS3) 10coren: Creaet tc class analogous to ferm for traffic control [puppet] - 10https://gerrit.wikimedia.org/r/209558 [18:40:37] (03CR) 10Ori.livneh: [C: 032 V: 032] Increase jobrunner::runners_basic [puppet] - 10https://gerrit.wikimedia.org/r/209719 (https://phabricator.wikimedia.org/T98621) (owner: 10Nemo bis) [18:41:13] !log yurik Synchronized wmf-config: patch 210111 - Cleaned Graph, enabled wmgGraphImgServiceAlways (duration: 00m 13s) [18:41:19] Logged the message, Master [18:41:25] !log Deployed I4e3f42ea7, which increases jobrunner::runners_basic from 14 -> 20 [18:41:28] Logged the message, Master [18:41:39] (03CR) 10JanZerebecki: [C: 031] Add wbc_entity_usage table to xml dumps [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/210081 (https://phabricator.wikimedia.org/T98743) (owner: 10Aude) [18:41:41] Nemo_bis: thanks for that [18:42:39] gilles, hoping to get https://gerrit.wikimedia.org/r/#/c/209852/3 into swat today [18:42:50] (03CR) 10JanZerebecki: [C: 031] Add wb_changes_subscription table to xml dumps [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/210072 (https://phabricator.wikimedia.org/T98742) (owner: 10Aude) [18:43:16] PROBLEM - Disk space on stat1002 is CRITICAL: DISK CRITICAL - free space: /tmp 2483 MB (2% inode=97%) [18:47:07] 6operations, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, 5Patch-For-Review: enwiki's job is about 28m atm and increasing - https://phabricator.wikimedia.org/T98621#1277014 (103gg5amp1e) [18:47:19] 6operations, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, 5Patch-For-Review: enwiki's job is about 28m atm and increasing - https://phabricator.wikimedia.org/T98621#1272696 (103gg5amp1e) "jobs": 27803968 [18:49:19] !log renamed a bunch more invalid usernames 
(https://phabricator.wikimedia.org/T5507) [18:49:23] Logged the message, Master [18:52:47] PROBLEM - Disk space on stat1002 is CRITICAL: DISK CRITICAL - free space: /tmp 3280 MB (3% inode=97%) [18:55:54] ori and AaronSchulz, thanks for looking into it :) [18:57:46] PROBLEM - Disk space on stat1002 is CRITICAL: DISK CRITICAL - free space: /tmp 1566 MB (1% inode=97%) [18:57:50] (03CR) 10Aaron Schulz: [C: 031] logstash: Exclude jobrunner debug messages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210086 (https://phabricator.wikimedia.org/T87521) (owner: 10BryanDavis) [19:00:01] 10Ops-Access-Requests, 6operations: Grant ebernhardson shell account access to the elasticsearch cluster - https://phabricator.wikimedia.org/T98766#1277077 (10EBernhardson) 3NEW [19:02:02] 10Ops-Access-Requests, 6operations: Grant ebernhardson shell account access to the elasticsearch cluster - https://phabricator.wikimedia.org/T98766#1277093 (10Manybubbles) I'm currently tfink's delegate while he's out on paternity leave so I do I approve this or give it to Wes? [19:09:07] PROBLEM - Disk space on stat1002 is CRITICAL: DISK CRITICAL - free space: /tmp 3130 MB (3% inode=97%) [19:12:15] (03PS3) 10Aaron Schulz: Set $wgActivityUpdatesUseJobQueue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206862 (https://phabricator.wikimedia.org/T91284) [19:12:18] RECOVERY - Disk space on stat1002 is OK: DISK OK [19:17:09] PROBLEM - Disk space on stat1002 is CRITICAL: DISK CRITICAL - free space: /tmp 2042 MB (2% inode=97%) [19:19:40] ori, I wonder why http://ganglia.wikimedia.org/latest/?c=Miscellaneous%20eqiad&h=terbium.eqiad.wmnet&m=cpu_report&r=hour&s=descending&hc=4&mc=2#metric_Global%20JobQueue%20length hasn't worked for a month or so [19:21:49] AaronSchulz: looks like it has a 10m ceiling? http://ganglia.wikimedia.org/latest/graph.php?r=month&z=xlarge&c=Miscellaneous+eqiad&h=terbium.eqiad.wmnet&jr=&js=&v=10915270&m=Global+JobQueue+length [19:29:10] AaronSchulz: dunno why. 
The metric command (/usr/local/bin/mwscript extensions/WikimediaMaintenance/getJobQueueLengths.php --totalonly | grep -oE '[0-9]+') gets the right result [19:29:17] i'll try kicking ganglia-monitor on terbium [19:29:55] actually, it runs via cron, so that won't do any good [19:30:16] PROBLEM - Disk space on stat1002 is CRITICAL: DISK CRITICAL - free space: /tmp 1291 MB (1% inode=97%) [19:33:28] (03CR) 10Dzahn: [C: 032] "confirmed identical when being sorted" [puppet] - 10https://gerrit.wikimedia.org/r/210025 (owner: 10Hashar) [19:35:14] !next [19:35:21] meh... how does that bot work? [19:35:23] jouncebot: next [19:35:23] In 0 hour(s) and 24 minute(s): Services – Parsoid / OCG / Citoid / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150511T2000) [19:37:18] !log Updated Wikidata's property suggester with data from today's json dump [19:37:20] sjoerddebruin: ^ [19:37:24] Logged the message, Master [19:38:13] I’ve pointed out the stat1002 alert to researchers [19:38:15] they’re working on it [19:38:31] hoo: will test as usual [19:38:47] (03PS2) 10Dzahn: Add python-stdeb to -dev hosts [puppet] - 10https://gerrit.wikimedia.org/r/209969 (owner: 10Merlijn van Deen) [19:38:50] :) [19:39:13] Seems like https://www.wikidata.org/wiki/Property:P1412 grew a lot. [19:39:25] (03PS3) 10Yuvipanda: tools: add python-stdeb to -dev hosts [puppet] - 10https://gerrit.wikimedia.org/r/209969 (owner: 10Merlijn van Deen) [19:39:35] Hm, nope.
[19:39:38] valhallasw: ^ usually set the module name in the commit message [19:40:09] (03CR) 10Yuvipanda: [C: 032] tools: add python-stdeb to -dev hosts [puppet] - 10https://gerrit.wikimedia.org/r/209969 (owner: 10Merlijn van Deen) [19:40:45] yuvipanda: *nod* [19:40:46] yuvipanda: eh [19:40:50] oh, right [19:41:11] I thought this was one of the ones I packaged, but it's not :D [19:41:19] yeah, I try to do that, but sometimes forget [19:43:18] PROBLEM - Disk space on stat1002 is CRITICAL: DISK CRITICAL - free space: /tmp 1001 MB (0% inode=97%) [19:46:27] PROBLEM - Disk space on stat1002 is CRITICAL: DISK CRITICAL - free space: /tmp 1229 MB (1% inode=97%) [19:47:18] hoo|away: No problems, but also not so big changes sadly. :) [19:54:04] 6operations: /tmp full on stat1002 - https://phabricator.wikimedia.org/T98773#1277238 (10Halfak) 3NEW [19:54:36] PROBLEM - Disk space on stat1002 is CRITICAL: DISK CRITICAL - free space: /tmp 1737 MB (1% inode=97%) [19:57:57] PROBLEM - puppet last run on oxygen is CRITICAL Puppet has 1 failures [19:59:27] PROBLEM - Disk space on stat1002 is CRITICAL: DISK CRITICAL - free space: /tmp 1614 MB (1% inode=97%) [20:00:05] gwicke, cscott, arlolra, subbu: Respected human, time to deploy Services – Parsoid / OCG / Citoid / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150511T2000). Please do the needful. [20:02:47] PROBLEM - Disk space on stat1002 is CRITICAL: DISK CRITICAL - free space: /tmp 3129 MB (3% inode=98%) [20:05:57] RECOVERY - Disk space on stat1002 is OK: DISK OK [20:08:46] 6operations: /tmp full on stat1002 - https://phabricator.wikimedia.org/T98773#1277274 (10ori) The alert was annoying me so I took a look. I moved a few large files to /a/moved-from-tmp , and deleted a bunch of R session temp files that haven't been accessed or modified since November. There are 39G available now. 
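[editor's note] The 20:08:46 comment above describes freeing /tmp on stat1002 by removing temp files that had not been accessed or modified for months. A minimal sketch of that kind of check follows. It is not the command ori ran and not the later puppet patch; the function name find_stale_files, the 180-day cutoff in the usage note, and the dry-run approach are all assumptions made for illustration.

```python
import os
import stat
import time
from pathlib import Path


def find_stale_files(root, max_age_days, now=None):
    """List regular files under root not accessed or modified in max_age_days.

    Hypothetical helper: a file counts as stale only when BOTH its atime
    and its mtime are older than the cutoff. Nothing is deleted here.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # 86400 seconds per day
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                st = path.lstat()  # lstat: do not follow symlinks
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if not stat.S_ISREG(st.st_mode):
                continue  # ignore symlinks, sockets, and other non-regular files
            if max(st.st_atime, st.st_mtime) < cutoff:
                stale.append(path)
    return sorted(stale)
```

Calling find_stale_files("/tmp", 180) would only produce a list to review. Deleting or moving the files is left as a separate, deliberate step, since atime may be unreliable on filesystems mounted with noatime or relatime.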
[20:09:45] thanks ori [20:10:02] (03PS1) 10Dzahn: admin: adding shell account for jynus [puppet] - 10https://gerrit.wikimedia.org/r/210172 [20:10:40] ^ :-) [20:11:23] (03PS2) 10Dzahn: admin: adding shell account for jynus [puppet] - 10https://gerrit.wikimedia.org/r/210172 [20:12:17] (03CR) 10John F. Lewis: [C: 031] admin: adding shell account for jynus [puppet] - 10https://gerrit.wikimedia.org/r/210172 (owner: 10Dzahn) [20:13:32] gj JohnFLewis [20:13:50] Reedy: hush [20:13:55] (03CR) 10Yuvipanda: [C: 031] admin: adding shell account for jynus [puppet] - 10https://gerrit.wikimedia.org/r/210172 (owner: 10Dzahn) [20:14:00] gj yuvipanda [20:14:13] jynus: if you'd like to be dropped right into combat, please feel free to review https://gerrit.wikimedia.org/r/#/c/210017/ :) [20:14:17] RECOVERY - puppet last run on oxygen is OK Puppet is currently enabled, last run 50 seconds ago with 0 failures [20:14:19] Reedy: gj on electing the tories again :P [20:14:28] I couldn't vote! :( [20:14:37] welcome jynus :) [20:14:37] Reedy: I could :P [20:14:38] yuvipanda: offended :p [20:14:45] wut. [20:15:03] (03PS1) 10Andrew Bogott: Add support for OpenStack ceilometer [puppet] - 10https://gerrit.wikimedia.org/r/210175 [20:15:05] jynus: do you have a gerrit account already? [20:15:11] Reedy: on gerrit, I mean. very terrible joke. [20:15:16] mmm [20:15:32] I have some many new accounts [20:15:45] yes, I do, ori [20:15:54] jynus: FLUSH PRIVILEGES; [20:16:09] :-) [20:16:11] jynus: so many accounts but atleast they're connected somehow [20:16:17] just kidding, I saw http://dbahire.com/stop-using-flush-privileges/ [20:16:24] I supposed so :-) [20:17:19] so, what is exactly the workflow for reviewing? [20:18:12] check the diffs of the files changed [20:18:34] well, first, you want to quickly scan the patch and find a completely menial issue, like trailing whitespace, or something like that. then you -1 the patch, putting the ball back in the court of the patch author. 
[20:18:34] (see the "side-by-side" button in the diff column there somewhere, or the unified one if you prefer that) [20:18:34] jynus: so i made a ticket in phabricator for the onboarding, we can check things off the list one by one [20:18:57] usually the author will have fixed the issue you pointed out within a few seconds, but it is too late -- you moved on! [20:19:00] you can leave inline comments by double clicking a line and hitting save [20:19:07] then two months later get pinged by author about said patch [20:19:18] at which point you should feel properly guilty about letting it linger, and merge it without review [20:19:19] ah, so it is purely web based! [20:19:25] then to publish comments you hit the R key, and sometimes leave a more general message and sometimes set CR+1/CR-1/CR+2/CR-2 etc. [20:19:44] well [20:19:58] some things you should download and test locally [20:20:09] but the review system itself is web-based, yes [20:20:09] 6operations: onboarding Jaime Crespo in ops - https://phabricator.wikimedia.org/T98775#1277296 (10Legoktm) [20:20:11] since you’re in the ops team it usually amounts to 1. upload patch, 2. ‘eh, nobody is going to want to review this, right?’ 3. hit merge, 4. 
submit 4 followup patches to fix the trivial issues you missed :D [20:20:13] You can send SSH commands to gerrit to do shizz [20:20:25] jynus: there's a CLI tool for retrieving changes, and it's even possible to submit review comments via the CLI, but there is no good interface for that [20:20:25] or hurry revert because $lol [20:20:36] JohnFLewis: https://www.mediawiki.org/wiki/Gerrit/Code_review [20:20:37] err [20:20:39] jynus: ^ [20:20:45] Krenair, yeah, of course [20:20:47] yuvipanda: :p [20:20:50] jynus, oh man it looks like you have so many accounts/access still to create [20:21:27] 10Ops-Access-Requests, 6operations: onboarding Jaime Crespo in ops - https://phabricator.wikimedia.org/T98775#1277284 (10Dzahn) [20:21:32] QUICK LET’S OVERWHELM HIM UNTIL HE RUNS AWAY BWAHAHA! [20:21:41] mutante is helping me off channel [20:21:46] do not worry [20:21:52] :) [20:22:01] mutante: do ya'll have a page outlining needed onboarding stuffs? [20:22:20] zomg, cabal [20:22:28] 10Ops-Access-Requests, 6operations: onboarding Jaime Crespo in ops - https://phabricator.wikimedia.org/T98775#1277321 (10Krenair) [20:23:01] greg-g: yes, i copied the ticket for Moritz which i copied from the list in RT [20:23:07] hah [20:23:11] so "page" [20:23:15] 10Ops-Access-Requests, 6operations: onboarding Jaime Crespo in ops - https://phabricator.wikimedia.org/T98775#1277284 (10Krenair) [20:23:33] 10Ops-Access-Requests, 6operations: root shell for Jaime Crespo - https://phabricator.wikimedia.org/T98777#1277315 (10Krenair) [20:24:01] (03CR) 10Andrew Bogott: [C: 032] Add support for OpenStack ceilometer [puppet] - 10https://gerrit.wikimedia.org/r/210175 (owner: 10Andrew Bogott) [20:24:21] greg-g: well yea, but it was an actual ticket template [20:24:45] ahhhhh [20:25:19] then I should close here: https://phabricator.wikimedia.org/T98727 [20:25:32] because I didn't even know what I was asking [20:26:09] 10Ops-Access-Requests, 6operations: onboarding Jaime Crespo in ops - 
https://phabricator.wikimedia.org/T98775#1277330 (10Krenair) [20:26:13] jynus: depends. ideally it should stay open as it is an access request for a phab group but ops don't have to follow the process (but they should ofc) [20:26:22] jynus: that's valid, we can use that as an example :) [20:26:35] ok [20:27:33] 6operations: Access for jcrespo to WMF-NDA group and operations project (?) - https://phabricator.wikimedia.org/T98727#1277336 (10Dzahn) [20:27:34] 10Ops-Access-Requests, 6operations: onboarding Jaime Crespo in ops - https://phabricator.wikimedia.org/T98775#1277335 (10Dzahn) [20:27:49] (03PS1) 10Hashar: contint: packages for Android SDK [puppet] - 10https://gerrit.wikimedia.org/r/210177 (https://phabricator.wikimedia.org/T88494) [20:27:53] 6operations, 6Phabricator: Access for jcrespo to WMF-NDA group and operations project - https://phabricator.wikimedia.org/T98727#1277337 (10JohnLewis) p:5Triage>3Normal [20:29:13] 10Ops-Access-Requests, 6operations: onboarding Jaime Crespo in ops - https://phabricator.wikimedia.org/T98775#1277284 (10Dzahn) LDAP (terbium): added to ops group [20:29:59] the other thing is that, as sean didn't want me to delete an en: table on my first day, he proposed me to work on https://phabricator.wikimedia.org/T92693 [20:30:18] 6operations, 6WMF-NDA-Requests: Need access to WMF-NDA group - https://phabricator.wikimedia.org/T98722#1277350 (10Krenair) #WMF-NDA-Requests is really supposed to be for volunteers to use... [20:31:17] I have some other misc. labswiki tasks for a db admin to look at [20:31:27] jynus: if you're going to work on it, switch the assignee to yourself so people don't bug Sean to get it done :) [20:32:05] (03PS3) 10Dzahn: admin: adding shell account for jynus [puppet] - 10https://gerrit.wikimedia.org/r/210172 (https://phabricator.wikimedia.org/T98777) [20:32:14] yes, JohnFLewis I will, I couldn't before- but let me finish with other tasks first! 
[20:32:40] jynus: just proding you into the stressful workflow you'll have ;) [20:32:40] (03CR) 10Dzahn: [C: 032] admin: adding shell account for jynus [puppet] - 10https://gerrit.wikimedia.org/r/210172 (https://phabricator.wikimedia.org/T98777) (owner: 10Dzahn) [20:33:21] 10Ops-Access-Requests, 6operations: onboarding Jaime Crespo in ops - https://phabricator.wikimedia.org/T98775#1277367 (10Dzahn) [20:34:08] 10Ops-Access-Requests, 6operations: onboarding Jaime Crespo in ops - https://phabricator.wikimedia.org/T98775#1277373 (10Krenair) [20:35:07] JohnFLewis: I don't think other days are going to be like this onboarding madness :) [20:35:31] Krenair: indeed but gotta prepare for the worst days! [20:36:14] (03PS1) 10Ori.livneh: Clean up R session temp files from /tmp on stat nodes [puppet] - 10https://gerrit.wikimedia.org/r/210183 [20:36:17] 10Ops-Access-Requests, 6operations: onboarding Jaime Crespo in ops - https://phabricator.wikimedia.org/T98775#1277382 (10Dzahn) [20:36:21] yuvipanda: ^ [20:36:45] 10Ops-Access-Requests, 6operations: onboarding Jaime Crespo in ops - https://phabricator.wikimedia.org/T98775#1277384 (10Krenair) [20:36:48] ori: we should get halfak to +1 that, I guess [20:36:54] halfak: ^ [20:36:57] (03CR) 10jenkins-bot: [V: 04-1] Clean up R session temp files from /tmp on stat nodes [puppet] - 10https://gerrit.wikimedia.org/r/210183 (owner: 10Ori.livneh) [20:37:01] grr [20:37:06] mutante, the ops ldap group is what controls access to operations/puppet +2, so I just ticked that as we know gerrit is OK [20:37:35] operations-puppet-tox-py27 ? really? 
[20:37:45] (03PS2) 10Ori.livneh: Clean up R session temp files from /tmp on stat nodes [puppet] - 10https://gerrit.wikimedia.org/r/210183 [20:37:51] rebased [20:37:55] ori: it's been failing quite a bit recently :/ [20:38:03] Krenair: ok, cool, yes i added to wmf and ops [20:38:10] (03PS1) 10Dzahn: admin: add jynus to ops [puppet] - 10https://gerrit.wikimedia.org/r/210184 (https://phabricator.wikimedia.org/T98777) [20:38:35] (03CR) 10Hashar: [C: 031 V: 032] "Deployed on integration puppetmaster." [puppet] - 10https://gerrit.wikimedia.org/r/210177 (https://phabricator.wikimedia.org/T88494) (owner: 10Hashar) [20:38:55] (03CR) 10jenkins-bot: [V: 04-1] admin: add jynus to ops [puppet] - 10https://gerrit.wikimedia.org/r/210184 (https://phabricator.wikimedia.org/T98777) (owner: 10Dzahn) [20:39:28] (03CR) 10John F. Lewis: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/210184 (https://phabricator.wikimedia.org/T98777) (owner: 10Dzahn) [20:39:39] (03PS2) 10Ori.livneh: admin: add jynus to ops [puppet] - 10https://gerrit.wikimedia.org/r/210184 (https://phabricator.wikimedia.org/T98777) (owner: 10Dzahn) [20:39:47] (rebase) [20:39:57] ori: recheck should work though [20:40:09] yeah, i didn't see it in time [20:40:13] :) [20:40:25] probably look at making that test non voting as its failing too much [20:40:50] (03CR) 10John F. Lewis: [C: 031] admin: add jynus to ops [puppet] - 10https://gerrit.wikimedia.org/r/210184 (https://phabricator.wikimedia.org/T98777) (owner: 10Dzahn) [20:41:34] sjoerddebruin: Well, I guess that's good :) [20:41:44] No, people should work harder! 
[20:41:48] (03PS3) 10Dzahn: admin: add jynus to ops [puppet] - 10https://gerrit.wikimedia.org/r/210184 (https://phabricator.wikimedia.org/T98777) [20:41:59] (03CR) 10Dzahn: [C: 032] admin: add jynus to ops [puppet] - 10https://gerrit.wikimedia.org/r/210184 (https://phabricator.wikimedia.org/T98777) (owner: 10Dzahn) [20:42:12] 10Ops-Access-Requests, 6operations: onboarding Jaime Crespo in ops - https://phabricator.wikimedia.org/T98775#1277401 (10jcrespo) [20:43:42] RoanKattouw, are you there? [20:45:25] We get another dba? :) [20:46:00] yes [20:46:14] a bad one, but it was the only one we had [20:46:29] jynus: sooo.. your account literally just got created on the bastion host [20:46:32] (03PS1) 10Ori.livneh: MediaWiki: clean up EasyTimeline and png files in /tmp [puppet] - 10https://gerrit.wikimedia.org/r/210186 [20:46:42] (03PS2) 10Ori.livneh: MediaWiki: clean up EasyTimeline and png files in /tmp [puppet] - 10https://gerrit.wikimedia.org/r/210186 [20:46:49] (03CR) 10Ori.livneh: [C: 032 V: 032] MediaWiki: clean up EasyTimeline and png files in /tmp [puppet] - 10https://gerrit.wikimedia.org/r/210186 (owner: 10Ori.livneh) [20:46:54] jynus: try iron.wikimedia.org [20:47:04] jynus: Just pretend to know what you're doing... they tend to not figure that... I've heard [20:48:28] I think I need someone who in the SWAT team [20:48:47] devunt: what's up? [20:49:07] !log Resolved T98695 by setting the email of the global account to the former enwiki email address. [20:49:10] Logged the message, Master [20:49:44] I have a new extension to deploy to the production cluster [20:49:55] what is it? 
[20:50:28] https://www.mediawiki.org/wiki/Extension:Josa [20:50:37] here's a changeset [20:50:37] https://gerrit.wikimedia.org/r/#/c/210069/ [20:51:02] hoo: 2015-05-11 19:46:53 mw1224 wikidatawiki exception INFO: [6f101a99] /w/api.php?fromrev=14099&action=compare&torev=14100&maxlag=5&format=json MWException from line 796 of /srv/mediawiki/php-1.26wmf4/includes/diff/DifferenceEngine.php: Diff not implemented for Wikibase\ItemContent; override generateContentDiffBody to fix this. [20:51:17] devunt: looking [20:51:18] It has finished its own test on beta cluster for a while [20:53:04] AaronSchulz: lots of stuff in exception.log about CAS updates failing [20:54:30] ori, according to the gerrit, and I didn't realise until now, but it seems you accepted my changeset before [20:54:33] ref: https://gerrit.wikimedia.org/r/#/c/203627/ [20:55:01] devunt: yeah, that was to add it to beta [20:55:11] yes it was. [20:55:12] https://gerrit.wikimedia.org/r/#/c/210069/ is the one for prod [20:55:18] it looks ok, just taking a quick look [20:56:03] (03PS2) 10Ori.livneh: Deploy Josa extension to production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210069 (https://phabricator.wikimedia.org/T15712) (owner: 10devunt) [20:56:17] (03CR) 10Ori.livneh: [C: 032] Deploy Josa extension to production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210069 (https://phabricator.wikimedia.org/T15712) (owner: 10devunt) [20:56:23] (03Merged) 10jenkins-bot: Deploy Josa extension to production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210069 (https://phabricator.wikimedia.org/T15712) (owner: 10devunt) [20:57:07] 6operations, 6WMF-NDA-Requests: Need access to WMF-NDA group - https://phabricator.wikimedia.org/T98722#1277425 (10ZhouZ) Hi @Krenair, what process should Foundation employees use to request access? [20:57:46] devunt: it breaks on the main page of kowiki [20:57:58] i deployed it to a test host [20:58:04] what's the error? 
[20:58:11] https://dpaste.de/GpxX/raw [20:58:11] do you have to build the localization cache first? [20:58:18] oh, duh. [20:58:33] gj or [20:58:35] i [20:58:39] hurr durr [20:58:44] add to extension-list, run scap, then enable? [20:58:48] thanks everybody for the help on getting me onboard! [21:00:00] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: root shell for Jaime Crespo - https://phabricator.wikimedia.org/T98777#1277434 (10Dzahn) 5Open>3Resolved a:3Dzahn user created on iron May 11 20:55:11 iron sshd[27285]: Accepted publickey for jynus ... ``` root@iron:/etc/sudoers.d# id jynus uid=... [21:00:01] 10Ops-Access-Requests, 6operations: onboarding Jaime Crespo in ops - https://phabricator.wikimedia.org/T98775#1277437 (10Dzahn) [21:00:25] !log ori Started scap: I45c1c76d4: Deploy Josa extension to production (but not enabling yet) [21:00:36] Logged the message, Master [21:01:46] 10Ops-Access-Requests, 6operations: onboarding Jaime Crespo in ops - https://phabricator.wikimedia.org/T98775#1277443 (10Krenair) [21:02:51] (03PS1) 10Yuvipanda: [WIP] Initial commit [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210196 [21:04:59] 10Ops-Access-Requests, 6operations: onboarding Jaime Crespo in ops - https://phabricator.wikimedia.org/T98775#1277452 (10Dzahn) [21:05:32] (03PS2) 10Yuvipanda: [WIP] Initial commit [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210196 [21:06:45] 6operations, 6WMF-NDA-Requests: Need access to WMF-NDA group - https://phabricator.wikimedia.org/T98722#1277473 (10Krenair) To be honest, I'm not sure we have one... Maybe we should change #WMF-NDA-Requests a bit. 
[21:07:29] I was very scared when ori tolds me I broke the main page of kowiki [21:07:55] (03CR) 10Ori.livneh: [WIP] Initial commit (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210196 (owner: 10Yuvipanda) [21:07:57] I thought it was on the production and it was my fault [21:08:20] (03CR) 10Merlijn van Deen: "After reading the puppet docs, I *think* I get inheritance, and I *think* it does what we want (see added comment). we can then remove $is" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/205914 (https://phabricator.wikimedia.org/T74867) (owner: 10Merlijn van Deen) [21:08:33] devunt: it wasn't, and it wasn't :) [21:08:34] ori: TIL [21:08:57] devunt, I wouldn't be scared [21:09:09] devunt, if you broke production wikipedia you can get a t-shirt IIRC [21:09:21] 10Ops-Access-Requests, 6operations: onboarding Jaime Crespo in ops - https://phabricator.wikimedia.org/T98775#1277484 (10Dzahn) [21:09:27] :D [21:09:32] does that also count for wikisource? Because then I want my t-shirt :-p [21:09:50] wow that sounds great [21:09:57] I had 30 shirts. I think there is one left [21:10:18] "I broke wikipedia and all I got was this lousy t-shirt"? [21:10:35] yeah there is one left [21:10:52] valhallasw: men's large [21:10:53] valhallasw: close -- https://twitter.com/bd808/status/511661882634407937 [21:10:54] does that work? [21:11:27] 6operations, 6WMF-NDA-Requests: Need access to WMF-NDA group - https://phabricator.wikimedia.org/T98722#1277492 (10Qgil) I followed the description at #wmf-nda and I added #operations because, indeed, the regular process signing an NDA online is for volunteers, not for employees. [21:11:28] ori: haha. I didn't fix it, though, someone awesome from ops did :( [21:11:46] i think you qualify [21:11:59] :D Yeah, I would like one <3 [21:12:08] oh it's real t-shirts [21:12:10] bd808: ok by you? 
[21:12:14] I like that t-shirts [21:12:25] devunt: you're too late, that was the last one :( [21:12:33] ori: yeah +2 to giving the last one to valhallasw [21:12:42] ori: valhallasw has root on tools now, I bet that’s not going to be the last time he breaks stuff :) [21:12:47] 6operations, 6WMF-NDA-Requests: Access for jcrespo to WMF-NDA group and operations project - https://phabricator.wikimedia.org/T98727#1277497 (10Qgil) [21:12:54] valhallasw: email me your mailing address? (ori@wikimedia.org) [21:13:08] that's a shame [21:13:23] ori: might be easier to ship via yuvipanda (or someone else who'll be at the lyon hackathon) [21:13:32] oh, i will be [21:13:36] :D [21:13:41] even better! [21:13:56] how do i make sure i don't forget..hmm [21:14:09] ori: put it with your passport [21:14:11] OH WAIT :P [21:14:12] yuvipanda: how are you with remembering things? [21:14:24] I can't remember. [21:14:28] ori: I’m ok. I precommit to help, and having only one suitcase of belongings helps :D [21:14:38] * ori brings t-shirt over to yuvi's desk [21:14:39] for example: I already have my bags packged :P [21:14:50] ori: Put it with my UK power adaptor you borrowed in London when you find a place that you will remember ;) [21:16:12] bd808: I HAVE THAT IN MY BAG [21:16:19] fuck. I meant to give it to you when you were here. [21:16:28] hahah [21:16:30] :D [21:16:33] heh [21:16:40] I still have my UK adaptor from my post-Zurich trip [21:16:58] bd808: are you going to lyon? [21:17:01] i'll have it with me if so [21:17:02] * valhallasw hands around banana plug cables [21:17:16] ori: yup. amybe one of us will remember there [21:18:44] I really want to go lyon hackathon while I'm in Europe but I don't have time to attend three days fully :/ [21:19:27] What are you planning to visit in Europe? [21:19:58] see you tomorrow, bye! 
[21:20:17] short term studying [21:20:45] 6operations, 6WMF-NDA-Requests: Access for jcrespo to WMF-NDA group and operations project - https://phabricator.wikimedia.org/T98727#1277553 (10Dzahn) added to Operations members on https://phabricator.wikimedia.org/project/members/29/ [21:20:55] (03PS3) 10Yuvipanda: [WIP] Initial commit [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210196 [21:21:42] I'm in Europe since late march :) [21:26:17] ori, any progress on deployment? [21:26:29] Ah :-) I'm off to bed, so good night (/have a good afternoon) to you all [21:27:16] gute nacht [21:35:18] PROBLEM - Apache HTTP on mw1110 is CRITICAL - Socket timeout after 10 seconds [21:35:20] (03PS4) 10Jdlrobson: Enable Gather on WikiVoyage and Hebrew wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208615 (https://phabricator.wikimedia.org/T97488) [21:35:37] PROBLEM - HHVM rendering on mw1110 is CRITICAL - Socket timeout after 10 seconds [21:36:48] 6operations: Need WMF-NDA group access for Zhou Zhou (Legal Counsel, WMF) - https://phabricator.wikimedia.org/T98787#1277628 (10ZhouZ) 3NEW [21:39:13] (03CR) 10Dzahn: [C: 032] package_builder: Add lintian [puppet] - 10https://gerrit.wikimedia.org/r/209434 (owner: 10Alexandros Kosiaris) [21:40:03] (03CR) 10Dzahn: [C: 032] puppetmaster: remove extraneous empty line [puppet] - 10https://gerrit.wikimedia.org/r/209264 (owner: 10Alexandros Kosiaris) [21:41:12] (03CR) 10Dzahn: [C: 032] puppetmaster: cleanups in gitsync [puppet] - 10https://gerrit.wikimedia.org/r/209263 (owner: 10Alexandros Kosiaris) [21:42:36] (03CR) 10Dzahn: [C: 032] puppetmaster: Move system::role to the role class [puppet] - 10https://gerrit.wikimedia.org/r/209260 (owner: 10Alexandros Kosiaris) [21:43:56] !log Restarting HHVM on mw1110; threads stuck on HPHP::StatCache::refresh [21:44:00] Logged the message, Master [21:44:57] RECOVERY - Apache HTTP on mw1110 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.077 second response time 
[21:45:08] PROBLEM - HHVM busy threads on mw1110 is CRITICAL 33.33% of data above the critical threshold [86.4] [21:45:08] RECOVERY - HHVM rendering on mw1110 is OK: HTTP OK: HTTP/1.1 200 OK - 71004 bytes in 0.199 second response time [21:46:06] (03CR) 10Dzahn: [C: 031] Enable Gather on WikiVoyage and Hebrew wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208615 (https://phabricator.wikimedia.org/T97488) (owner: 10Jdlrobson) [21:46:21] devunt: allllmost there [21:47:19] !log ori Finished scap: I45c1c76d4: Deploy Josa extension to production (but not enabling yet) (duration: 46m 54s) [21:47:25] Logged the message, Master [21:47:46] geebus [21:48:34] 46 minutes [21:48:41] !log ori Synchronized wmf-config/InitialiseSettings.php: I45c1c76d4: Deploy Josa extension to production (enabling) (duration: 00m 13s) [21:48:44] Logged the message, Master [21:49:08] what kind of things are usually done in scap? [21:49:26] it takes _very_ long time [21:49:30] rebuilding the localisation cache (.cdb files) from .json files [21:50:14] devunt: do things look ok on kowiki? I don't see any errors, but I can't tell if the extension is working properly, since I don't know Korean :) [21:50:28] whole core and extensions stuffs? [21:50:37] I'll check it now [21:51:11] (03PS1) 10BryanDavis: beta: Add common wikitag for all beta cluster wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210203 (https://phabricator.wikimedia.org/T98772) [21:51:27] RECOVERY - HHVM busy threads on mw1110 is OK Less than 30.00% above the threshold [57.6] [21:52:21] ori, I can't find entry "Josa" in version page [21:53:08] PROBLEM - Apache HTTP on mw1036 is CRITICAL - Socket timeout after 10 seconds [21:53:11] is it enabled properly? 
[21:53:28] PROBLEM - HHVM rendering on mw1036 is CRITICAL - Socket timeout after 10 seconds [21:53:50] devunt: try now [21:53:53] * ori looks into mw1036 [21:54:19] !log Restarting HHVM on mw1036; threads stuck on HPHP::StatCache::refresh [21:54:25] Logged the message, Master [21:54:37] RECOVERY - Apache HTTP on mw1036 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.176 second response time [21:54:54] ori, legoktm, ^demon|hellabusy -- can I get a review of https://gerrit.wikimedia.org/r/210203 for spooky config magic sanity? [21:54:56] seems that whenever we touch a bunch of files HHVM's StatCache freaks out [21:54:57] RECOVERY - HHVM rendering on mw1036 is OK: HTTP OK: HTTP/1.1 200 OK - 70996 bytes in 0.171 second response time [21:55:38] I wonder if it's really the kernel freaking about about a ton of inotify events? [21:55:44] (03CR) 10Ori.livneh: [C: 031] beta: Add common wikitag for all beta cluster wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210203 (https://phabricator.wikimedia.org/T98772) (owner: 10BryanDavis) [21:56:01] Or StatCache not handling a flood of events well maybe [21:56:28] yeah, I can't imagine it's a kernel bug, we're not touching _that_ many files [21:56:35] (03CR) 10Legoktm: [C: 031] beta: Add common wikitag for all beta cluster wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210203 (https://phabricator.wikimedia.org/T98772) (owner: 10BryanDavis) [21:56:35] inotify would be a little bit sad if that was all it took to overwhelm it [21:57:08] devunt: congrats :) [21:57:19] thank you [21:57:24] I think it works well [21:57:30] :) [21:57:51] getting an extension deployed to WMF prod is a pretty cool thing to have accomplished [21:58:40] ori: Am I clear to merge and sync that beta config change? 
[21:58:51] and it was a 7 years old bug [21:58:56] bd808, yeah [21:59:11] creation date of T15712 is Apr 12 2008, 10:37 PM [21:59:21] could someone add me to https://phabricator.wikimedia.org/project/profile/13/? [22:00:25] (CR) BryanDavis: [C: 2] beta: Add common wikitag for all beta cluster wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/210203 (https://phabricator.wikimedia.org/T98772) (owner: BryanDavis) [22:00:33] (Merged) jenkins-bot: beta: Add common wikitag for all beta cluster wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/210203 (https://phabricator.wikimedia.org/T98772) (owner: BryanDavis) [22:01:09] Negative24: {{done}} [22:01:26] (CR) Dzahn: [C: 2] Deploy hotfixes for phabricator sprint extension bugs: [puppet] - https://gerrit.wikimedia.org/r/209847 (https://phabricator.wikimedia.org/T98464) (owner: 20after4) [22:01:27] ori: thanks [22:01:31] !log bd808 Synchronized wmf-config/InitialiseSettings-labs.php: Add common wikitag for all beta cluster wikis (duration: 00m 12s) [22:01:49] Logged the message, Master [22:03:20] is Josa first wmf-deployed extension that aims specific language? [22:03:25] I'm just curious [22:04:45] not sure [22:04:58] Nemo_bis or Nikerabbit would know, most likely [22:05:09] or kart_ [22:05:34] !log removed /var/run/phab_repo_lock_libext_Sprint on iridium to allow sprint repo sync [22:05:37] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.69% of data above the critical threshold [500.0] [22:05:40] Logged the message, Master [22:10:36] PROBLEM - puppet last run on cp4004 is CRITICAL puppet fail [22:15:33] (PS1) Andrew Bogott: Install ceilometer on virt1000 [puppet] - https://gerrit.wikimedia.org/r/210210 [22:16:43] andrewbogott: so all hosts are on trusty now? does this mean we can upgrade? [22:16:47] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [22:16:58] yuvipanda: just the compute nodes.
[22:17:50] andrewbogott: oh, so does upgrading require moving labnet / virt1000 too? [22:17:50] https://phabricator.wikimedia.org/T90823 and https://phabricator.wikimedia.org/T90824 remain [22:18:13] yeah +1 on doing that on new hardware [22:18:16] (virt1000 [22:18:18] (CR) Andrew Bogott: [C: 2] Install ceilometer on virt1000 [puppet] - https://gerrit.wikimedia.org/r/210210 (owner: Andrew Bogott) [22:20:08] Krenair: shall I put the Wikitech OAuth patch on SWAT? [22:20:39] Krenair: or should it have its own window? [22:21:07] yuvipanda, I'm not sure. [22:22:15] If you think it should just work and andrewbogott agrees, you can try putting it on swat. I doubt it'd cause anything bad that isn't trivially revert-able and limited to wikitech [22:22:36] I haven’t read it and this is all news to me [22:23:12] (PS1) Andrew Bogott: Revert "Install ceilometer on virt1000" [puppet] - https://gerrit.wikimedia.org/r/210214 [22:23:14] andrewbogott: we enabled OAuth on wikitech a long time ago, and then it was forgotten in the move to silver. [22:23:17] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.69% of data above the critical threshold [500.0] [22:23:20] bblack, is graphoid using varnishe? https://phabricator.wikimedia.org/P635 [22:23:21] Ops-Access-Requests, operations: onboarding Jaime Crespo in ops - https://phabricator.wikimedia.org/T98775#1277801 (Dzahn) [22:23:23] operations, WMF-NDA-Requests: Access for jcrespo to WMF-NDA group and operations project - https://phabricator.wikimedia.org/T98727#1277798 (Dzahn) Open>Resolved a:Dzahn also added to WMF-NDA in https://phabricator.wikimedia.org/project/sprint/members/61/ [22:23:27] greg-g, are you around? what do you think? [22:23:49] Ops-Access-Requests, operations: onboarding Jaime Crespo in ops - https://phabricator.wikimedia.org/T98775#1277284 (Dzahn) [22:23:54] greg-g: re: re-enabling OAuth on wikitech, can it ride SWAT or should it have its own window?
[22:24:41] There haven't been any schema changes to oauth stuff since it was moved, right? [22:24:58] oh there could have been. [22:25:06] I’d suggest just dropping the current tables and setting it up from scratch? [22:25:12] (CR) Andrew Bogott: [C: 2] Revert "Install ceilometer on virt1000" [puppet] - https://gerrit.wikimedia.org/r/210214 (owner: Andrew Bogott) [22:25:17] PROBLEM - puppet last run on virt1000 is CRITICAL Puppet has 3 failures [22:26:21] Ops-Access-Requests, operations: onboarding Jaime Crespo in ops - https://phabricator.wikimedia.org/T98775#1277804 (Dzahn) a:Dzahn [22:26:38] RECOVERY - puppet last run on cp4004 is OK Puppet is currently enabled, last run 23 seconds ago with 0 failures [22:26:47] RECOVERY - puppet last run on virt1000 is OK Puppet is currently enabled, last run 20 seconds ago with 0 failures [22:26:47] yuvipanda, ... yeah in that case I would ask greg-g for something more official rather than requesting it goes out in a swat deploy [22:27:01] Krenair: yeah, makes sense. [22:27:32] might be better to rename the tables [22:27:34] Krenair: can you help with the deploy? :) I’ve no idea how to deploy. [22:27:46] yeah, rename to _old and then forget about them forever :D (that’s what we do ;)) [22:27:47] (CR) Dzahn: [C: 2] Add mira to deployment network rule [puppet] - https://gerrit.wikimedia.org/r/209875 (https://phabricator.wikimedia.org/T95436) (owner: John F. Lewis) [22:28:43] yuvipanda, whoever does the swat deployments in half an hour will be able to do it [22:28:58] only weird part is logging into silver.wikimedia.org to mess with the labswiki database [22:29:25] Krenair: hmm, I don’t see greg-g here so maybe not today [22:30:41] look at the calendar [22:31:45] (PS1) Andrew Bogott: Move virt1001 to Trusty. [puppet] - https://gerrit.wikimedia.org/r/210217 [22:31:48] andrewbogott, yuvipanda: speaking of silver, I noticed it has php 5.5.9 instead of 5.3.10?
[22:31:49] Krenair: no, I meant, greg-g isn’t here atm and so I dunno if I can get a window after SWAT :) [22:32:39] some other hosts still have 5.3 :( [22:33:08] (PS2) Andrew Bogott: Move virt1001 to Trusty. [puppet] - https://gerrit.wikimedia.org/r/210217 [22:33:12] ori, I'm afraid I found the a little problem on my extension [22:33:28] devunt: do you know how to fix it? [22:33:43] yes I already submitted a patch to gerrit [22:33:46] not merged yet [22:33:53] andrewbogott: ^ is this for the ciscos? [22:33:54] what's the patch? [22:34:03] https://gerrit.wikimedia.org/r/#/c/210216/ [22:34:11] yuvipanda: I’m just setting up a test case. [22:34:17] andrewbogott: ah sweet. ok [22:34:27] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [22:34:35] want to experiment with switching the network node on the fly. [22:34:41] yuvipanda: what ya need? [22:34:43] ah, right [22:34:53] (CR) Andrew Bogott: [C: 2] Move virt1001 to Trusty. [puppet] - https://gerrit.wikimedia.org/r/210217 (owner: Andrew Bogott) [22:34:56] greg-g: extension re-deployment on wikitech, so I need a window *and* a deployer :P [22:35:44] yuvipanda: if you can find a deployer, you can take a window [22:35:58] alright, https://phabricator.wikimedia.org/T98567 is the task. [22:36:11] greg-g: ok, I’ll ask around. thanks [22:41:56] PROBLEM - Host virt1001 is DOWN: PING CRITICAL - Packet loss = 100% [22:44:42] ori, if you don't mind, can you +2 on these changesets too?
[22:44:43] https://gerrit.wikimedia.org/r/#/q/status:open+project:mediawiki/extensions/Josa+branch:master+topic:fix-style,n,z [22:44:51] it's just style changes [22:45:48] just whitespaces, whitespaces, and whitespaces [22:46:07] RECOVERY - Host virt1001 is UPING OK - Packet loss = 0%, RTA = 2.30 ms [22:46:24] oh, and some documentations :p [22:46:28] Ops-Access-Requests, operations: onboarding Jaime Crespo in ops - https://phabricator.wikimedia.org/T98775#1277883 (Dzahn) p:Triage>High [22:46:44] Ops-Access-Requests, operations: onboarding Jaime Crespo in ops - https://phabricator.wikimedia.org/T98775#1277284 (Dzahn) [22:47:36] (PS4) Yuvipanda: [WIP] Initial commit [software/tools-webservice] - https://gerrit.wikimedia.org/r/210196 [22:49:26] PROBLEM - RAID on virt1001 is CRITICAL: Connection refused by host [22:49:27] PROBLEM - nova-compute process on virt1001 is CRITICAL: Connection refused by host [22:49:37] PROBLEM - DPKG on virt1001 is CRITICAL: Connection refused by host [22:50:16] PROBLEM - salt-minion processes on virt1001 is CRITICAL: Connection refused by host [22:50:26] PROBLEM - configured eth on virt1001 is CRITICAL: Connection refused by host [22:50:26] PROBLEM - puppet last run on virt1001 is CRITICAL: Connection refused by host [22:50:37] PROBLEM - Disk space on virt1001 is CRITICAL: Connection refused by host [22:50:48] PROBLEM - dhclient process on virt1001 is CRITICAL: Connection refused by host [22:54:42] !log ori Synchronized php-1.26wmf5/extensions/Josa: a0b561da25: Update Josa for cherry-picks (duration: 00m 11s) [22:54:49] Logged the message, Master [22:55:03] !log ori Synchronized php-1.26wmf4/extensions/Josa: dd2db67d9b: Update Josa for cherry-picks (duration: 00m 13s) [22:55:06] Logged the message, Master [22:55:07] devunt: ^ [22:55:11] ori, it fixed [22:55:16] thank you very much [22:55:21] no problem [22:57:02] Can someone add one additional item for SWAT for me (also Flow), "Fix metadataonly parameter and use it in 
JS"? [22:57:13] https://gerrit.wikimedia.org/r/#q,I97e194e2e1ed0ec5a115a847119fd3d8e7e6bbde,n,z [22:57:22] I can't do it because I have two-factor set up, and my phone is in the shop. [22:57:28] I'm doing the bumps now. [22:57:55] ori, is only condition fix applied to production? or with style changes? [22:57:56] PROBLEM - Host virt1001 is DOWN: PING CRITICAL - Packet loss = 100% [22:58:56] Hi. [23:00:04] RoanKattouw, ^d, matt_flaschen, AaronSchulz, ebernhardson, kaldari, Dereckson: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150511T2300). Please do the needful. [23:00:32] matt_flaschen: added [23:00:36] Thanks, Dereckson [23:00:40] It seems that Zuul may have stalled, possibly because I forced through https://gerrit.wikimedia.org/r/#/c/210227/. [23:00:43] https://integration.wikimedia.org/zuul/ [23:01:10] Forcing one tends to reset the status of all the others in the queue in my experience, not completely stall it [23:01:16] Looks like it started progressing again. [23:02:17] RECOVERY - Host virt1001 is UPING OK - Packet loss = 0%, RTA = 0.43 ms [23:02:18] Dereckson, wmf4 too please: https://gerrit.wikimedia.org/r/210227 [23:02:42] (CR) Halfak: [C: 1] Clean up R session temp files from /tmp on stat nodes [puppet] - https://gerrit.wikimedia.org/r/210183 (owner: Ori.livneh) [23:03:20] matt_flaschen: added [23:03:25] Thanks [23:03:38] (but we are now at 9 patches) [23:03:43] (PS3) Yuvipanda: Clean up R session temp files from /tmp on stat nodes [puppet] - https://gerrit.wikimedia.org/r/210183 (owner: Ori.livneh) [23:03:58] (CR) Yuvipanda: [C: 2] Clean up R session temp files from /tmp on stat nodes [puppet] - https://gerrit.wikimedia.org/r/210183 (owner: Ori.livneh) [23:05:09] Yeah, sorry, but two are the same thing just on different branches, and most are config.
[23:06:43] (CR) Tim Landscheidt: Tools: Simplify and fix mail setup (1 comment) [puppet] - https://gerrit.wikimedia.org/r/205914 (https://phabricator.wikimedia.org/T74867) (owner: Merlijn van Deen) [23:08:15] Bumps: [23:08:24] 1.26wmf4 - https://gerrit.wikimedia.org/r/#/c/210238/ [23:08:29] 1.26wmf5 - https://gerrit.wikimedia.org/r/#/c/210239/ [23:11:04] operations, Labs, wikitech.wikimedia.org: labswiki DB is inaccessible from tin, terbium, etc. - https://phabricator.wikimedia.org/T98682#1277913 (Krenair) [23:12:58] Is someone doing SWAT, or am I volunteering? :) [23:13:31] matt_flaschen: it looks like you are todays winner! [23:13:43] !log manually renamed and migrated User:~~@nlwiki --> User:~~-~nlwiki@global (T98155) [23:13:46] Logged the message, Master [23:14:16] Alrighty [23:15:05] legoktm: that's an username [23:15:09] (CR) Mattflaschen: [C: 2] Flow should use VE by default [mediawiki-config] - https://gerrit.wikimedia.org/r/209042 (https://phabricator.wikimedia.org/T98168) (owner: Mattflaschen) [23:15:26] (Merged) jenkins-bot: Flow should use VE by default [mediawiki-config] - https://gerrit.wikimedia.org/r/209042 (https://phabricator.wikimedia.org/T98168) (owner: Mattflaschen) [23:15:26] Dereckson: haha yup :P [23:15:45] Dereckson, yeah, I was thinking it's very magnanimous to allow them to keep the tilde.
[23:16:57] there were a few users who had "~~~~" as their username [23:17:04] they're now "Invalid username ####" [23:17:55] !log mattflaschen Synchronized wmf-config/CommonSettings.php: Make VE default editor for Flow (duration: 00m 13s) [23:18:00] Logged the message, Master [23:29:26] (CR) Kaldari: [C: 2] Import lists for the Browse experiment on beta [mediawiki-config] - https://gerrit.wikimedia.org/r/209242 (https://phabricator.wikimedia.org/T95446) (owner: Phuedx) [23:29:34] (Merged) jenkins-bot: Import lists for the Browse experiment on beta [mediawiki-config] - https://gerrit.wikimedia.org/r/209242 (https://phabricator.wikimedia.org/T95446) (owner: Phuedx) [23:29:57] (PS1) Aaron Schulz: Bumped the $wgJobBackoffThrottling refreshLinks limit [mediawiki-config] - https://gerrit.wikimedia.org/r/210246 [23:30:01] (CR) jenkins-bot: [V: -1] Bumped the $wgJobBackoffThrottling refreshLinks limit [mediawiki-config] - https://gerrit.wikimedia.org/r/210246 (owner: Aaron Schulz) [23:31:25] (PS2) Aaron Schulz: Bumped the $wgJobBackoffThrottling refreshLinks limit [mediawiki-config] - https://gerrit.wikimedia.org/r/210246 [23:31:46] PROBLEM - Host virt1001 is DOWN: PING CRITICAL - Packet loss = 100% [23:31:53] !log mattflaschen Synchronized php-1.26wmf4/extensions/Flow/: Deploy Flow metadataonly fix (duration: 00m 13s) [23:32:00] Logged the message, Master [23:32:22] ^ don’t worry about virt*** alerts, folks! those aren’t live labs machines [23:32:39] !log andrewbogott_afk playing around with upgrading virt*** boxes, which are non-live labs boxen.
[23:32:47] Logged the message, Master [23:34:18] RECOVERY - Host virt1001 is UPING OK - Packet loss = 0%, RTA = 1.45 ms [23:34:41] !log mattflaschen Synchronized php-1.26wmf5/extensions/Flow/: Deploy Flow metadataonly fix (duration: 00m 14s) [23:34:44] Logged the message, Master [23:34:52] matt_flaschen: My change was just a beta labs config change, so I went ahead and deployed it [23:35:19] kaldari, okay. You don't need a window for Beta-only changes, though you do need to sync them to production to avoid the Puppet warnings. [23:35:37] matt_flaschen: Yep, was about to do that... [23:37:29] !log kaldari Synchronized wmf-config/InitialiseSettings-labs.php: sync InitialiseSettings-labs.php for Browse experiment in mobile (duration: 00m 13s) [23:37:34] Logged the message, Master [23:38:06] Flow is done and tested. [23:38:28] operations, WMF-NDA-Requests: ZhouZ needs access to WMF-NDA group - https://phabricator.wikimedia.org/T98722#1277981 (Dzahn) [23:38:40] AaronSchulz: Ready? [23:38:56] Reedy [23:39:02] the panda is ready [23:39:05] :P [23:39:09] he can deploy my changes [23:39:52] Josa works like a charm [23:39:52] yuvipanda, you're going to do AaronSchulz's SWAT items from https://wikitech.wikimedia.org/wiki/Deployments#Monday.2C.C2.A0May.C2.A011 ? [23:39:56] thank you for everyone [23:39:57] RECOVERY - salt-minion processes on virt1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [23:40:07] RECOVERY - configured eth on virt1001 is OK - interfaces up [23:40:13] matt_flaschen: no :) [23:40:20] ... [23:40:24] I think he was joking [23:40:27] RECOVERY - Disk space on virt1001 is OK: DISK OK [23:40:36] Don't get it, but whatever. [23:40:37] RECOVERY - dhclient process on virt1001 is OK: PROCS OK: 0 processes with command name dhclient [23:40:37] operations, WMF-NDA-Requests: ZhouZ needs access to WMF-NDA group - https://phabricator.wikimedia.org/T98722#1277985 (Dzahn) >>! 
In T98722#1277350, @Krenair wrote: > #WMF-NDA-Requests is really supposed to be for volunteers to use... Does it matter whether people receive compensation from WMF or not? [23:40:37] yuvipanda, not really [23:40:40] wat [23:40:46] I'm going to sleep now [23:40:47] RECOVERY - RAID on virt1001 is OK Active: 16, Working: 16, Failed: 0, Spare: 0 [23:40:47] RECOVERY - nova-compute process on virt1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [23:40:52] devunt: congrats for your work on this extension [23:40:54] oh [23:41:08] RECOVERY - DPKG on virt1001 is OK: All packages OK [23:41:11] AaronSchulz: these are mw core patches, no? [23:41:32] Dereckson, thank you [23:41:34] gute nacht [23:41:35] all core [23:42:04] I… don’t know what I can to do? [23:42:13] I was only making a Ready / Reedy joke. [23:42:16] AaronSchulz, alright, I'm skipping you for now. Decide who (me, you, or yuvipanda) is going to deploy. [23:42:23] ebernhardson, ready? [23:43:27] matt_flaschen: reedy [23:44:26] (CR) Mattflaschen: [C: 2] Enable CirrusSearch-PerUser PoolCounter on group0 only [mediawiki-config] - https://gerrit.wikimedia.org/r/209802 (https://phabricator.wikimedia.org/T76497) (owner: EBernhardson) [23:44:33] (Merged) jenkins-bot: Enable CirrusSearch-PerUser PoolCounter on group0 only [mediawiki-config] - https://gerrit.wikimedia.org/r/209802 (https://phabricator.wikimedia.org/T76497) (owner: EBernhardson) [23:44:40] matt_flaschen, I'll volunteer you then [23:44:47] Okay [23:45:56] operations, WMF-NDA-Requests: ZhouZ needs access to WMF-NDA group - https://phabricator.wikimedia.org/T98722#1277991 (Dzahn) @qgil Didn't this come up before and we got a kind of blanket statement from legal/HR saying that we can assume anyone who is an employee also signed an NDA at some point? In that ca... 
[23:46:16] !log mattflaschen Synchronized wmf-config: Sync wmf-config for CirrusSearch PoolCounter change; applies to group 0 initially (duration: 00m 12s) [23:46:20] Logged the message, Master [23:46:41] ebernhardson, please test. [23:47:40] matt_flaschen: at first glance, seems fine. [23:47:52] will test more and monitor logs over next cpl days before turning it on elsewhere [23:48:11] ebernhardson, okay, as long as everything didn't explode. Is this about limits on how many searches people can do at once? Kind of hard to test if so. [23:48:17] RECOVERY - puppet last run on virt1001 is OK Puppet is currently enabled, last run 5 minutes ago with 0 failures [23:48:24] matt_flaschen: yea it should limit people to 5 concurrent searches [23:48:38] i may just have to spam it with nodejs or something :) [23:48:48] Ops-Access-Requests, operations: Grant ebernhardson shell account access to the elasticsearch cluster - https://phabricator.wikimedia.org/T98766#1278019 (Dzahn) @ebernhardson Hi, do you already have a SSH key for this? Could you create one please and paste the public part, either here in phab (ticket or p... [23:50:51] Ops-Access-Requests, operations: Grant ebernhardson shell account access to the elasticsearch cluster - https://phabricator.wikimedia.org/T98766#1278026 (EBernhardson) I already have a shell account in production with deploy and research rights, just need to extend the account to also have the rights nece... [23:52:26] matt_flaschen: actually I forgot about one more config change I need to do. Are you all finished? [23:52:28] Ops-Access-Requests, operations: Grant ebernhardson shell account access to the elasticsearch cluster - https://phabricator.wikimedia.org/T98766#1278034 (EBernhardson) Also the ssh key there is only used in prod, and has been replaced as recently as April 14, 2015. [23:52:38] kaldari, no. You can add it, though.
[23:52:41] ebernhardson: right, i was blind :) patch coming up [23:52:58] (CR) Kaldari: Removing ® from mobile wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/202926 (https://phabricator.wikimedia.org/T95007) (owner: Kaldari) [23:53:03] mutante: thanks :) [23:53:26] We're obviously going to go a little over. [23:58:04] (PS1) Dzahn: admin: ebernhardson for elasticsearch-roots [puppet] - https://gerrit.wikimedia.org/r/210250 (https://phabricator.wikimedia.org/T98766) [23:58:52] matt_flaschen, I guess wmf5 should be done to, see https://gerrit.wikimedia.org/r/#/c/210243/ [23:59:08] (PS2) Dzahn: admin: ebernhardson for elasticsearch-roots [puppet] - https://gerrit.wikimedia.org/r/210250 (https://phabricator.wikimedia.org/T98766) [23:59:24] AaronSchulz, okay, thanks. Just for that one, right?