[00:07:44] (PS1) Dzahn: motd - update-motd.d file must be executable [puppet] - https://gerrit.wikimedia.org/r/168220
[00:09:19] (CR) Dzahn: [C: 2] "compare to permissions of the files in the same place but created from base module, like the last puppet run and the puppet role" [puppet] - https://gerrit.wikimedia.org/r/168220 (owner: Dzahn)
[00:15:54] (PS1) Dzahn: tor - motd script needs shebang line [puppet] - https://gerrit.wikimedia.org/r/168222
[00:16:44] (CR) Dzahn: [C: 2] tor - motd script needs shebang line [puppet] - https://gerrit.wikimedia.org/r/168222 (owner: Dzahn)
[00:18:56] (CR) Dzahn: "sigh, now it ends up in /run/motd but i still don't see it displayed when i logout and back in ?!" [puppet] - https://gerrit.wikimedia.org/r/168220 (owner: Dzahn)
[00:19:45] (CR) Dzahn: "nevermind, impatience. works" [puppet] - https://gerrit.wikimedia.org/r/168220 (owner: Dzahn)
[00:51:35] PROBLEM - MySQL Slave Delay on db1016 is CRITICAL: CRIT replication delay 310 seconds
[00:52:05] PROBLEM - MySQL Replication Heartbeat on db1016 is CRITICAL: CRIT replication delay 337 seconds
[00:53:27] PROBLEM - Apache HTTP on mw1114 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.016 second response time
[00:53:35] RECOVERY - MySQL Slave Delay on db1016 is OK: OK replication delay 0 seconds
[00:54:11] PROBLEM - HHVM rendering on mw1114 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.013 second response time
[00:54:17] RECOVERY - MySQL Replication Heartbeat on db1016 is OK: OK replication delay -0 seconds
[01:15:38] RECOVERY - puppet last run on stat1001 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[01:20:23] (PS1) Plucas: Export $CLASSPATH from init scripts [debs/kafka] - https://gerrit.wikimedia.org/r/168230
[01:20:54] (CR) Plucas: "This change pairs with https://gerrit.wikimedia.org/r/#/c/168230/." [puppet/kafka] - https://gerrit.wikimedia.org/r/163890 (owner: Plucas)
[01:36:31] PROBLEM - puppet last run on stat1001 is CRITICAL: CRITICAL: puppet fail
[01:58:28] PROBLEM - Disk space on ocg1002 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=73%):
[02:03:11] hm, why does the root partition on ocg1002 keep filling up?
[02:03:47] oh! /var/log is on /
[02:03:48] hm.
[02:04:55] we're using logstash, we don't really need to be logging to disk any more.
[02:05:44] but i'm not root on ocg1002, so i can't actually clean that up myself.
[02:05:57] robh: are you around, and do you have root on ocg1002?
[02:06:32] !log upgrade reboot db1065
[02:06:47] Logged the message, Master
[02:06:53] cscott: i can probably do that
[02:07:10] oh log files again
[02:07:12] hehe
[02:07:15] springle: yeah.
[02:07:37] springle: i'm looking at puppet right now to figure out how to turn off on-disk logging, or at least purge them more often
[02:07:45] upstart seems to log a lot too. is that duplicating ocg.log?
[02:08:06] add a size limit to logrotate i guess
[02:08:14] yeah
[02:08:16] $ more upstart/ocg.log
[02:08:16] upstart/ocg.log: Permission denied
[02:08:24] grumble grumble
[02:08:56] i also need to make a separate ocg-render-root group or some such, maybe. as ocg-render-admin i can only sudo as ocg, not root.
[02:08:57] RECOVERY - Disk space on ocg1002 is OK: DISK OK
[02:09:26] wow, /var/log/upstart/ocg.log is 1.1G
[02:10:27] !log removed old /var/log/ocg.log* on ocg1002, forced a logrotate
[02:10:33] Logged the message, Master
[02:12:52] cscott: /var/log/upstart/ocg.log is large too. lots of [info] and [debug] notices
[02:13:03] ebug: Fetched object from redis by key 3beeebf1d3452a18c760d4c3281d6c2ba12ac6b9 channel=frontend, id=1413872775434-75539, id=3beeebf1d3452a18c760d4c3281d6c2ba12ac6b9
[02:13:32] looks like the same contents as the logstash logs then.
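Editor's sketch of the "size limit" idea raised at 02:08:06. This is not the logrotate change from the channel (that lives in puppet), and the OCG service itself is not written in Python. It only illustrates, at application level and with Python's standard library, how capping each log file and the number of kept files bounds total disk use. The function names, the logger name, and the byte limits are all made up for illustration.

```python
import logging
import logging.handlers
import os
import tempfile


def make_capped_logger(path: str, max_bytes: int, backups: int) -> logging.Logger:
    """Return a logger whose files roll over before reaching max_bytes.

    At most backups old files are kept, so disk use stays below
    (backups + 1) * max_bytes as long as single records are small.
    """
    logger = logging.getLogger("capped-demo:" + path)  # hypothetical name
    logger.setLevel(logging.INFO)
    handler = logging.handlers.RotatingFileHandler(
        path, maxBytes=max_bytes, backupCount=backups
    )
    logger.addHandler(handler)
    logger.propagate = False  # keep the demo output out of the root logger
    return logger


def disk_usage(directory: str) -> int:
    """Total size in bytes of all files directly inside directory."""
    return sum(
        os.path.getsize(os.path.join(directory, name))
        for name in os.listdir(directory)
    )


if __name__ == "__main__":
    demo_dir = tempfile.mkdtemp()
    log = make_capped_logger(os.path.join(demo_dir, "ocg.log"), 1000, 2)
    for i in range(500):
        log.info("render job %d finished", i)
    print(sorted(os.listdir(demo_dir)), disk_usage(demo_dir))
```

The same bound is what a size-based logrotate rule aims for from outside the process; which of the two is appropriate depends on who owns the log file.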
[02:15:22] !log LocalisationUpdate completed (1.25wmf3) at 2014-10-23 02:15:22+00:00
[02:15:31] Logged the message, Master
[02:23:27] RECOVERY - HHVM rendering on mw1114 is OK: HTTP OK: HTTP/1.1 200 OK - 67858 bytes in 0.574 second response time
[02:23:57] RECOVERY - Apache HTTP on mw1114 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.061 second response time
[02:26:06] (PS1) Springle: repool db1065 [mediawiki-config] - https://gerrit.wikimedia.org/r/168236
[02:27:02] oh LinksUpdate, how we love thee
[02:27:49] !log LocalisationUpdate completed (1.25wmf4) at 2014-10-23 02:27:48+00:00
[02:27:56] Logged the message, Master
[02:29:52] (CR) Springle: [C: 2] repool db1065 [mediawiki-config] - https://gerrit.wikimedia.org/r/168236 (owner: Springle)
[02:29:59] (Merged) jenkins-bot: repool db1065 [mediawiki-config] - https://gerrit.wikimedia.org/r/168236 (owner: Springle)
[02:31:17] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge.
[02:31:57] !log springle Synchronized wmf-config/db-eqiad.php: repool db1065, warm up (duration: 00m 06s)
[02:32:02] Logged the message, Master
[02:40:17] (PS1) coren: Gridengine: proper puppetization, part deux [puppet] - https://gerrit.wikimedia.org/r/168237
[02:40:57] (CR) jenkins-bot: [V: -1] Gridengine: proper puppetization, part deux [puppet] - https://gerrit.wikimedia.org/r/168237 (owner: coren)
[02:41:32] (PS11) Krinkle: Gzip SVGs on back upload varnishes [puppet] - https://gerrit.wikimedia.org/r/108484 (https://bugzilla.wikimedia.org/54291) (owner: Ori.livneh)
[02:41:53] (PS4) Krinkle: Gzip .svg and .ico files on bits.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/113687 (https://bugzilla.wikimedia.org/61442) (owner: Brion VIBBER)
[02:42:01] (PS1) Brian Wolff: Partial revert f7be7e6a. Change commons abbrvThreshold back to 160. [mediawiki-config] - https://gerrit.wikimedia.org/r/168239 (https://bugzilla.wikimedia.org/72389)
[02:42:52] (PS3) Krinkle: rcstream: make lvs health check fetch /nginx_status [puppet] - https://gerrit.wikimedia.org/r/145997 (https://bugzilla.wikimedia.org/67957) (owner: Ori.livneh)
[02:43:03] (PS2) coren: Gridengine: proper puppetization, part deux [puppet] - https://gerrit.wikimedia.org/r/168237
[02:43:08] ori: What's going on with https://gerrit.wikimedia.org/r/#/c/145997/? If that is what I think it is, it's important.
[02:45:38] PROBLEM - Disk space on ocg1003 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=72%):
[02:47:21] (PS1) Cscott: Ensure that ocg logs are kept small. [puppet] - https://gerrit.wikimedia.org/r/168240
[02:47:41] (CR) coren: [C: 2] "There's honestly very little to do but try it out. As with the previous patch, this only generates config files and noops on the actual c" [puppet] - https://gerrit.wikimedia.org/r/168237 (owner: coren)
[02:50:06] If anyone is awake and around and up for deploying config changes: https://gerrit.wikimedia.org/r/#/c/168239/ reverts an accidental config change that totally broke using any file from commons where the file name has a length between 140 and 159
[02:51:06] That's an awfully specific bug. It's a bit late for me and I fear that I may make a mess of things. :-(
[02:51:27] PROBLEM - DPKG on labmon1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[02:52:22] Coren: I'm honestly expecting to have to wait until tomorrow morning
[02:53:02] Coren: But if it makes you feel better, the change it's partially reverting looks very clearly accidental (It looks like someone did git commit -a and didn't realize an extra file was committed)
[02:53:07] springle: ocg1003 needs its /var/log cleaned up. alternatively: https://gerrit.wikimedia.org/r/168240
[02:53:37] RECOVERY - DPKG on labmon1001 is OK: All packages OK
[02:54:10] Since the other change had a commit summary of "Disable log", and it in addition to disabling the log, touches another file with no mention of that in the commit summary
[02:54:43] bawolff: I'm not worried about the substance, but about my capacity to deploy something in prod without breaking everything. :-)
[02:55:22] bawolff: i can probably do it
[02:55:23] (PS1) coren: Gridengine: trivial fix of an incorrect dependency [puppet] - https://gerrit.wikimedia.org/r/168242
[02:55:37] cscott: Cool, thanks :)
[02:55:37] My own mostly harmless patch has an impressive number of attention errors already.
[02:56:21] (CR) coren: [C: 2] "Trivial fix is trivial." [puppet] - https://gerrit.wikimedia.org/r/168242 (owner: coren)
[02:56:22] Coren: Hard to blame you, if I ever had deployment rights, I would be utterly terrified of exploding everything
[02:57:13] I'm not generally that timid, but when I'm late and tired my confidence goes with it. :-)
[02:57:29] bawolff: let's see if some of the usual SWAT culprits like MARKTRACEUR or Reedy are around first, they've deployed a lot more often than me
[02:58:04] and last time i deployed there was some weird issue with the video transcode boxes which I'm only pretty sure has been fixed...
[02:58:31] cscott: You mean where they got super busy? That was brion's fault, not yours :P
[02:58:52] ^d, MARKTRACEUR, RoanKattouw_away: ping re a SWAT deploy
[02:59:17] bawolff: no, my ssh key wasn't on those particular boxes because i was a member of one admin group but not some other super-ancient one.
[02:59:49] cscott: Ah. I jumped to conclusions because there was a recent issue from tuesday
[02:59:52] hmm, just woke up in .nl and upload.wm.org is unreachable for me.
[03:00:50] apparently there's been some trouble at ams-ix tonight. might be that some routing of ams-ix peers is still not corrected
[03:01:21] !log git-deploy: Deploying integration/slave-scripts 157ef23
[03:01:27] Logged the message, Master
[03:01:39] Krinkle: also, if i just woke up... then ....
[03:01:55] ?
[03:02:14] bawolff: well, no one is jumping at this, let me take a shot
[03:02:15] then u r up late :)
[03:02:28] thedj: it's an hour earlier for me in my defence
[03:02:59] barely :)
[03:03:15] I took an evening nap. Needed the extra sleep
[03:03:24] checking in now and then going back for more
[03:03:25] (CR) Cscott: [C: 2] Partial revert f7be7e6a. Change commons abbrvThreshold back to 160. [mediawiki-config] - https://gerrit.wikimedia.org/r/168239 (https://bugzilla.wikimedia.org/72389) (owner: Brian Wolff)
[03:03:31] (PS1) coren: Tool Labs: fix template ruby error [puppet] - https://gerrit.wikimedia.org/r/168243
[03:03:35] (Merged) jenkins-bot: Partial revert f7be7e6a. Change commons abbrvThreshold back to 160. [mediawiki-config] - https://gerrit.wikimedia.org/r/168239 (https://bugzilla.wikimedia.org/72389) (owner: Brian Wolff)
[03:05:30] (CR) coren: [C: 2] "YATF." [puppet] - https://gerrit.wikimedia.org/r/168243 (owner: coren)
[03:06:10] !log cscott Synchronized wmf-config/filebackend.php: fix using a file from commons with file name length between 140 and 159 (duration: 00m 20s)
[03:06:17] Logged the message, Master
[03:06:25] bawolff: ok, all synced
[03:06:36] bawolff: test, please?
[03:06:37] Thank you. Let me check if things work now
[03:06:54] cscott: i think we'll clean up ocg1001 and 3 first, then merge your patch. logrotate may not have the disk space to gzip
[03:06:54] confirmed, https://en.wikipedia.org/wiki/File:US_Navy_120209-N-XD935-302_Mass_Communication_Specialist_1st_Class_Shane_Tuck,_assigned_to_the_Expeditionary_Combat_Camera_Underwater_Photo_Team,_c.jpg displays properly now
[03:06:59] yay!
[03:07:06] Oh, seriously? I *hate* ruby.
[03:07:43] and i can confirm that the old ssh error was indeed fixed and i'm a fully-capable deployer now
[03:08:10] * cscott feels the power
[03:08:17] RECOVERY - Disk space on ocg1003 is OK: DISK OK
[03:08:29] springle: sounds good
[03:08:55] (PS2) Springle: Ensure that ocg logs are kept small. [puppet] - https://gerrit.wikimedia.org/r/168240 (owner: Cscott)
[03:09:19] and my patch doesn't fix the /var/log/upstart/ocg* issue, because /etc/logrotate.d/upstart is read after /etc/logrotate.d/ocg so i can't override the upstart logrotate config AFAIK
[03:09:28] right
[03:09:37] but it should halve the problem
[03:09:54] springle: so i'll fix that tomorrow by turning off console logs (https://gerrit.wikimedia.org/r/168241 needs to be deployed)
[03:10:26] (CR) Springle: [C: 2] Ensure that ocg logs are kept small. [puppet] - https://gerrit.wikimedia.org/r/168240 (owner: Cscott)
[03:11:05] i considered renaming /etc/logrotate.d/ocg to /etc/logrotate.d/zz-ocg or some such, but it seemed too evil
[03:13:13] (Abandoned) Cscott: Revert "Re-enable PediaPress POD in production." [mediawiki-config] - https://gerrit.wikimedia.org/r/168082 (https://bugzilla.wikimedia.org/71675) (owner: MarkTraceur)
[03:13:44] (PS1) coren: Tool Labs: yet another funky ruby type error [puppet] - https://gerrit.wikimedia.org/r/168244
[03:14:58] (CR) coren: [C: 2] Tool Labs: yet another funky ruby type error [puppet] - https://gerrit.wikimedia.org/r/168244 (owner: coren)
[03:15:47] RECOVERY - puppet last run on stat1001 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[03:18:05] (PS1) coren: Tool Labs: trivial fix to the gridengine manifest [puppet] - https://gerrit.wikimedia.org/r/168245
[03:18:16] * Coren *loves* the puppet compiler not working in labs.
[03:18:58] (CR) coren: [C: 2] Tool Labs: trivial fix to the gridengine manifest [puppet] - https://gerrit.wikimedia.org/r/168245 (owner: coren)
[03:21:15] (PS1) coren: Tool Labs: remove stray dependency [puppet] - https://gerrit.wikimedia.org/r/168246
[03:21:27] PROBLEM - Apache HTTP on mw1114 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.014 second response time
[03:22:03] (CR) coren: [C: 2] Tool Labs: remove stray dependency [puppet] - https://gerrit.wikimedia.org/r/168246 (owner: coren)
[03:22:08] PROBLEM - HHVM rendering on mw1114 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.025 second response time
[03:22:09] And that's enough for tonight.
[03:36:27] PROBLEM - puppet last run on stat1001 is CRITICAL: CRITICAL: puppet fail
[03:41:49] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Oct 23 03:41:49 UTC 2014 (duration 41m 48s)
[03:41:57] Logged the message, Master
[03:45:18] PROBLEM - puppet last run on db2009 is CRITICAL: CRITICAL: puppet fail
[04:03:49] RECOVERY - puppet last run on db2009 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[04:55:50] did anyone stop hhvm on mw1114?
[05:06:40] !log reverted unexplained uncommitted modification of palladium:/srv/pybal-config/pybal/eqiad/api which repooled mw1189
[05:06:50] Logged the message, Master
[05:39:19] !log on mw1189 testing some URLs at a high rate, attempting to induce measurable memory leak
[05:39:30] Logged the message, Master
[05:40:21] (CR) Brian Wolff: "For reference, this was partially reverted in https://gerrit.wikimedia.org/r/#/c/168239/" [mediawiki-config] - https://gerrit.wikimedia.org/r/167731 (owner: Aaron Schulz)
[05:50:12] (PS1) Nikerabbit: Disable l10nupdate for the duration of CLDR 26 plural migration [puppet] - https://gerrit.wikimedia.org/r/168255
[06:28:21] PROBLEM - puppet last run on gallium is CRITICAL: CRITICAL: puppet fail
[06:28:28] PROBLEM - puppet last run on mw1177 is CRITICAL: CRITICAL: puppet fail
[06:28:48] PROBLEM - puppet last run on cp4004 is CRITICAL: CRITICAL: puppet fail
[06:28:58] PROBLEM - puppet last run on mw1011 is CRITICAL: CRITICAL: puppet fail
[06:29:18] PROBLEM - puppet last run on amslvs1 is CRITICAL: CRITICAL: puppet fail
[06:29:19] PROBLEM - puppet last run on cp4014 is CRITICAL: CRITICAL: puppet fail
[06:29:58] PROBLEM - puppet last run on db1015 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:30:11] PROBLEM - puppet last run on db1002 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:30:21] PROBLEM - puppet last run on mw1009 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:30:27] PROBLEM - puppet last run on db1040 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:30:39] PROBLEM - puppet last run on labsdb1003 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:30:57] PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:57] PROBLEM - puppet last run on db1021 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:57] PROBLEM - puppet last run on mw1052 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:31:17] PROBLEM - puppet last run on cp1056 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:18] PROBLEM - puppet last run on search1001 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:18] PROBLEM - puppet last run on lvs2004 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:27] PROBLEM - puppet last run on searchidx1001 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:27] PROBLEM - puppet last run on mw1092 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:27] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:27] PROBLEM - puppet last run on mw1166 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:28] PROBLEM - puppet last run on mw1211 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:28] PROBLEM - puppet last run on mw1144 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:28] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:31:29] PROBLEM - puppet last run on mw1042 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:31:37] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:39] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:31:48] PROBLEM - puppet last run on mw1118 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:57] PROBLEM - puppet last run on mw1065 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:58] PROBLEM - puppet last run on mw1123 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:35:20] RECOVERY - puppet last run on mw1123 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[06:45:10] RECOVERY - puppet last run on db1002 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[06:45:13] RECOVERY - puppet last run on mw1009 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[06:45:20] RECOVERY - puppet last run on db1040 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[06:45:31] RECOVERY - puppet last run on mw1118 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[06:45:41] RECOVERY - puppet last run on mw1065 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[06:45:51] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[06:45:53] RECOVERY - puppet last run on mw1052 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[06:46:01] RECOVERY - puppet last run on db1015 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[06:46:10] RECOVERY - puppet last run on cp1056 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[06:46:11] RECOVERY - puppet last run on search1001 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[06:46:11] RECOVERY - puppet last run on lvs2004 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[06:46:20] RECOVERY - puppet last run on mw1144 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[06:46:20] RECOVERY - puppet last run on mw1042 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[06:46:20] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[06:46:21] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[06:46:21] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[06:46:21] RECOVERY - puppet last run on amslvs1 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[06:46:30] RECOVERY - puppet last run on labsdb1003 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[06:46:40] RECOVERY - puppet last run on mw1177 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[06:46:51] RECOVERY - puppet last run on db1021 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[06:47:01] RECOVERY - puppet last run on mw1011 is OK: OK: Puppet is currently enabled, last run 1 seconds ago with 0 failures
[06:47:11] RECOVERY - puppet last run on mw1092 is OK: OK: Puppet is currently enabled, last run 60 seconds ago with 0 failures
[06:47:11] RECOVERY - puppet last run on mw1166 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[06:47:11] RECOVERY - puppet last run on mw1211 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[06:47:11] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[06:47:20] RECOVERY - puppet last run on searchidx1001 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[06:47:25] RECOVERY - puppet last run on cp4014 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[06:47:45] RECOVERY - puppet last run on gallium is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[06:48:00] RECOVERY - puppet last run on cp4004 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[07:14:26] RECOVERY - HHVM rendering on mw1114 is OK: HTTP OK: HTTP/1.1 200 OK - 67813 bytes in 0.314 second response time
[07:14:27] RECOVERY - Apache HTTP on mw1114 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.056 second response time
[07:25:42] <_joe_> !log restarted hhvm on mw1114, depooled the server
[07:25:48] Logged the message, Master
[07:50:56] (CR) Kaldari: "What is the "CLDR 26 plural migration"?" [puppet] - https://gerrit.wikimedia.org/r/168255 (owner: Nikerabbit)
[08:02:22] (CR) Nikerabbit: "From our draft announcement:" [puppet] - https://gerrit.wikimedia.org/r/168255 (owner: Nikerabbit)
[08:46:36] PROBLEM - DPKG on mw1114 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[08:46:58] <_joe_> ?
[08:47:36] RECOVERY - DPKG on mw1114 is OK: All packages OK
[09:30:14] (CR) Alexandros Kosiaris: [C: 2] Open up more zookeeper ports in ferm [puppet] - https://gerrit.wikimedia.org/r/168185 (owner: Ottomata)
[09:32:28] <_joe_> zookeeper
[09:36:25] it's for hadoop
[09:36:30] <_joe_> yes I know
[09:36:57] <_joe_> I still have PTSD from when I had to make the network team at WORK~1 to open ports for it
[09:37:08] <_joe_> (hadoop)
[09:37:29] <_joe_> it took them 4 iterations and 5 Change Advisory Boards to get it right
[09:37:50] <_joe_> so it's probably more PTSD about ITIL than ZK and hadoop themselves
[09:41:34] this week's lwn has an article about etcd and fleet
[09:43:11] _joe_: you had to deal with a Change Advisory board ?
[09:43:24] and you actually pulled through ?
[09:55:02] (PS1) Alexandros Kosiaris: Introduce uranium as ganglia server [puppet] - https://gerrit.wikimedia.org/r/168263
[10:00:21] (CR) Alexandros Kosiaris: [C: 2] Introduce uranium as ganglia server [puppet] - https://gerrit.wikimedia.org/r/168263 (owner: Alexandros Kosiaris)
[10:04:36] akosiaris: \o/ thanks
[10:05:38] PROBLEM - puppet last run on uranium is CRITICAL: CRITICAL: Puppet has 7 failures
[10:06:17] 7 puppet failures... sigh
[10:06:37] godog: :-) let's now see what's unpuppetized on nickel :-)
[10:06:53] heheh "will it converge?" on the "will it blend?" tune
[10:07:16] hmm... so those 7 puppet failures ?
[10:07:22] seems they are correct
[10:07:55] all seem to be about ganglia web views and fail because the top directory is missing
[10:07:59] simple enough ...
[10:09:13] _joe_: I'm going to take a stab at the Hiera for Wikitech thing now, and see how far I get...
[10:32:48] <_joe_> YuviPanda: if you need some help, I used to be a php programmer
[10:32:56] <_joe_> that is, before I chose life
[10:33:20] _joe_: :D Having a look at https://wikitech.wikimedia.org/wiki/User:Yuvipanda/Wikitech_hiera and telling me if I missed something glaringly obvious would be nice
[10:33:33] because I haven't actually used hiera yet in actuality, just read up on it...
[10:34:43] <_joe_> YuviPanda: so wikitech api would expose a yaml file to our backend, right?
[10:34:48] <_joe_> I can work on that
[10:34:49] _joe_: yup
[10:37:06] _joe_: OpenStackManager is just a massive pain to develop locally, though. we have a vagrant env but it barely works
[10:39:10] <_joe_> YuviPanda: one important thing would be that the api endpoint would expose the last modification date of the hiera data for the given project via a HEAD request
[10:39:27] ah, hmm. why?
[10:39:37] <_joe_> so that we don't need to hit the full api and/or do an http request for each run
[10:39:51] why? it's quite fast.
[10:39:59] <_joe_> hiera does _a_lot_ of lookups
[10:40:09] <_joe_> every lookup is separate from the others
[10:40:17] but we would download the YAML file only once, no?
[10:40:21] and cache that locally for each run
[10:40:38] <_joe_> when do I know it's changed?
[10:40:50] <_joe_> either we decide we re-download files every minute
[10:41:43] so the idea is that there's a wiki page with YAML in it, and we just download it and do lookups as usual. So you'll just get 'this is the YAML file for this project' rather than querying individual keys.
[10:42:08] so you could potentially just get it at the start of the run, without worrying about what was there before, and use it for each run
[10:42:11] <_joe_> hiera doesn't work like that
[10:42:41] <_joe_> puppet does a hiera lookup which is independent of the others for each run
[10:43:25] ah, hmm. so for prod hiera data, does that mean the entire YAML file is re-parsed for every lookup?
[10:43:30] <_joe_> no
[10:43:36] <_joe_> it's cached in-memory
[10:43:53] <_joe_> but for each run, it's stat() ed to see if the file has changes
[10:44:37] <_joe_> stat() is a relatively cheap syscall
[10:44:52] hmm, is it cached in the puppetmaster, or in the puppet agent on each machine?
[10:45:01] <_joe_> on the master
[10:45:05] <_joe_> hiera runs on the master
[10:45:08] ah, I see
[10:45:43] _joe_: hmm, if you're going to use the last modification date only for caching, we can just use the rev-id instead, which should work just as well
[10:45:49] <_joe_> yes
[10:45:52] <_joe_> yes!
[10:45:54] <_joe_> perfect
[10:46:06] that's already exposed, and is fairly quick
[10:46:09] <_joe_> btw, we probably need a completely new hiera backend for this
[10:46:21] <_joe_> YuviPanda: ok so we're set I'd say
[10:46:33] cool. I've a preliminary patch up, am testing.
[10:46:39] <_joe_> ok
[10:46:48] <_joe_> hiera will require some work I guess
[10:46:48] when that's fine, I'll just investigate ways to set the ACL in place, and then we can get it deployed next week.
[10:46:57] <_joe_> but you're making me happy :)
[10:47:04] heh :)
[10:47:28] shinkengen's host generation is working fairly well, and I've to spend some more time thinking about the services gen, so this is a nice distraction in the meantime
[11:52:44] (PS1) Alexandros Kosiaris: Remove pmtpa from deployment [puppet] - https://gerrit.wikimedia.org/r/168273
[11:56:03] ^d, Krinkle: it seems zuul is not running any jobs anymore. can someone help?
[12:00:12] jzerebecki: checking
[12:09:10] !log Zuul/Jenkins stuck. Tried various gearman/zuul resets. Restarting Jenkins now.
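Editor's sketch of the caching scheme agreed between 10:39 and 10:46: do a cheap revision-id check on every lookup (the remote analogue of the stat() call mentioned at 10:43:53) and download the project's YAML again only when the rev-id has changed. A real Hiera backend would be written in Ruby; Python is used here only to show the control flow. The class name and the two callables are hypothetical stand-ins for HTTP requests to the wiki API, whose actual endpoints are not specified in the log.

```python
from typing import Callable, Dict, Tuple


class RevIdCache:
    """Cache one YAML text per project, keyed on the wiki page's revision id."""

    def __init__(
        self,
        get_rev_id: Callable[[str], int],
        get_yaml: Callable[[str], str],
    ) -> None:
        # Both callables are assumptions: in practice they would wrap
        # HTTP requests to whatever API exposes the page and its rev-id.
        self._get_rev_id = get_rev_id
        self._get_yaml = get_yaml
        self._cache: Dict[str, Tuple[int, str]] = {}

    def lookup(self, project: str) -> str:
        """Return the YAML text for project, refetching only on a new rev-id."""
        rev_id = self._get_rev_id(project)  # cheap freshness check
        cached = self._cache.get(project)
        if cached is not None and cached[0] == rev_id:
            return cached[1]  # unchanged since last fetch
        text = self._get_yaml(project)  # expensive download, only on change
        self._cache[project] = (rev_id, text)
        return text
```

Because every Hiera lookup is independent, each call still pays for the rev-id check; the saving is that the full page is transferred and stored once per revision rather than once per lookup.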
[12:09:18] Logged the message, Master
[12:09:27] (CR) QChris: [C: 1] Remove all kraken references [puppet] - https://gerrit.wikimedia.org/r/168147 (owner: Ottomata)
[12:12:02] PROBLEM - jenkins_service_running on gallium is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/bin/java .*-jar /usr/share/jenkins/jenkins.war
[12:19:12] Krinkle: thank you
[12:19:44] Krinkle: btw, dead instances should be purged from graphite automatically now. Runs as a cron so might take a while
[12:20:43] there's also a bug / feature in txstatsd that might make them be re-created after being purged, though.
[12:20:49] also they're archived and not purged
[12:22:01] (PS2) Krinkle: Remove pmtpa from deployment role [puppet] - https://gerrit.wikimedia.org/r/168273 (owner: Alexandros Kosiaris)
[12:45:33] godog: https://commons.wikimedia.org/wiki/File:Esso_Standard_Oil_Company_commercial_1938.webm <-- where is the file? :) swift issue?
[12:46:03] https://upload.wikimedia.org/wikipedia/commons/9/97/Esso_Standard_Oil_Company_commercial_1938.webm WFM?
[12:46:32] RECOVERY - jenkins_service_running on gallium is OK: PROCS OK: 1 process with regex args ^/usr/bin/java .*-jar /usr/share/jenkins/jenkins.war
[12:46:44] !log killed left over java/jenkins process on gallium
[12:46:51] Logged the message, Master
[12:47:13] Reedy: i get a blank page
[12:47:37] On which url?
[12:47:37] then after a long time the player shows, but nothing is playable
[12:47:44] https://upload.wikimedia.org/wikipedia/commons/9/97/Esso_Standard_Oil_Company_commercial_1938.webm
[12:48:04] and the file page no thumbnail/player at all
[12:48:23] would a screenshot help Reedy ?
[12:48:28] matanya: curious, WFM
[12:48:37] i'll screenshot
[12:50:20] godog, Reedy : https://commons.wikimedia.org/wiki/File:Where_is_the_player.png
[12:50:46] matanya: I don't think that's a swift issue
[12:50:51] The physical file is there
[12:51:07] MW/extension is just doing something weird
[12:55:06] just to add, for what it's worth, it repeats the behaviour with new uploads and existing files.
[13:00:04] K4: Dear anthropoid, the time has come. Please deploy Fundraising (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141023T1300).
[13:16:51] PROBLEM - Disk space on mw1088 is CRITICAL: DISK CRITICAL - free space: / 7718 MB (3% inode=94%):
[13:23:04] Krinkle: is zuul being in queue only mode expected because someone is messing with stuff or did something break again?
[13:23:38] jzerebecki: it's in queue-only mode to reset the connection with jenkins (mostly collecting old garbage)
[13:23:47] it was already taking new jobs again without problems before I did that
[13:23:54] it seems the old ones aren't running though
[13:25:05] stopping now without queue preservation unfortunately
[13:25:07] and up again
[13:26:04] thx
[13:33:53] (CR) Ottomata: [C: 2 V: 2] Export $CLASSPATH from init scripts [debs/kafka] - https://gerrit.wikimedia.org/r/168230 (owner: Plucas)
[13:34:54] (CR) Ottomata: [C: 2] Add option for extra classpath entries [puppet/kafka] - https://gerrit.wikimedia.org/r/163890 (owner: Plucas)
[13:35:02] PROBLEM - puppet last run on mw1010 is CRITICAL: CRITICAL: puppet fail
[13:36:10] (PS2) Ottomata: Remove all kraken references [puppet] - https://gerrit.wikimedia.org/r/168147
[13:38:12] it feels like spring cleaning
[13:40:00] (CR) Ottomata: [C: 2] Remove all kraken references [puppet] - https://gerrit.wikimedia.org/r/168147 (owner: Ottomata)
[13:41:19] _joe_: are you happy with https://gerrit.wikimedia.org/r/#/c/167713/ now?
[13:46:22] <_joe_> andrewbogott: will look soon-ish
[13:51:08] cmjohnson: moornnniiin!
[13:51:48] good morning
[13:52:24] just checking in about new es node racking status
[13:53:46] RECOVERY - puppet last run on mw1010 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[13:54:16] nodes are racked but there are other things that need to be done. working on dns now. the disks all need to be replaced and they need to be cabled.
[13:58:01] mmk, cool
[13:58:04] thanks for the update!
[13:59:13] no problem. I should have them ready by end of day today
[13:59:25] (PS2) QChris: Require 2 ACKs from kafka brokers for bits caches [puppet] - https://gerrit.wikimedia.org/r/167552 (https://bugzilla.wikimedia.org/69667)
[14:11:25] ottomata:elastic1017-19 currently working? if so can you take them offline
[14:12:06] cmjohnson: 1019 is offline, you can remove it
[14:12:22] 1017 and 1018 are in the cluster, and we were hoping not to remove any more nodes at least until we got a few of the new ones up
[14:12:54] hrm...okay new servers are supposed to replace those..so I will leave them until after 1020+ are up
[14:13:41] <_joe_> andrewbogott: as long as you _promise_ to rename dashed classes, I am
[14:13:56] _joe_: I promise :)
[14:15:10] PROBLEM - check_puppetrun on db1025 is CRITICAL: CRITICAL: Puppet has 1 failures
[14:16:32] (CR) Giuseppe Lavagetto: [C: 1] "dashed-named-classes-must-burn-in-hell.pp But that can be corrected in another commit." [puppet] - https://gerrit.wikimedia.org/r/167713 (owner: Andrew Bogott)
[14:20:07] RECOVERY - check_puppetrun on db1025 is OK: OK: Puppet is currently enabled, last run 252 seconds ago with 0 failures
[14:23:36] (CR) Ottomata: [C: 2] Require 2 ACKs from kafka brokers for bits caches [puppet] - https://gerrit.wikimedia.org/r/167552 (https://bugzilla.wikimedia.org/69667) (owner: QChris)
[14:25:02] !log varnishkafka request.required.acks is now 2 for text, mobile, and bits.
[14:25:07] Logged the message, Master [14:25:24] PROBLEM - check if salt-minion is running on virt1006 is CRITICAL: Connection refused by host [14:43:30] Morning y'all. Is anyone up for SWAT? :) [14:44:14] RECOVERY - check if dhclient is running on virt1006 is OK: PROCS OK: 0 processes with command name dhclient [14:44:35] RECOVERY - DPKG on virt1006 is OK: All packages OK [14:44:57] RECOVERY - Disk space on virt1006 is OK: DISK OK [14:45:03] RECOVERY - check configured eth on virt1006 is OK: NRPE: Unable to read output [14:45:03] RECOVERY - RAID on virt1006 is OK: OK: Active: 16, Working: 16, Failed: 0, Spare: 0 [14:45:57] marktraceur: I may as well do it, since I have the only patches [14:49:10] <^d> anomie is cap'n'in the good ship swat this fine morn'? [14:49:48] ^d: Aye. But I thought "talk like a pirate day" was last month. [14:50:14] Yarr, evry day be talk like a pirate day if'n the right mood hit ye [14:50:20] <^d> i never talk like a pirate. i was trying something new. [14:50:22] (03PS1) 10Cmjohnson: Adding dns entries for elastic1020-1031 [dns] - 10https://gerrit.wikimedia.org/r/168297 [14:56:02] (03CR) 10Hoo man: [C: 031] "Looks good to me (assuming mediawiki-verp.wmflabs.org is set up correctly... the mx record looks good)." [puppet] - 10https://gerrit.wikimedia.org/r/168175 (owner: 1001tonythomas) [14:59:08] (03CR) 10Cmjohnson: [C: 032] Adding dns entries for elastic1020-1031 [dns] - 10https://gerrit.wikimedia.org/r/168297 (owner: 10Cmjohnson) [14:59:42] (03PS1) 10Ejegg: Turn off Special:HideBanners filter [puppet] - 10https://gerrit.wikimedia.org/r/168298 [15:00:04] manybubbles, anomie, ^d, marktraceur, anomie: Dear anthropoid, the time has come. Please deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141023T1500). [15:00:06] * anomie begins SWAT [15:00:42] anomie: You're first [15:03:12] Now wait for anomie to respond so he can test. 
[15:03:39] marktraceur: I already saw anomie talking in the channel, so I know he's here [15:03:55] Oh, good. [15:05:19] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Not arguing about the change per-se, but this should really be done using hiera." [puppet] - 10https://gerrit.wikimedia.org/r/168175 (owner: 1001tonythomas) [15:05:54] RECOVERY - puppet last run on virt1006 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [15:08:31] (03CR) 10Jgreen: [C: 032 V: 031] Turn off Special:HideBanners filter [puppet] - 10https://gerrit.wikimedia.org/r/168298 (owner: 10Ejegg) [15:11:12] !log anomie Synchronized php-1.25wmf4/includes/api/ApiFormatFeedWrapper.php: SWAT: Fix ApiFormatFeedWrapper [[gerrit:168128]] (duration: 00m 09s) [15:11:13] anomie: ^ test please [15:11:19] Logged the message, Master [15:11:23] anomie: Works! [15:11:32] (03PS2) 10Ejegg: RecordImpression log should not depend on qs order [puppet] - 10https://gerrit.wikimedia.org/r/168186 [15:11:45] * anomie does the second patch [15:13:10] (03PS2) 10Ejegg: Turn off Special:HideBanners filter [puppet] - 10https://gerrit.wikimedia.org/r/168298 [15:14:12] (03CR) 10Jgreen: [C: 032 V: 031] RecordImpression log should not depend on qs order [puppet] - 10https://gerrit.wikimedia.org/r/168186 (owner: 10Ejegg) [15:19:23] !log anomie Synchronized php-1.25wmf4/includes/api/ApiMain.php: SWAT: Include ApiMain construction in api.php try-catch block [[gerrit:168128]] (duration: 00m 09s) [15:19:29] Logged the message, Master [15:19:33] !log anomie Synchronized php-1.25wmf4/api.php: SWAT: Include ApiMain construction in api.php try-catch block [[gerrit:168296]] (duration: 00m 09s) [15:19:34] anomie: ^^^ ^ Test please [15:19:37] Logged the message, Master [15:19:49] anomie: Seems to work! 
[15:20:04] * anomie is done with a very self-referential SWAT [15:20:56] (03PS1) 10Jgreen: tweak fundraising banner log rotation, no longer collecting banner 'hide' hits [puppet] - 10https://gerrit.wikimedia.org/r/168301 [15:21:55] (03CR) 10Jgreen: [C: 032 V: 031] tweak fundraising banner log rotation, no longer collecting banner 'hide' hits [puppet] - 10https://gerrit.wikimedia.org/r/168301 (owner: 10Jgreen) [15:29:20] I love reading the morning swat scrollback [15:31:42] !log springle Synchronized wmf-config/db-eqiad.php: db1065 to normal load (duration: 00m 08s) [15:31:47] Logged the message, Master [15:32:32] (03PS1) 10Alexandros Kosiaris: Make ganglia::web trusty aware and puppetized [puppet] - 10https://gerrit.wikimedia.org/r/168302 [15:33:17] (03CR) 10Jgreen: [C: 032 V: 031] Added the beta hostname to local_domains to make exim use mw_verp_api [puppet] - 10https://gerrit.wikimedia.org/r/168175 (owner: 1001tonythomas) [15:35:40] (03CR) 10BryanDavis: "> Not arguing about the change per-se, but this should really be done using hiera." [puppet] - 10https://gerrit.wikimedia.org/r/168175 (owner: 1001tonythomas) [15:37:22] Jeff_Green: did the git pull ? [15:44:15] (03PS1) 10coren: Tool Labs: Gridengine puppetization, III [puppet] - 10https://gerrit.wikimedia.org/r/168306 [15:46:05] !log preparing to upgrade JunOS on cr2-ulsfo [15:46:11] Logged the message, Master [15:46:28] YuviPanda|zzz: Want to be the one to look over Gridengine the third? :-) ^^ [15:47:00] ottomata: hey [15:47:18] hiya [15:47:32] Snaps wants verification that the fix he prepared works [15:47:49] that's blocking 0.8.5, which in turn blocks an upload to Debian to reach the upcoming freeze [15:49:09] the fix we had him make for kafkatee? 
[15:49:22] not sure, it's a librdkafka fix [15:49:26] yes [15:49:29] can you check with him, he's online [15:49:34] well, ha, if he wants confirmation from me, it is likely that one :) [15:49:36] yeah, will do [15:49:45] we were going to wait a couple of weeks (he fixed this friday?) to make sure [15:49:51] 18:04 have you seen andrew online? [15:49:52] 18:05 need a verdict from him [15:50:15] qchris: ^ [15:51:09] The first few days of data looks good. [15:51:33] But the issue sometimes occurred some weeks after starting. [15:51:57] And the buggy version started fine too. [15:56:20] (03PS1) 10Rush: phab update php post_max_size to 10MB [puppet] - 10https://gerrit.wikimedia.org/r/168308 [15:56:24] I guess the fix is good enough for a release. [15:56:27] (03PS2) 10coren: Tool Labs: Gridengine puppetization, III [puppet] - 10https://gerrit.wikimedia.org/r/168306 [15:58:35] (03CR) 10Rush: [C: 032] phab update php post_max_size to 10MB [puppet] - 10https://gerrit.wikimedia.org/r/168308 (owner: 10Rush) [15:59:10] PROBLEM - Host cr2-ulsfo is DOWN: CRITICAL - Network Unreachable (198.35.26.193) [16:00:05] tgr: Dear anthropoid, the time has come. Please deploy ImageMetrics (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141023T1600). 
[16:00:19] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 51, down: 6, dormant: 0, excluded: 1, unused: 0; xe-0/0/0: down - cr2-ulsfo:xe-0/0/0 [10Gbps DF]; ae0.3: down - cr2-ulsfo:ae0.3; ae0.2: down - cr2-ulsfo:ae0.2; xe-1/0/0: down - cr2-ulsfo:xe1/0/0 [10Gbps DF]; ae0.32767: down - ; ae0: down - cr2-ulsfo:ae0 [16:01:19] (that's me, see !log above) [16:07:47] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 63, down: 0, dormant: 0, excluded: 1, unused: 0 [16:08:48] RECOVERY - Host cr2-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 73.88 ms [16:12:36] (03CR) 10Gergő Tisza: [C: 032] Add ImageMetrics extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/167727 (https://bugzilla.wikimedia.org/70402) (owner: 10Gergő Tisza) [16:12:53] (03Merged) 10jenkins-bot: Add ImageMetrics extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/167727 (https://bugzilla.wikimedia.org/70402) (owner: 10Gergő Tisza) [16:17:10] (03CR) 10BryanDavis: [C: 031] Disable l10nupdate for the duration of CLDR 26 plural migration [puppet] - 10https://gerrit.wikimedia.org/r/168255 (owner: 10Nikerabbit) [16:21:37] (03CR) 10coren: [C: 032] "More noop configuration building. Soon. Soon it shall be ready. Muahahaha!" [puppet] - 10https://gerrit.wikimedia.org/r/168306 (owner: 10coren) [16:23:00] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There are 2 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/). [16:29:17] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [16:35:59] !log tgr Started scap: Deploying ImageMetrics extension [16:36:06] Logged the message, Master [16:36:35] tgr: Break a leg etc. [16:36:45] Also, get comfy.
:) [16:37:24] * bd808 guesses 26 minutes [16:41:03] (03PS1) 10coren: Tool Labs: fixes for gridengine puppetization III [puppet] - 10https://gerrit.wikimedia.org/r/168319 [16:48:51] (03PS1) 10Glaisher: Add 'mergehistory' to transwiki group at itwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168322 (https://bugzilla.wikimedia.org/72422) [16:50:03] (03PS2) 10Glaisher: Add 'mergehistory' to transwiki group at itwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168322 (https://bugzilla.wikimedia.org/72422) [16:53:38] (03PS1) 10Gergő Tisza: Enable ImageMetrics extension on group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168326 [16:54:48] (03PS2) 10Gergő Tisza: Enable ImageMetrics extension on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168326 [16:54:55] !log cr2-ulsfo: upgrading junos again [16:55:02] Logged the message, Master [16:57:24] (03PS5) 10Gage: Enable GELF for MRAppManager part 2 [puppet] - 10https://gerrit.wikimedia.org/r/167044 [17:04:51] (03CR) 10coren: [C: 032] "Trivial fixes." [puppet] - 10https://gerrit.wikimedia.org/r/168319 (owner: 10coren) [17:05:16] greg-g: I'm a bit overtime with the deployment, scap is still running [17:07:45] there is no window this hour so no big deal I suppose? 
[17:07:59] yeah, you're ok tgr [17:08:03] !log tgr Finished scap: Deploying ImageMetrics extension (duration: 32m 04s) [17:08:10] Logged the message, Master [17:09:33] (03PS3) 10Gergő Tisza: Enable ImageMetrics extension on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168326 [17:10:49] (03PS4) 10Gergő Tisza: Enable ImageMetrics extension on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168326 [17:11:15] (03CR) 10Gergő Tisza: [C: 032] Enable ImageMetrics extension on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168326 (owner: 10Gergő Tisza) [17:11:18] PROBLEM - Host cr2-ulsfo is DOWN: CRITICAL - Network Unreachable (198.35.26.193) [17:11:22] (03Merged) 10jenkins-bot: Enable ImageMetrics extension on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168326 (owner: 10Gergő Tisza) [17:11:23] (me again) [17:11:44] bd808: off by 6 minutes [17:11:49] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 51, down: 6, dormant: 0, excluded: 1, unused: 0; xe-0/0/0: down - cr2-ulsfo:xe-0/0/0 [10Gbps DF]; ae0.3: down - cr2-ulsfo:ae0.3; ae0.2: down - cr2-ulsfo:ae0.2; xe-1/0/0: down - cr2-ulsfo:xe1/0/0 [10Gbps DF]; ae0.32767: down - ; ae0: down - cr2-ulsfo:ae0 [17:12:18] greg-g: :( I have a desire to look into why that I am supressing. [17:12:33] :) good [17:13:56] (03PS1) 10coren: Tool Labs: more minor fixes to gridengine [puppet] - 10https://gerrit.wikimedia.org/r/168328 [17:14:18] !log tgr Synchronized wmf-config/InitialiseSettings.php: Enable ImageMetrics on group0 (duration: 00m 05s) [17:14:24] Next on my project list: write a puppet compiler for/in labs.
[17:14:25] Logged the message, Master [17:14:48] RECOVERY - puppet last run on stat1001 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [17:14:49] (03CR) 10coren: [C: 032] Tool Labs: more minor fixes to gridengine [puppet] - 10https://gerrit.wikimedia.org/r/168328 (owner: 10coren) [17:17:44] (03PS1) 10Gergő Tisza: Enable ImageMetrics extension on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168329 [17:18:13] (03PS4) 10Andrew Bogott: Move openstack files and manifests into a module [puppet] - 10https://gerrit.wikimedia.org/r/167713 [17:19:16] (03PS1) 10Alexandros Kosiaris: ganglia plugin for openstreetmap sync delay [puppet] - 10https://gerrit.wikimedia.org/r/168330 [17:19:56] (03CR) 10jenkins-bot: [V: 04-1] ganglia plugin for openstreetmap sync delay [puppet] - 10https://gerrit.wikimedia.org/r/168330 (owner: 10Alexandros Kosiaris) [17:20:07] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 63, down: 0, dormant: 0, excluded: 1, unused: 0 [17:20:12] (03PS1) 10coren: Tool Labs: allow exec hosts to be submit hosts [puppet] - 10https://gerrit.wikimedia.org/r/168332 [17:21:08] RECOVERY - Host cr2-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 75.57 ms [17:21:31] (03CR) 10coren: [C: 032] "Trivial." [puppet] - 10https://gerrit.wikimedia.org/r/168332 (owner: 10coren) [17:23:22] who watched/worked on the ulsfo thing? [17:23:27] is that a planned maint? [17:23:44] it make some hay on our office connectivity to google [17:23:47] cajoel: you mean the ulsfo going offline? 
[17:23:49] (messed it up briefly) [17:23:50] yep [17:23:56] there have been emails about it to ops list i think [17:24:01] (03CR) 10Andrew Bogott: [C: 032] Move openstack files and manifests into a module [puppet] - 10https://gerrit.wikimedia.org/r/167713 (owner: 10Andrew Bogott) [17:24:09] greg-g: I am stuck, after adding the extension and scapping I enabled for group0 in InitialiseSettings.php and synced the file, but that seems to have no effect on mw.org [17:24:11] but the links there are either wonky or attached to hardware that has been wonky [17:24:17] the extension works fine on testwiki [17:24:28] I thought gage swapped em? [17:24:34] cajoel: and faidon is working on them prsently [17:24:38] its still ongoing [17:24:40] cajoel: I'm replying to you on the other channel [17:24:42] do I need an extra step after the sync-file? [17:24:43] kk [17:25:03] but we can move it here if you want [17:27:18] PROBLEM - puppet last run on virt1000 is CRITICAL: CRITICAL: puppet fail [17:28:08] (03PS2) 10Alexandros Kosiaris: ganglia plugin for openstreetmap sync delay [puppet] - 10https://gerrit.wikimedia.org/r/168330 [17:30:04] tgr: You shouldn't need anything else. [17:31:10] (03PS3) 10Alexandros Kosiaris: Remove pmtpa from deployment role [puppet] - 10https://gerrit.wikimedia.org/r/168273 [17:31:43] (03PS1) 10Andrew Bogott: s/backup::host/role::backup::host as per a recent change. [puppet] - 10https://gerrit.wikimedia.org/r/168334 [17:32:00] (03CR) 10Alexandros Kosiaris: [C: 032] Remove pmtpa from deployment role [puppet] - 10https://gerrit.wikimedia.org/r/168273 (owner: 10Alexandros Kosiaris) [17:32:07] (03CR) 10Alexandros Kosiaris: [C: 032] ganglia plugin for openstreetmap sync delay [puppet] - 10https://gerrit.wikimedia.org/r/168330 (owner: 10Alexandros Kosiaris) [17:33:13] (03CR) 10Andrew Bogott: [C: 032] s/backup::host/role::backup::host as per a recent change. 
[puppet] - 10https://gerrit.wikimedia.org/r/168334 (owner: 10Andrew Bogott) [17:35:39] RECOVERY - puppet last run on virt1000 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [17:36:18] PROBLEM - puppet last run on stat1001 is CRITICAL: CRITICAL: puppet fail [17:37:35] (03CR) 10Alexandros Kosiaris: [C: 032] Make ganglia::web trusty aware and puppetized [puppet] - 10https://gerrit.wikimedia.org/r/168302 (owner: 10Alexandros Kosiaris) [17:41:08] RECOVERY - puppet last run on uranium is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [17:42:31] marktraceur: see tgr's comment above [17:42:39] oh, bd808 already did [17:48:33] greg-g: so, I am bailing out from the deployment, everything is in clean state as far as I can see, the extension works on testwiki, it is enabled on group0 wikis but that does not seem to work (not active on mw.org) [17:50:58] PROBLEM - puppet last run on labsdb1006 is CRITICAL: CRITICAL: Puppet has 1 failures [17:51:14] (03PS4) 10RobH: give jsahleen access to bastion and private logs [puppet] - 10https://gerrit.wikimedia.org/r/167627 (owner: 10Dzahn) [17:52:37] Reedy: Wasn't there some file you needed to touch and sync if you changed dblist files? Could we need to do that for the group0.dblist that may not have been used to this point? [17:53:27] PROBLEM - puppet last run on labsdb1004 is CRITICAL: CRITICAL: Puppet has 1 failures [17:55:10] marktraceur: sorry? [17:55:20] (03PS6) 10Ottomata: Enable GELF for MRAppManager part 2 [puppet] - 10https://gerrit.wikimedia.org/r/167044 (owner: 10Gage) [17:55:46] Reedy: see tgr's messages re the extension not showing up on mw.org [17:55:47] Reedy: When you change a dblist, you have to sync the dblist, but don't you also have to sync CommonSettings.php or something? [17:55:59] Could that be necessary now that tgr is using group0.dblist? 
[17:56:09] (03CR) 10RobH: [C: 032] give jsahleen access to bastion and private logs [puppet] - 10https://gerrit.wikimedia.org/r/167627 (owner: 10Dzahn) [17:56:10] Is group0 actually defined as a tag? [17:56:12] I don't think it is [17:56:19] It's just used for administrivia [17:56:29] <^d> None of the group# are a dblist. [17:56:34] yeah, it's not [17:56:35] foreach ( array( 'private', 'fishbowl', 'special', 'closed', 'flaggedrevs', 'small', 'medium', [17:56:35] 'large', 'wikimania', 'wikidata', 'wikidataclient', 'mediaviewer', 'visualeditor', [17:56:35] 'visualeditor-default', 'echowikis', 'commonsuploads', 'nonbetafeatures' ) as $tag ) { [17:56:36] <^d> Or a tag based on dblist. [17:56:42] Aha. [17:56:48] So tgr needs to add it to that array [17:56:51] you can just add group0 there... [17:56:52] yeah [17:56:58] Because there's a group0.dblist file, it's just not indexed [17:56:59] BUT [17:57:05] <^d> Just what we need, more dblists. [17:57:07] It's too late, because the train leaves the station in three minutes [17:57:20] tgr: Read above? [17:57:54] ^d: group0 has been around a little while. I just used it for ease of changing the rest of the group [17:58:09] <^d> yeah yeah :p [17:58:16] Reedy: does 'group0' => true on the conf file cause any problems? [17:58:26] tgr: I don't think so, it just won't work [17:58:28] if not, I'll just leave it as is for now [17:58:42] I'm running into the next window anyway [17:58:49] it's my window [17:58:56] I've not branched or anything yet [17:59:04] So won't start deploying for 10-15 minutes [17:59:05] (03CR) 10Ottomata: [C: 032] Enable GELF for MRAppManager part 2 [puppet] - 10https://gerrit.wikimedia.org/r/167044 (owner: 10Gage) [17:59:15] ok, thanks, I'll finish then [17:59:28] I only need to do config changes at this point [18:00:04] Reedy, greg-g: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141023T1800). 
[18:00:58] greg-g: I am skipping the group0/group1 steps then, the extension works fine on testwiki [18:01:06] (03CR) 10Gergő Tisza: [C: 032] "Wheee." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168329 (owner: 10Gergő Tisza) [18:01:21] tgr: kk, sorry about the confusion [18:01:55] (03Merged) 10jenkins-bot: Enable ImageMetrics extension on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168329 (owner: 10Gergő Tisza) [18:02:10] (03PS1) 10coren: Tool Labs: final(?) fixes to gridengine III [puppet] - 10https://gerrit.wikimedia.org/r/168346 [18:03:08] !log tgr Synchronized wmf-config/InitialiseSettings.php: Enable ImageMetrics on all wikis (duration: 00m 05s) [18:03:13] Logged the message, Master [18:03:54] private wikis too? [18:04:24] (03PS1) 10Alexandros Kosiaris: Rename Openstreetmap's ganglia.py to osm.py [puppet] - 10https://gerrit.wikimedia.org/r/168348 [18:04:49] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Rename Openstreetmap's ganglia.py to osm.py [puppet] - 10https://gerrit.wikimedia.org/r/168348 (owner: 10Alexandros Kosiaris) [18:05:15] tgr: does this need to be installed on private wikis? or places like loginwiki? [18:05:58] (03CR) 10Ottomata: "Nuria, check this out, +1 it and I will merge it." [puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/167821 (owner: 10Mforns) [18:06:08] legoktm: it definitely doesn't *need* to [18:06:18] is it problematic, though? [18:06:28] RECOVERY - puppet last run on labsdb1006 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [18:06:48] RECOVERY - puppet last run on labsdb1004 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [18:07:10] well, logging things about users on private wikis is a bit iffy to me... 
[18:07:24] and on loginwiki, we typically disable any extension that doesn't *need* to be enabled there [18:07:28] I can push a fix to disable there if Reedy can spare another ten minutes [18:08:28] legoktm: where should it be disabled? private, vote, login? [18:08:51] that sounds good to me [18:09:42] (03PS1) 10Gergő Tisza: disable ImageMetrics on non-public wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168350 [18:10:09] (03CR) 10Gergő Tisza: [C: 032] disable ImageMetrics on non-public wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168350 (owner: 10Gergő Tisza) [18:10:16] * Coren "patiently" waits for jenkins to get to his changeset. [18:12:51] That is, not patiently at all but since there is no manual hand crank anywhere I gots to wait anyways. :-) [18:13:48] * ebernhardson glares at jenkins hoping it speeds things up [18:13:56] * Coren offers prime alfalfa to the hamsters running in Jenkin's wheels. [18:14:03] "Run faster, dammit!" [18:16:23] it does seem especially slow today [18:17:14] scrolling over zuul, it only "looks" like 2 actual jobs are running right now, but that screen might just be a lie [18:18:23] I once installed an actual, physical (toy) hand crank to a build server in a previous job; when people whined that their builds were slow I'd tell them they were welcome to help by cranking it up. That's also the place where I had the two labeled LARTs. :-) [18:18:40] well nothing has exploded, but i found that my grok filter is failing. tracking that down.. 
[18:18:43] oop [18:18:45] (03Merged) 10jenkins-bot: disable ImageMetrics on non-public wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168350 (owner: 10Gergő Tisza) [18:19:58] !log tgr Synchronized wmf-config/InitialiseSettings.php: Disable ImageMetrics on non-public wikis (duration: 00m 05s) [18:20:07] Logged the message, Master [18:20:10] legoktm: thanks for the warning [18:20:10] (03PS1) 10Andrew Bogott: Moved neutron- classes into a neutron subdir and renamed. [puppet] - 10https://gerrit.wikimedia.org/r/168353 [18:20:12] (03PS1) 10Andrew Bogott: Moved glance-service into glance subdir, renamed glance::service [puppet] - 10https://gerrit.wikimedia.org/r/168354 [18:20:23] Reedy: I'm donw, sorry for the delay [18:20:29] *done* [18:21:04] tgr: :) [18:25:23] * Coren pokes Jenkins. [18:25:36] (03CR) 10Dzahn: [C: 032] blog Apache config - remove SSLv3 [puppet] - 10https://gerrit.wikimedia.org/r/168132 (owner: 10Dzahn) [18:27:21] Hey look! Poking Jenkins worked! :-) [18:27:35] (03CR) 10coren: [C: 032] Tool Labs: final(?) fixes to gridengine III [puppet] - 10https://gerrit.wikimedia.org/r/168346 (owner: 10coren) [18:29:37] (03PS1) 10coren: Tool Labs: really final(??) gridengine fix [puppet] - 10https://gerrit.wikimedia.org/r/168359 [18:29:47] Oh bah, I had to still have another bug left. [18:30:48] always happens, it's triggered by the word final :) [18:31:15] The sad thing is that 99% of those bugs would have been caught by the puppet compiler. :-( [18:32:16] that last one looks like it wouldn't have though :) [18:32:22] content vs. config [18:33:10] It would have because config is mandatory and content doesn't exist, so "Invalid parameter content at /etc/puppet/modules/gridengine/manifests/collectors/queues.pp:11" [18:33:27] ah, i see.. nod [18:34:45] Also, puppetizing gridengine is a pain. At least I managed to make the manifests themselves relatively clean even if the underlying mechanism is a monstrous hack. 
[18:42:25] (03CR) 10Andrew Bogott: [C: 032] Moved neutron- classes into a neutron subdir and renamed. [puppet] - 10https://gerrit.wikimedia.org/r/168353 (owner: 10Andrew Bogott) [18:45:25] (03PS1) 10Reedy: Add symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168368 [18:45:27] (03PS1) 10Reedy: testwiki to 1.25wmf5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168369 [18:45:29] (03PS1) 10Reedy: wikipedias to 1.25wmf4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168370 [18:45:31] (03PS1) 10Reedy: group0 to 1.25wmf5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168371 [18:45:45] (03CR) 10Andrew Bogott: [C: 032] Moved glance-service into glance subdir, renamed glance::service [puppet] - 10https://gerrit.wikimedia.org/r/168354 (owner: 10Andrew Bogott) [18:46:04] !log reedy Started scap: testwiki to 1.25wmf5 [18:46:10] Logged the message, Master [18:46:11] (03CR) 10coren: [C: 032] Tool Labs: really final(??) gridengine fix [puppet] - 10https://gerrit.wikimedia.org/r/168359 (owner: 10coren) [18:48:09] PROBLEM - CI tmpfs disk space on lanthanum is CRITICAL: DISK CRITICAL - free space: /var/lib/jenkins-slave/tmpfs 27 MB (5% inode=99%): [18:48:56] (03PS1) 10Andrew Bogott: Move openstack::compute-service to openstack::compute::service [puppet] - 10https://gerrit.wikimedia.org/r/168376 [18:49:08] PROBLEM - puppet last run on virt1000 is CRITICAL: CRITICAL: puppet fail [18:49:32] (03PS1) 10KartikMistry: Beta: Remove pmtpa reference from comments [puppet] - 10https://gerrit.wikimedia.org/r/168377 [18:50:11] RECOVERY - CI tmpfs disk space on lanthanum is OK: DISK OK [18:53:28] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There are 0 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/). 
[18:53:57] lol [18:54:04] that error makes no sense :P [18:54:16] (03CR) 10Reedy: [C: 032] Add symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168368 (owner: 10Reedy) [18:54:25] (03Merged) 10jenkins-bot: Add symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168368 (owner: 10Reedy) [18:54:35] (03CR) 10Reedy: [C: 032] testwiki to 1.25wmf5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168369 (owner: 10Reedy) [18:54:42] (03Merged) 10jenkins-bot: testwiki to 1.25wmf5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168369 (owner: 10Reedy) [18:55:38] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [18:56:13] (03CR) 10Hashar: [C: 031] Beta: Remove pmtpa reference from comments [puppet] - 10https://gerrit.wikimedia.org/r/168377 (owner: 10KartikMistry) [19:01:36] (03Abandoned) 10Andrew Bogott: Move openstack::compute-service to openstack::compute::service [puppet] - 10https://gerrit.wikimedia.org/r/168376 (owner: 10Andrew Bogott) [19:10:11] (03PS1) 10coren: Tool Labs: More bugfixes to gridengine class [puppet] - 10https://gerrit.wikimedia.org/r/168384 [19:14:39] RECOVERY - puppet last run on stat1001 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [19:15:12] (03CR) 10coren: [C: 032] Tool Labs: More bugfixes to gridengine class [puppet] - 10https://gerrit.wikimedia.org/r/168384 (owner: 10coren) [19:18:59] !log reedy Finished scap: testwiki to 1.25wmf5 (duration: 32m 55s) [19:19:05] Logged the message, Master [19:19:40] bleugh [19:19:44] James_F: about? [19:20:04] Do you know what/where the "Ask your question" thing on testwiki comes from? 
[19:20:16] sounds like teahouse [19:20:18] oh, teahouse [19:20:19] yeah [19:20:38] Ask your question [19:20:41] stupid title :( [19:20:42] * aude not involved but heard about it [19:22:37] bug filed, thanks [19:22:41] sure [19:24:00] (03CR) 10Reedy: [C: 032] wikipedias to 1.25wmf4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168370 (owner: 10Reedy) [19:24:07] (03Merged) 10jenkins-bot: wikipedias to 1.25wmf4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168370 (owner: 10Reedy) [19:24:26] Hmm [19:24:38] Is it actually the extension.. That's not deployed is it? :/ [19:24:49] it's the extension deployed as a gadget [19:24:51] https://test.wikipedia.org/w/index.php?title=MediaWiki:Gadget-teahouse/teahouse.css&action=history [19:24:56] lol [19:24:56] scary [19:26:13] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikipedias to 1.25wmf4 [19:26:22] Logged the message, Master [19:26:58] (03PS1) 10coren: Tool Labs: remove unnecessary dependencies [puppet] - 10https://gerrit.wikimedia.org/r/168390 [19:28:20] Reedy: really? 
[19:28:38] (03CR) 10Reedy: [C: 032] group0 to 1.25wmf5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168371 (owner: 10Reedy) [19:28:45] (03Merged) 10jenkins-bot: group0 to 1.25wmf5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168371 (owner: 10Reedy) [19:29:09] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: group0 to 1.25wmf5 [19:29:17] Logged the message, Master [19:30:06] (03PS1) 10Rush: phab serve repos from /srv [puppet] - 10https://gerrit.wikimedia.org/r/168391 [19:30:18] (03CR) 10coren: [C: 032] "+trivial" [puppet] - 10https://gerrit.wikimedia.org/r/168390 (owner: 10coren) [19:36:09] PROBLEM - puppet last run on stat1001 is CRITICAL: CRITICAL: puppet fail [19:46:30] (03CR) 10Dzahn: [C: 032] Beta: Remove pmtpa reference from comments [puppet] - 10https://gerrit.wikimedia.org/r/168377 (owner: 10KartikMistry) [20:00:04] gwicke, cscott, subbu: Dear anthropoid, the time has come. Please deploy Parsoid (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141023T2000). [20:00:25] whaat? how did we get a thursday slot? cscott is that for ocg? [20:06:23] (03PS1) 10Andrew Bogott: File nova services in a 'nova' subdir and namespace. [puppet] - 10https://gerrit.wikimedia.org/r/168405 [20:06:32] (03PS1) 10coren: Tool Labs: final fixes to gridengine III [puppet] - 10https://gerrit.wikimedia.org/r/168406 [20:07:13] (03CR) 10jenkins-bot: [V: 04-1] Tool Labs: final fixes to gridengine III [puppet] - 10https://gerrit.wikimedia.org/r/168406 (owner: 10coren) [20:08:23] (03CR) 10Andrew Bogott: [C: 032] File nova services in a 'nova' subdir and namespace. [puppet] - 10https://gerrit.wikimedia.org/r/168405 (owner: 10Andrew Bogott) [20:09:20] (03PS2) 10coren: Tool Labs: final fixes to gridengine III [puppet] - 10https://gerrit.wikimedia.org/r/168406 [20:09:22] (03CR) 10Nuria: "Still, some more work to do. I believe wikimetrics user needs GRANTs on centralauth db. 
Please see create.pp" [puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/167821 (owner: 10Mforns) [20:11:02] (03PS1) 10Andrew Bogott: Rename some files so the puppet loader can find them [puppet] - 10https://gerrit.wikimedia.org/r/168408 [20:11:54] (03CR) 10Andrew Bogott: [C: 032] Rename some files so the puppet loader can find them [puppet] - 10https://gerrit.wikimedia.org/r/168408 (owner: 10Andrew Bogott) [20:14:18] RECOVERY - puppet last run on virt1000 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [20:16:33] (03PS1) 10Andrew Bogott: Move keystone-service to keystone::service [puppet] - 10https://gerrit.wikimedia.org/r/168411 [20:19:02] (03CR) 10Andrew Bogott: [C: 032] Move keystone-service to keystone::service [puppet] - 10https://gerrit.wikimedia.org/r/168411 (owner: 10Andrew Bogott) [20:28:25] subbu|busy, jouncebot: no, i don't know anything about a thursday slot. [20:29:32] subbu|busy: we have the slot assigned next thursday, too. weird. [20:30:30] Hi. Could you close RT #8717? The issue is now solved. Description pages prints the correct thumbnail image URL. [20:31:09] greg-g, do you know anything about the Thu slot? [20:31:43] (I would say it has been printed correctly URLs for 1 hour) [20:32:08] cscott, or, maybe i missed some email about changes to deploy windows .. i now vaguely have a memory of greg-g sending an email about upcoming changes to deployments. [20:32:21] subbu: we've apparently been given thursday slots every week since oct 9 [20:32:52] subbu: maybe it just got added by mistake that week and got copy-and-pasted to every week since them? [20:33:17] there was actually a parsoid deployment on thursday oct 9: https://www.mediawiki.org/wiki/Parsoid/Deployments#Thursday.2C_Oct_9.2C_2014_around_1pm_PST:_Y_Deployed_644071d2 [20:33:19] oh ... ha .. that maybe it. we did that deploy on thu oct 9 to fix something. 
[20:33:33] i added a slot that day after checking with greg-g [20:33:50] so, that is probably what happened. copied to every week since. [20:33:51] let's remove the windows from oct 23 and oct 30 and maybe that will fix it. [20:33:55] k [20:34:08] you want to do it, or shall i? [20:34:39] (we need better real-time collab features on wikis!) [20:35:30] just fyi.. mathoid is now in production [20:35:36] woot! [20:35:48] it doesn't use texvcjs yet though, does it? [20:35:56] it does, I hooked that up [20:36:01] woot! [20:36:04] you are also listed in the announcement [20:36:13] i feel famous ;) [20:36:25] credit where credit is due ;) [20:36:32] sml is dead, long live sml [20:37:13] sml? i thought it was ocaml. [20:37:16] or was it ocaml? one of those *ml [20:37:32] it was ocaml [20:37:43] yeah, that one. cscott === caml killer [20:41:30] that code was my first contact with functional programming back in 2004 [20:42:26] also, it's still alive; the PNG mode is still the default [20:42:35] but it's the beginning of the end [20:45:39] (03PS1) 10Manybubbles: Split cirrus's pool counters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168424 [20:46:32] (03CR) 10Manybubbles: "I'm not sure if it is a good idea to double the amount of in progress stuff but we should segment enwiki from other wikis if possible." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/168424 (owner: 10Manybubbles) [21:05:18] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail [21:10:18] RECOVERY - check_puppetrun on lutetium is OK: OK: Puppet is currently enabled, last run 120 seconds ago with 0 failures [21:10:48] PROBLEM - MySQL Replication Heartbeat on db1018 is CRITICAL: CRIT replication delay 348 seconds [21:10:59] PROBLEM - MySQL Slave Delay on db1018 is CRITICAL: CRIT replication delay 364 seconds [21:15:18] RECOVERY - puppet last run on stat1001 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [21:19:18] PROBLEM - ElasticSearch health check on logstash1003 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.136 [21:19:48] PROBLEM - ElasticSearch health check on logstash1002 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.137 [21:24:41] (03PS1) 10Andrew Bogott: Add virt1006 back to the compute pool. [puppet] - 10https://gerrit.wikimedia.org/r/168458 [21:26:19] RECOVERY - MySQL Slave Delay on db1018 is OK: OK replication delay 0 seconds [21:26:27] (03CR) 10Andrew Bogott: [C: 032] Add virt1006 back to the compute pool. [puppet] - 10https://gerrit.wikimedia.org/r/168458 (owner: 10Andrew Bogott) [21:27:05] https://integration.wikimedia.org/ci/view/Beta/job/beta-scap-eqiad/ blocked on waiting for deployment-bastion.eqiad [21:27:12] Any thoughts [21:27:12] ? 
[21:28:20] PROBLEM - ElasticSearch health check on logstash1002 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.137 [21:28:20] PROBLEM - ElasticSearch health check on logstash1001 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.138 [21:30:31] hashar: see marktraceur above [21:33:56] marktraceur: bah [21:34:34] Agreed [21:34:37] marktraceur: seems deployment-bastion.eqiad Jenkins slave is considered to no more have any executor slot [21:34:42] marktraceur: so no build is realized on it :/ [21:35:53] !log Jenkins: disconnected / reconnected slave node deployment-bastion.eqiad [21:36:01] Logged the message, Master [21:39:32] Yay [21:40:25] no :( [21:40:49] RECOVERY - MySQL Replication Heartbeat on db1018 is OK: OK replication delay -1 seconds [21:41:39] (03CR) 10Ottomata: "Marcell and I just discussed this. The vagrant wikimetrics role class needs to require ::role::centralauth, and it needs to use the value" [puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/167821 (owner: 10Mforns) [21:41:53] No? 
[21:43:12] marktraceur: yeah there is no slot available on that slave :( [21:43:21] PROBLEM - puppet last run on virt1006 is CRITICAL: CRITICAL: Puppet has 9 failures [21:43:28] !log Jenkins: disabling / reenabling Gearman plugin [21:43:37] Logged the message, Master [21:43:47] Curses [21:46:30] !log springle Synchronized wmf-config/db-eqiad.php: depool db1018 (duration: 00m 06s) [21:46:37] Logged the message, Master [21:49:11] (03PS1) 10BBlack: vhtcpd (0.0.10-3) unstable; urgency=low [software/varnish/vhtcpd] (debian) - 10https://gerrit.wikimedia.org/r/168494 [21:49:43] (03CR) 10BBlack: [C: 032 V: 032] vhtcpd (0.0.10-3) unstable; urgency=low [software/varnish/vhtcpd] (debian) - 10https://gerrit.wikimedia.org/r/168494 (owner: 10BBlack) [21:50:31] RECOVERY - puppet last run on virt1006 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [21:54:23] !log Jenkins the Gearman plugin is holding a lock on deployment-bastion slave that prevents it from running any job :-/ [21:54:30] Logged the message, Master [21:56:30] PROBLEM - puppet last run on stat1001 is CRITICAL: CRITICAL: puppet fail [21:57:54] marktraceur: it is broken somehow :-/ Definitely an issue in the Jenkins Gearman plugin [21:58:07] Hrmph [21:58:08] OK [21:58:17] I'll assume my patch is going to work and test it on testwiki. [22:04:41] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [22:06:34] (03CR) 10coren: "♪ One more time / We're gonna celebrate ♪" [puppet] - 10https://gerrit.wikimedia.org/r/168406 (owner: 10coren) [22:06:43] (03CR) 10coren: [C: 032] "♪ One more time / We're gonna celebrate ♪" [puppet] - 10https://gerrit.wikimedia.org/r/168406 (owner: 10coren) [22:11:36] * Coren cries. 
[22:15:20] (03PS1) 10coren: Tool Labs: rm obsolete gridengine::resource param [puppet] - 10https://gerrit.wikimedia.org/r/168497 [22:18:00] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [22:18:18] (03CR) 10coren: [C: 032] Tool Labs: rm obsolete gridengine::resource param [puppet] - 10https://gerrit.wikimedia.org/r/168497 (owner: 10coren) [22:28:29] !log preparing jenkins for restart [22:28:36] marktraceur: gotta restart Jenkins out of luck [22:28:37] Logged the message, Master [22:29:09] (03PS1) 10Calak: Add "templateeditor" user group to $wgRestrictionLevels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168498 (https://bugzilla.wikimedia.org/72146) [22:31:28] (03PS2) 10Calak: Add "templateeditor" user group to $wgRestrictionLevels on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168498 (https://bugzilla.wikimedia.org/72146) [22:35:59] !log Jenkins restarting [22:36:09] Logged the message, Master [22:38:33] (03PS1) 10BBlack: Remove debian/ directory from master branch [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/168504 [22:39:21] (03CR) 10BBlack: [C: 032 V: 032] Remove debian/ directory from master branch [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/168504 (owner: 10BBlack) [22:39:52] (03PS1) 10BBlack: Merge tag '1.0.3' into debian [software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/168505 [22:41:11] (03CR) 10BBlack: [C: 032 V: 032] Merge tag '1.0.3' into debian [software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/168505 (owner: 10BBlack) [22:41:50] RECOVERY - ElasticSearch health check on logstash1002 is OK: OK - elasticsearch (production-logstash-eqiad) is running. 
status: green: timed_out: false: number_of_nodes: 3: number_of_data_nodes: 3: active_primary_shards: 41: active_shards: 113: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [22:42:19] RECOVERY - ElasticSearch health check on logstash1001 is OK: OK - elasticsearch (production-logstash-eqiad) is running. status: green: timed_out: false: number_of_nodes: 3: number_of_data_nodes: 3: active_primary_shards: 41: active_shards: 113: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [22:42:40] RECOVERY - ElasticSearch health check on logstash1003 is OK: OK - elasticsearch (production-logstash-eqiad) is running. status: green: timed_out: false: number_of_nodes: 3: number_of_data_nodes: 3: active_primary_shards: 41: active_shards: 113: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [22:42:50] !log Jenkins is all good now. [22:42:58] Logged the message, Master [22:43:22] marktraceur: solved now :/ [22:43:52] crashing to bed, see you tomorrow [22:46:38] (03PS1) 10coren: Tool Labs: Tweaks to gridengine puppetization [puppet] - 10https://gerrit.wikimedia.org/r/168507 [22:47:14] (03PS1) 10BBlack: varnishkafka (1.0.3-1) precise; urgency=low [software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/168508 [22:47:31] (03CR) 10coren: [C: 032] Tool Labs: Tweaks to gridengine puppetization [puppet] - 10https://gerrit.wikimedia.org/r/168507 (owner: 10coren) [22:48:30] (03PS1) 1020after4: preamble script to read client address from HTTP_X_FORWARDED_FOR [puppet] - 10https://gerrit.wikimedia.org/r/168509 [22:48:36] (03CR) 10jenkins-bot: [V: 04-1] preamble script to read client address from HTTP_X_FORWARDED_FOR [puppet] - 10https://gerrit.wikimedia.org/r/168509 (owner: 1020after4) [22:48:39] (03CR) 10BBlack: [C: 032 V: 032] varnishkafka (1.0.3-1) precise; urgency=low [software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/168508 (owner: 10BBlack) [22:49:23] (03PS2) 1020after4: preamble script to 
read client address from HTTP_X_FORWARDED_FOR [puppet] - 10https://gerrit.wikimedia.org/r/168509 [22:50:03] (03CR) 10jenkins-bot: [V: 04-1] preamble script to read client address from HTTP_X_FORWARDED_FOR [puppet] - 10https://gerrit.wikimedia.org/r/168509 (owner: 1020after4) [22:50:54] (03PS3) 1020after4: preamble script to read client address from HTTP_X_FORWARDED_FOR [puppet] - 10https://gerrit.wikimedia.org/r/168509 [22:51:00] (03CR) 10jenkins-bot: [V: 04-1] preamble script to read client address from HTTP_X_FORWARDED_FOR [puppet] - 10https://gerrit.wikimedia.org/r/168509 (owner: 1020after4) [22:51:05] (03PS4) 1020after4: preamble script to read client address from HTTP_X_FORWARDED_FOR [puppet] - 10https://gerrit.wikimedia.org/r/168509 [22:51:47] (03CR) 10jenkins-bot: [V: 04-1] preamble script to read client address from HTTP_X_FORWARDED_FOR [puppet] - 10https://gerrit.wikimedia.org/r/168509 (owner: 1020after4) [22:52:08] (03PS5) 1020after4: preamble script to read client address from HTTP_X_FORWARDED_FOR [puppet] - 10https://gerrit.wikimedia.org/r/168509 [22:52:13] (03CR) 10jenkins-bot: [V: 04-1] preamble script to read client address from HTTP_X_FORWARDED_FOR [puppet] - 10https://gerrit.wikimedia.org/r/168509 (owner: 1020after4) [22:52:16] (03PS6) 1020after4: preamble script to read client address from HTTP_X_FORWARDED_FOR [puppet] - 10https://gerrit.wikimedia.org/r/168509 [22:57:32] I'm going to wind up doing SWAT I guess [22:57:50] A la Brad. [22:58:44] oh crap [22:58:56] * legoktm puts patches on the swat list [23:00:04] RoanKattouw, ^d, marktraceur, MaxSem, marktraceur: Respected human, time to deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141023T2300). Please do the needful. [23:00:59] Yeeah [23:01:04] Can I add things too? [23:01:10] I just added 2 [23:01:11] Well fine! 
[23:01:12] I volunteer to do the entire SWAT if that helps [23:01:19] RoanKattouw: How can I argue with that [23:01:26] * marktraceur is still here to supervise his patches [23:14:34] RoanKattouw: How's it coming? [23:15:09] I got distracted sorry [23:15:14] I will start doing you guys' patches now [23:15:19] My patch may or may not actually happen [23:15:29] OK then :) [23:16:43] In fact it will not happen [23:17:03] Aw. [23:20:26] aaand now we wait. [23:27:46] !log catrope Synchronized php-1.25wmf4/includes/api/ApiFormatBase.php: SWAT (duration: 00m 04s) [23:27:56] Logged the message, Master [23:28:12] !log catrope Synchronized php-1.25wmf4/extensions/TimedMediaHandler/: SWAT (duration: 00m 04s) [23:28:18] Logged the message, Master [23:28:48] !log catrope Synchronized php-1.25wmf4/extensions/UploadWizard/: SWAT (duration: 00m 04s) [23:28:53] Logged the message, Master [23:29:33] !log catrope Synchronized php-1.25wmf5/extensions/UploadWizard/: SWAT (duration: 00m 06s) [23:29:39] Logged the message, Master [23:29:45] !log catrope Synchronized php-1.25wmf5/extensions/TimedMediaHandler/: SWAT (duration: 00m 04s) [23:29:50] Logged the message, Master [23:30:34] Trying to find a video to test with [23:30:53] https://commons.wikimedia.org/wiki/File:Capturing-structure-and-function-in-an-embryonic-heart-with-biophotonic-tools-Movie1.ogv [23:31:35] ty Reedy [23:31:45] marktraceur: Those are yours done ---^^ [23:31:53] Now doing patches for legoktm's [23:32:01] Yup, testing [23:32:09] TMH looks friggin' beautifule [23:32:14] beautiful [23:33:07] UW looking very sexy too [23:33:14] gj RoanKattouw [23:33:24] Commons thanks you [23:38:20] !log catrope Synchronized php-1.25wmf4/extensions/AntiSpoof/: SWAT (duration: 00m 06s) [23:38:28] Logged the message, Master [23:39:24] !log catrope Synchronized php-1.25wmf4/extensions/CentralAuth/: SWAT (duration: 00m 04s) [23:39:33] Logged the message, Master [23:39:40] legoktm: OK done please verify [23:39:45] I'll be in a meeting but 
will respond to pings [23:40:04] RoanKattouw: it's just a maint script, so I'll assume the code got synced out properly [23:41:05] thanks though! [23:50:57] bd808: you around? [23:51:19] dr0ptp4kt: for a few more minutes. What's up? [23:51:41] bd808: wanted to quickly ask you about local wikis in vagrant and wgLocalDatabases in prod [23:51:54] bd808: able to do a video call real quickly? [23:52:23] We can try. I have to leave for dinner in ~15 minutes [23:52:35] bd808: cool, ok, i'll call on ghangout