[00:00:07] (PS1) Dzahn: add a README.md for module diamond [operations/puppet] - https://gerrit.wikimedia.org/r/132129
[00:03:49] (CR) Dzahn: [C: 2] add a README.md for module diamond [operations/puppet] - https://gerrit.wikimedia.org/r/132129 (owner: Dzahn)
[00:21:01] (PS1) Dzahn: role class for diamond, move generic into init [operations/puppet] - https://gerrit.wikimedia.org/r/132131
[00:21:50] (PS2) Dzahn: role class for diamond, move generic into init [operations/puppet] - https://gerrit.wikimedia.org/r/132131
[00:21:51] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 5 below the confidence bounds
[00:34:34] mutante: Can you quickly push https://gerrit.wikimedia.org/r/132128 ? Jenkins doesn't seem motivated to do it atm
[00:38:01] (CR) Dzahn: [C: 2] Revert "include admins::mortals on osmium, to allow MediaWiki deployments" [operations/puppet] - https://gerrit.wikimedia.org/r/132128 (owner: Hoo man)
[00:38:03] hoo: oh, didn't realize it wasn't, yep
[00:38:37] thx :)
[00:39:46] it's about the order of things, jenkins has to come first, then the human
[00:43:39] (PS1) Dr0ptp4kt: Prepare log for zeromemcache state change logging. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/132134
[00:54:51] PROBLEM - Puppet freshness on analytics1026 is CRITICAL: Last successful Puppet run was Tue May 6 08:27:45 2014
[00:55:41] (PS1) Dzahn: make role/coredb more readable [operations/puppet] - https://gerrit.wikimedia.org/r/132135
[01:11:48] (PS1) Dzahn: delete mwlib.pp? (pediapress) move to pdf/ocg? [operations/puppet] - https://gerrit.wikimedia.org/r/132136
[01:22:51] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected
[01:49:10] (PS2) Springle: make role/coredb more readable [operations/puppet] - https://gerrit.wikimedia.org/r/132135 (owner: Dzahn)
[01:51:43] (CR) Springle: [C: 2] "Merging this now so I can watch for puppet-run fallout :)" [operations/puppet] - https://gerrit.wikimedia.org/r/132135 (owner: Dzahn)
[02:11:21] PROBLEM - Disk space on virt0 is CRITICAL: DISK CRITICAL - free space: /a 3788 MB (3% inode=99%):
[02:18:21] PROBLEM - Disk space on virt0 is CRITICAL: DISK CRITICAL - free space: /a 3430 MB (3% inode=99%):
[02:29:20] !log LocalisationUpdate completed (1.24wmf2) at 2014-05-08 02:28:17+00:00
[02:29:29] Logged the message, Master
[02:35:51] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 14 data above and 9 below the confidence bounds
[02:39:51] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 14 data above and 9 below the confidence bounds
[02:57:06] !log LocalisationUpdate completed (1.24wmf3) at 2014-05-08 02:56:02+00:00
[02:57:13] Logged the message, Master
[03:00:21] RECOVERY - Disk space on virt0 is OK: DISK OK
[03:08:55] andrewbogott_afk: Thanks for the staff photo uploads. :-)
[03:44:45] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu May 8 03:43:39 UTC 2014 (duration 43m 38s)
[03:44:53] Logged the message, Master
[03:55:51] PROBLEM - Puppet freshness on analytics1026 is CRITICAL: Last successful Puppet run was Tue May 6 08:27:45 2014
[04:37:51] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 8 below the confidence bounds
[05:12:51] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected
[05:21:51] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 9 below the confidence bounds
[06:00:20] (PS2) Springle: Useful to have one slave CNAME for m1, m2, and x1 shards [operations/dns] - https://gerrit.wikimedia.org/r/131421
[06:00:38] (CR) Springle: [C: 2] Useful to have one slave CNAME for m1, m2, and x1 shards [operations/dns] - https://gerrit.wikimedia.org/r/131421 (owner: Springle)
[06:39:51] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 3 below the confidence bounds
[06:56:51] PROBLEM - Puppet freshness on analytics1026 is CRITICAL: Last successful Puppet run was Tue May 6 08:27:45 2014
[07:07:51] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 3 below the confidence bounds
[07:21:31] PROBLEM - HTTP on carbon is CRITICAL: Connection refused
[07:23:31] RECOVERY - HTTP on carbon is OK: HTTP OK: HTTP/1.1 200 OK - 232 bytes in 0.002 second response time
[07:26:04] (PS2) Faidon Liambotis: install-server: enable IPv6 on lighttpd [operations/puppet] - https://gerrit.wikimedia.org/r/130856 (owner: Ottomata)
[07:26:31] PROBLEM - HTTP on carbon is CRITICAL: Connection refused
[07:29:31] RECOVERY - HTTP on carbon is OK: HTTP OK: HTTP/1.1 200 OK - 232 bytes in 0.002 second response time
[07:29:45] (PS3) Faidon Liambotis: install-server: enable IPv6 on lighttpd [operations/puppet] - https://gerrit.wikimedia.org/r/130856 (owner: Ottomata)
[07:30:24] (CR) Faidon Liambotis: [C: 2] install-server: enable IPv6 on lighttpd [operations/puppet] - https://gerrit.wikimedia.org/r/130856 (owner: Ottomata)
[07:31:22] (CR) Faidon Liambotis: [V: 2] install-server: enable IPv6 on lighttpd [operations/puppet] - https://gerrit.wikimedia.org/r/130856 (owner: Ottomata)
[07:32:38] (CR) Gilles: [C: 1] Throttle GWToolset uploads [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/132112 (owner: Gergő Tisza)
[07:32:56] Thanks paravoid
[07:45:47] (PS5) Giuseppe Lavagetto: Adding ability to compute change-based diffs. [operations/software] - https://gerrit.wikimedia.org/r/131495
[07:49:52] (CR) Giuseppe Lavagetto: [C: 2] Adding ability to compute change-based diffs. [operations/software] - https://gerrit.wikimedia.org/r/131495 (owner: Giuseppe Lavagetto)
[07:52:42] (PS6) Giuseppe Lavagetto: Move cluster definition to the node level. [operations/puppet] - https://gerrit.wikimedia.org/r/130591
[07:53:27] (CR) Giuseppe Lavagetto: [C: 2] "All compilation tests show no difference." [operations/puppet] - https://gerrit.wikimedia.org/r/130591 (owner: Giuseppe Lavagetto)
[08:01:05] !log reedy updated /a/common to {{Gerrit|I7f2d2b25d}}: Allow all users on OfficeWiki to send mass messages
[08:01:12] Logged the message, Master
[08:03:19] (PS1) Reedy: Add/update symlinks [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/132152
[08:03:21] (PS1) Reedy: testwiki to 1.24wmf4 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/132153
[08:03:23] (PS1) Reedy: Wikipedias to 1.24wmf3 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/132154
[08:03:25] (PS1) Reedy: group0 to 1.24wmf4 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/132155
[08:03:45] (CR) Reedy: [C: 2] Add/update symlinks [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/132152 (owner: Reedy)
[08:03:52] (Merged) jenkins-bot: Add/update symlinks [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/132152 (owner: Reedy)
[08:08:30] That's an early deployment.
[08:10:18] I'm driving for the next 6 hours or so
[08:10:26] So gets the prep work done
[08:18:50] !log reedy synchronized php-1.24wmf4 'staging'
[08:18:58] Logged the message, Master
[08:22:30] Reedy: what is the path of mw code on tin ?
[08:27:51] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 7.14% of data exceeded the critical threshold [500.0]
[08:32:53] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected
[08:33:18] mutante: /a/common
[08:33:23] matanya: : /a/common
[08:33:37] thanks Reedy
[08:34:17] Reedy: the question came from: i want to read a cron job that will show what is in fact deployed in producation
[08:34:46] running git commands on that path and exporting it should work, i guess
[08:37:37] !log reedy synchronized docroot and w
[08:37:37] Logged the message, Master
[08:37:38] matanya: reedy@tin:/a/common$ /a/common/multiversion/activeMWVersions
[08:37:38] 1.24wmf2 1.24wmf3
[08:38:56] Reedy: i mean specific commits . e.g : https://gerrit.wikimedia.org/r/#/c/131011/
[08:39:10] Ah
[08:39:11] how can one know if that commit is in fact on prod
[08:39:22] the only way is to find the backport
[08:39:35] I think you can do it with a git one liner
[08:39:49] what would be the line ?
[08:39:53] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% data above the threshold [250.0]
[08:40:21] so i thought about a cron job that runs and shows what commits are in fact deployed
[08:40:23] git show hash?
[08:40:55] We do have https://www.mediawiki.org/wiki/MediaWiki_1.24/wmf3/Changelog
[08:41:00] But it is only updated manually currently
[08:41:16] yes
[08:41:33] greg said it is useless once the version is out
[08:43:37] Why is it?
[08:43:49] I tend to update the older versions when I branch the next version
[08:44:10] (change visibility) 15:36, 1 May 2014 (diff | hist) . . (+21,754) . . N MediaWiki 1.24/wmf3/Changelog (Update changelog for wmf/1.24wmf3) (current)
[08:44:10] (change visibility) 15:36, 1 May 2014 (diff | hist) . . (+4,427) . . MediaWiki 1.24/wmf2/Changelog (Update changelog for wmf/1.24wmf2) (current)
[08:44:10] (change visibility) 15:36, 1 May 2014 (diff | hist) . . (+2,263) . . MediaWiki 1.24/wmf1/Changelog (Update changelog for wmf/1.24wmf1) (current)
[08:49:02] fetch
[08:49:05] checkout branch
[08:49:06] pull
[08:49:10] update all submodules
[08:49:16] run upload changelog script
[08:49:19] repeat for other branches
[08:50:41] <_joe_> matanya: https://wikitech.wikimedia.org/wiki/Puppet_migration
[08:55:59] !log installed db106[45]
[08:56:06] Logged the message, Master
[09:22:19] (PS1) Springle: deploy db1064 to s4, db1065 to s1 [operations/puppet] - https://gerrit.wikimedia.org/r/132171
[09:25:04] (CR) Springle: [C: 2] deploy db1064 to s4, db1065 to s1 [operations/puppet] - https://gerrit.wikimedia.org/r/132171 (owner: Springle)
[09:28:24] (CR) Giuseppe Lavagetto: [C: 2] "It seems we will use this class in the future, so merging." [operations/puppet] - https://gerrit.wikimedia.org/r/120518 (owner: Matanya)
[09:28:32] (PS4) Giuseppe Lavagetto: mha: fix var scope search [operations/puppet] - https://gerrit.wikimedia.org/r/120518 (owner: Matanya)
[09:28:45] Reedy: sorry, was away. greg said it is useless because some changes get reverted and some get backported
[09:32:27] nice job _joe_
[09:33:07] <_joe_> matanya: I'm going to iron out the last few *big* issues than we should move to fixing templates
[09:33:25] sure, poke when you need me
[09:33:26] <_joe_> it's going to be painful and slow but it shouldn't be a showstopper
[09:33:40] <_joe_> meaning we can do that once we've migrated
[09:34:22] <_joe_> one thing I did not think through is - how do we treat collected resources? I'm not sure if puppetdb is compatible between puppet 2 and puppet 3
[09:34:36] <_joe_> and I don't see anything on the net on this
[09:34:44] hmmm
[09:34:48] good question
[09:35:04] PROBLEM - mysqld processes on db1065 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[09:35:07] _joe_: it is not
[09:35:14] PROBLEM - mysqld processes on db1064 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[09:35:15] we can probably look at the activerecord to see how many migrations there are
[09:35:21] we did that migration, wasn't fun
[09:35:27] !log xtrabackup clone db1049 to db1064
[09:35:34] Logged the message, Master
[09:35:54] <_joe_> matanya: ouch THAT is going to be a problem
[09:36:04] ACKNOWLEDGEMENT - mysqld processes on db1064 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld Sean Pringle Cloning... - The acknowledgement expires at: 2014-05-09 11:35:47.
[09:36:39] ACKNOWLEDGEMENT - mysqld processes on db1065 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld Sean Pringle Cloning... - The acknowledgement expires at: 2014-05-09 15:36:24.
[09:37:56] _joe_: http://docs.puppetlabs.com/puppetdb/2.0/migrate.html <-- this helps ?
[09:38:39] <_joe_> matanya: my problem is - we will be running some hosts on puppet 3 and some on puppet 2.7 for some time
[09:38:55] <_joe_> and we need the collected resources to be visible to both
[09:39:03] oh, that i didn't testg
[09:39:15] we migrated all hosts at once
[09:39:26] <_joe_> we can't do that here I think
[09:39:36] why?
[09:39:40] hardy ?
[09:40:32] <_joe_> matanya: 1000's of hosts? the horrible downtime we may experience?
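On the earlier question (08:38 to 08:40) of how to tell whether a specific commit is on a deployed branch, a minimal sketch follows. It assumes a local git checkout of the deployed code; the helper names `is_ancestor` and `commits_with_change_id` and the injectable `run` parameter are hypothetical, introduced only for illustration. Since backports are cherry-picks that get new hashes, an ancestry check alone can miss them, so the second helper searches commit messages for the Gerrit Change-Id instead.

```python
import subprocess


def is_ancestor(repo, commit, ref, run=subprocess.run):
    """Return True if `commit` is reachable from `ref` in the checkout at `repo`.

    Uses `git merge-base --is-ancestor`, which exits 0 for yes and 1 for no.
    """
    result = run(
        ["git", "-C", repo, "merge-base", "--is-ancestor", commit, ref],
        capture_output=True,
        text=True,
    )
    if result.returncode == 0:
        return True
    if result.returncode == 1:
        return False
    # Any other exit status means git itself failed (bad ref, not a repo, ...).
    raise RuntimeError(result.stderr.strip() or "git merge-base failed")


def commits_with_change_id(repo, change_id, ref, run=subprocess.run):
    """Return hashes reachable from `ref` whose message has this Change-Id line.

    This also finds cherry-picked backports, which keep the Change-Id
    footer even though their commit hash differs from the original.
    """
    result = run(
        ["git", "-C", repo, "log", "--format=%H",
         "--grep", f"^Change-Id: {change_id}$", ref],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.split()
```

A cron job along the lines discussed could call these against each active branch checkout and publish the result; which paths and refs to use is left open here because the log does not state them.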
[09:40:54] <_joe_> matanya: we will probably migrate 'canary' hosts first
[09:41:24] oh, forgot, no "downtime" scheduled on the site :)
[09:41:32] glad we can do that
[09:53:15] (PS1) Odder: Set $wgCategoryCollation to 'uca-lv' on lvwiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/132173 (https://bugzilla.wikimedia.org/65003)
[09:57:34] PROBLEM - Puppet freshness on analytics1026 is CRITICAL: Last successful Puppet run was Tue May 6 08:27:45 2014
[10:04:43] ughm.
[10:04:55] We are using a redirect as a logo for zhwikivoyage.
[10:05:00] I wonder how much it hurts.
[10:14:54] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 9 below the confidence bounds
[10:16:18] !log xtrabackup clone db1051 to db1065
[10:16:25] Logged the message, Master
[10:18:58] !log springle synchronized wmf-config/db-eqiad.php 'reduce db1049 and db1051 load while cloning'
[10:19:07] Logged the message, Master
[10:22:54] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 7.14% of data exceeded the critical threshold [500.0]
[10:23:24] PROBLEM - MySQL Processlist on db1051 is CRITICAL: CRIT 142 unauthenticated, 0 locked, 0 copy to table, 0 statistics
[10:24:24] RECOVERY - MySQL Processlist on db1051 is OK: OK 5 unauthenticated, 0 locked, 0 copy to table, 0 statistics
[10:25:27] hmm
[10:34:47] (PS1) Springle: warm up db1064 in s4, db1065 in s1 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/132183
[10:34:54] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% data above the threshold [250.0]
[10:35:34] (CR) Springle: [C: -2] warm up db1064 in s4, db1065 in s1 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/132183 (owner: Springle)
[11:32:01] (PS1) Giuseppe Lavagetto: Fix the use of $nagios_group. [operations/puppet] - https://gerrit.wikimedia.org/r/132187
[11:47:08] (PS1) Mark Bergsma: Add new codfw allocations, core router loopbacks & transfer nets [operations/dns] - https://gerrit.wikimedia.org/r/132188
[12:17:19] (PS1) Giuseppe Lavagetto: Fix the compile script to work as declared. [operations/software] - https://gerrit.wikimedia.org/r/132191
[12:42:14] RECOVERY - mysqld processes on db1064 is OK: PROCS OK: 1 process with command name mysqld
[12:44:44] PROBLEM - MySQL Replication Heartbeat on db1064 is CRITICAL: CRIT replication delay 3213 seconds
[12:45:54] PROBLEM - MySQL Slave Delay on db1064 is CRITICAL: CRIT replication delay 2627 seconds
[12:49:44] RECOVERY - MySQL Replication Heartbeat on db1064 is OK: OK replication delay 129 seconds
[12:49:54] RECOVERY - MySQL Slave Delay on db1064 is OK: OK replication delay 88 seconds
[12:56:17] hi manybubbles
[12:56:28] matanya: hi!
[12:56:35] ready to test. were and what should be done ?
[12:56:57] matanya: sorry! last night I tried to say that I can't do hewiki until after the deploy train today
[12:57:03] I can do any of the others, actually
[12:57:16] if you think its worth it, I can do hewikisource
[12:57:35] ok, i can wait. hewikisource in prod ?
[12:57:47] yeah, it'd all be in prod
[12:57:55] ok, i can di it
[12:58:00] and cirrus is hewikisources' primary backend
[12:58:19] * do. hewikisource isn't very small, so it is worthwhile
[12:58:34] PROBLEM - Puppet freshness on analytics1026 is CRITICAL: Last successful Puppet run was Tue May 6 08:27:45 2014
[12:58:40] matanya: k. I can rebuild its index now
[12:59:43] !log rebuilding cirrus index for hewikisource to pick up hebmorph
[12:59:50] Logged the message, Master
[13:06:04] RECOVERY - mysqld processes on db1065 is OK: PROCS OK: 1 process with command name mysqld
[13:08:44] PROBLEM - MySQL Replication Heartbeat on db1065 is CRITICAL: CRIT replication delay 1602 seconds
[13:09:27] ACKNOWLEDGEMENT - MySQL Replication Heartbeat on db1065 is CRITICAL: CRIT replication delay 1602 seconds Sean Pringle catching up... - The acknowledgement expires at: 2014-05-09 14:09:11.
[13:09:27] ACKNOWLEDGEMENT - MySQL Slave Delay on db1065 is CRITICAL: CRIT replication delay 1595 seconds Sean Pringle catching up... - The acknowledgement expires at: 2014-05-09 14:09:11.
[13:12:40] (PS2) Mark Bergsma: Add new codfw allocations, core router loopbacks & transfer nets [operations/dns] - https://gerrit.wikimedia.org/r/132188
[13:12:42] (PS1) Mark Bergsma: Allocate codfw private IP space, create management network [operations/dns] - https://gerrit.wikimedia.org/r/132195
[13:14:38] * aude wonders if it's better for me to pull new wmf4 code on tin
[13:14:44] RECOVERY - MySQL Replication Heartbeat on db1065 is OK: OK replication delay -0 seconds
[13:14:45] it's not deployed anywhere yet
[13:18:24] (CR) Aude: "@note: i merged https://gerrit.wikimedia.org/r/#/c/132194/ to update Wikidata and pulled in the change on tin." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/132153 (owner: Reedy)
[13:19:42] (CR) Filippo Giunchedi: [C: 2] Fix the compile script to work as declared. [operations/software] - https://gerrit.wikimedia.org/r/132191 (owner: Giuseppe Lavagetto)
[13:20:13] ^d: are you deploying today?
[13:20:19] aude: ?
[13:20:22] me?
[13:20:31] in an hour and a half I've got some stuff going out
[13:20:32] the general stuff
[13:20:35] says chad and dan
[13:21:03] if anyone is updating wmf4 before then, i pulled our change on tin but not sync
[13:21:10] or should i sync, even if it's not deployed yet
[13:21:31] aude: I'm pretty sure ^d will do the train today for wmf4
[13:21:36] but he's not likely to be really awake right now
[13:21:39] ok
[13:21:45] (CR) Springle: [C: 2] warm up db1064 in s4, db1065 in s1 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/132183 (owner: Springle)
[13:21:52] also our change requires scap, but scap is done anyway
[13:21:53] he'll come and comment in an hour or so over breakfast, probably
[13:22:01] aude: yeah
[13:22:05] (Merged) jenkins-bot: warm up db1064 in s4, db1065 in s1 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/132183 (owner: Springle)
[13:22:18] (CR) Aude: "also note, our change involves i18n changes so scap is needed (but i assume scap still needs to be done anyway)" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/132153 (owner: Reedy)
[13:22:23] i left notes on https://gerrit.wikimedia.org/r/#/c/132153/
[13:22:48] * aude will be on a plane, but changes should be fine (and maybe hoo is around, maybe not)
[13:23:33] if they are not fine, for some odd reason, feel free to revert oru submodule change (wmf4 is test wikidata / test2 only)
[13:24:04] !log springle synchronized wmf-config/db-eqiad.php 'warm up db1064 in s4, db1065 in s1'
[13:24:11] Logged the message, Master
[13:40:56] I'm assuming greg-g is on a plane/already in Zurich?
[13:41:16] aude: I'll remember that
[13:41:22] twkozlowski: I believe so
[13:41:39] manybubbles: https://gerrit.wikimedia.org/r/#/c/127584/
[13:41:43] deskana and ^d are managing today's train deploy
[13:41:46] It's time this gets deployed.
[13:43:01] twkozlowski: also believe so
[13:43:08] thanks manybubbles
[13:43:35] i think our changes in this branch vs last are pretty small this time
[13:43:43] manybubbles: https://wikitech.wikimedia.org/w/index.php?title=Deployments&diff=112033&oldid=112032
[13:44:05] manybubbles: The LangEng team is blocking this for no good reason, and then ignoring pings.
[13:44:11] Pretty frustrating.
[13:45:20] twkozlowski: they are probably travelling
[13:45:35] twkozlowski: I have no idea
[13:45:42] Since Saturday, aude? :-)
[13:46:00] no idea
[13:48:48] manybubbles: This is a simple addition of a namespace that is already enabled on lots and lots of other Wikisource projects. Why the Language Engineering team is blocking it, I have no idea.
[13:49:12] I don't even know why they got involved in it in the first place; it is in no way connected to their work.
[13:49:26] twkozlowski: fair enough but I'm not really smart enough to overrule them.
[13:49:49] for the most part, as a SWAT deployer, I'm just hands
[13:49:58] simple stuff for swat
[13:50:40] * twkozlowski so disappointed at the bureaucracy.
[13:51:36] I'm no fan of it either.
[13:52:05] And the secrecy.
[13:52:27] Greg told me Anasuya asked for the patch to be blocked, but she has never spoken aloud about it anywhere as far as I am aware.
[13:52:59] There is something Seriously Broken about how this is being done.
[13:54:26] twkozlowski: that sounds like it. like, really really broken, actually
[13:55:33] because if the patch should be blocked then the SWAT deployers should know especially because it looks pretty sane
[13:56:09] anomie: is there some mechanism we have for knowing things we _shouldn't_ SWAT deploy?
[13:56:11] I'm seriously regret ever mentioning that it was being blocked in my comment on Gerrit.
[13:56:23] s/I'm/I
[13:56:43] twkozlowski: if you hadn't I have no idea how I'd have known anyone didn't want it.
[13:56:53] Hadn't I mentioned it, it would've already been deployed, and the zhwikisource community could get on with their work.
[13:57:25] And seriously, I'm trying to be helpful and commit a patch so that this community - with which I have zero connection - can continue their work
[13:57:44] manybubbles: Are you referring to the thing yesterday with that zhwikisource namespace config change? The language team *should* have actually -1ed or -2ed the patch and stated their issue. greg-g is handling why that didn't get done.
[13:57:55] twkozlowski: and if it wasn't being blocked for a good reason then we'd all just keep going on with our lives. if it was blocked for a good reason sadness.
[13:58:02] anomie: ah
[13:58:17] yeah, a -1/-2 would have stopped me dead
[13:58:50] I suggest we please, please write down a rule that no secret discussion can block a patch from being deployed
[13:58:52] manybubbles: Apparently there was some in-person discussion in SF about that patch but it never made it into Gerrit.
[13:59:45] twkozlowski: yeah, let me poke greg. unforuntately he's the decider on the rules and he's traveling
[14:00:02] Yes, that's why I asked if he were on a plane
[14:00:08] <^d> so much ping
[14:00:09] quite sure
[14:00:11] As you saw, he left a comment on the bug asking for an explanation
[14:00:13] hi ^d
[14:00:14] <^d> morning folks.
[14:00:27] ... which never came.
[14:00:52] ^d: please note my notes / comments on https://gerrit.wikimedia.org/r/#/c/132153/
[14:00:57] ^d:
[14:00:57] when you do deployments
[14:01:00] ^d:
[14:01:22] <^d> scappy scap
[14:01:29] the stuff in our new wikidata code is actually pretty small changes so should be fine
[14:01:41] but i will be on plane and not sure if ho-o is around then
[14:01:54] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 7.14% of data exceeded the critical threshold [500.0]
[14:02:22] What is really bothering me is that we are an online community, and that patch is blocked because someone told someone else something in an office, hereby excluding me from having any say in the discussion.
[14:02:29] That's all, I'll shut up now.
[14:02:33] * twkozlowski stalks off
[14:08:38] (CR) Manybubbles: "Its been four days since the deployment was halted but no one has posted comments or a code review. If you don't want this patch going to" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/127584 (https://bugzilla.wikimedia.org/64127) (owner: Odder)
[14:08:55] twkozlowski: ranted in gerrit
[14:09:53] who's doing deployments during the hackathon? (just wondering)
[14:10:08] MatmaRex: ^d is
[14:10:17] matanya: finally don
[14:10:20] *done*
[14:10:28] running search
[14:10:34] poor dude manning the post alone, eh?
[14:10:47] MatmaRex: chad and dan
[14:10:57] <^d> Don't have to talk like I'm dead or not here :p
[14:11:19] ^d: :>
[14:12:01] <^d> Who knows though, we might all die during today's deploy. Hope everyone brought their survival gear :p
[14:14:27] _joe_, akosiaris, was it one of you who did the magic with puppet, trusty, and the ruby dependency?
[14:19:34] ^d: I've got to head out - I'll be back in a while. I'm not sure I'll be around for the swat.... If so, maybe delay it....
[14:20:12] <^d> swat all the code.
[14:20:51] <_joe_> andrewbogott: ?!
[14:20:58] <_joe_> andrewbogott: I think akosiaris
[14:21:06] <_joe_> whatever you are referring to :)
[14:21:14] <_joe_> and akosiaris supposedly is on a plane now
[14:21:34] _joe_: ok, fair enough -- is akosiaris coming to the hackathon?
[14:21:41] If so I'll pester him when he gets here
[14:22:06] <_joe_> he was in the list
[14:22:27] andrewbogott: I suppose re: my last mail? if we already figured it out that'd be nice
[14:23:19] godog: yeah, hoping that whatever magic akosiaris wrought for the client he can set up for the master as well.
[14:24:39] andrewbogott: ye, see my mail to ops@ too
[14:25:28] godog: your option (d) seems the simplest way forward for now, but that doesn't fix a puppetmaster, right?
[14:26:00] (or is it possible to make the puppetmaster install not put ruby1.9 in and still work?)
[14:26:21] godog: ok, I'll catch up (but will be very distracted in coming days)
[14:28:45] bblack: yes that's right, I don't know though if we can have the master without any ruby 1.9 deps (rails?)
[14:29:13] (PS1) Mark Bergsma: Attempt to fix RANCID cron job [operations/puppet] - https://gerrit.wikimedia.org/r/132198
[14:29:14] godog: We could also just mark the puppetmaster::self classes so that they fail on trusty
[14:29:25] although it sounds like there are other things that will conflict with the puppet client
[14:30:31] andrewbogott: yeah, as it is now I think any package depending on ruby 1.9 will uninstall puppet 2.7, potentially a minefield
[14:30:44] yeah :(
[14:31:54] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% data above the threshold [250.0]
[14:31:56] (CR) Mark Bergsma: [C: 2] Attempt to fix RANCID cron job [operations/puppet] - https://gerrit.wikimedia.org/r/132198 (owner: Mark Bergsma)
[14:32:36] <_joe_> godog: why do we need a puppet 2.7 master on trusty?
[14:35:36] _joe_: potentially we don't, I think it'd be optimal to avoid 2.7 on trusty at all if we can do that
[14:36:23] <_joe_> yes I agree
[14:36:37] <_joe_> as per my email, the only risk is the puppetdb compatibility
[14:36:47] <_joe_> I will dedicate tomorrow to this
[14:38:23] <^d> sdehaan: Hi, Nik (manybubbles) stepped away for a bit, I think he should be back in the next 15-20 minutes.
[14:38:31] <^d> (I'm Chad, by the way)
[14:38:47] ^d: hi!
[14:39:16] _joe_, godog, we have to have a puppet client on labs, though, at the very least. So I don't know that throwing the puppet master overboard gains us much
[14:40:21] <_joe_> andrewbogott: give me the time to get to the point where I can confidently think most problems have been ironed out, then we'll probably install a 3.0 master in labs
[14:40:30] <_joe_> er, 3.4.3
[14:40:41] ^d: chatting to Dan on Skype quick.
[14:41:32] _joe_: I'm not sure what you mean by 'in labs'. You mean a 3.0 master to replace the current labs-wide master?
[14:41:54] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 7.14% of data exceeded the critical threshold [500.0]
[14:42:54] <_joe_> ah, snap
[14:43:09] <_joe_> andrewbogott: or add another one on the side
[14:43:34] hm...
[14:44:13] Probably getting two different ones running side by side on the same host is not worth the trouble. I'm trying to think of where the other one would run
[14:44:20] I guess it could run on labnet1001 maybe
[14:44:31] andrewbogott: poke please when you have a moment (regarding firewall - pm would be the best, i guess)
[14:49:33] ^d: back!
[14:49:37] so i can swat
[14:50:26] sdehaan: I'm here
[14:50:34] I had to go and pick up a sick kid from school but I'm back now
[14:51:14] matanya: sorry I had to drop for a bit. is that better?
[14:51:27] manybubbles: looking
[14:51:29] if so I can rebuild all the hebrew search backends except hewiki now
[14:51:31] cool
[14:52:12] manybubbles: big time!
[14:52:24] no fp's
[14:55:12] manybubbles: hey!
[14:55:44] so our thinking was to hit the CirrusSearch with a bunch of actual query strings we're extracting from the Wikipedia Text service
[14:56:19] we use tsung for that sort of stuff generally and I want to know what you guys' thoughts were on acceptable limits for testing?
[14:56:25] matanya: sweet - I'll rebuild all of 'em
[14:56:58] sdehaan: what kind of upper bound were you thinking?
[14:57:29] sdehaan: I've been sending ~200/sec when I do load testing and it causes a spike, but not one that cripples us
[14:58:11] manybubbles: we've seen services hit ~ 200 msg/s in Nigeria so I'd like to go with that.
[14:58:27] manybubbles: and the load testing you've done is on the CirrusSearch backend?
[14:58:32] sdehaan: yeah
[14:58:49] sdehaan: go ahead and try 100/sec first and i'll watch it
[14:59:18] sdehaan: http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=es_query_time&s=by+name&c=Elasticsearch+cluster+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4
[14:59:31] ooh graphs. thanks!
[14:59:49] ok let me wrap up collecting the search term samples
[15:00:07] _joe_: how are all of these potential problems on labs not cropping up on our production Trusty boxes as well? Just lucky?
[15:00:15] !log rebuilding all hebrew wikis _except_ hebrew wikipedia and hebrew wikisource to pick up hebmorph. hewikisource got it this morning. hewiki will get it this afternoon after the deployment train
[15:00:22] Logged the message, Master
[15:01:55] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% data above the threshold [250.0]
[15:02:14] <_joe_> andrewbogott: what problems on labs are not in prod?
[15:02:54] _joe_: I'm just referring to… potential problems, clases with ruby versioning mostly.
[15:02:59] *clashes
[15:03:05] * manybubbles has the conch
[15:03:15] <_joe_> andrewbogott: they are equal in prod and labs
[15:03:23] 'k
[15:04:14] matanya: rebuilt everything but hewiki
[15:04:40] (CR) Manybubbles: [C: 2] Turn new highlighter on for more wikis [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/132097 (owner: Manybubbles)
[15:04:52] (Merged) jenkins-bot: Turn new highlighter on for more wikis [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/132097 (owner: Manybubbles)
[15:08:08] !log manybubbles synchronized wmf-config/InitialiseSettings.php 'engage new hightlighter on some more wikis'
[15:08:15] Logged the message, Master
[15:08:59] sdehaan: small wrench - I'm rebuilding the index on enwiki to be more optimized - it shouldn't effect the testing _too much_ but it'll raise the baseline load
[15:12:14] manybubbles: no worries, still collecting sample query data.
[15:13:52] !log manybubbles synchronized php-1.24wmf3/extensions/CirrusSearch/ 'updating Cirrus to pick up some fixes'
[15:13:57] Logged the message, Master
[15:14:19] ^d: your fix is synced
[15:14:23] * manybubbles puts down the conch
[15:14:50] and so is mine
[15:15:34] hi?
[15:18:06] <^d> manybubbles: cool, thanks
[15:25:14] manybubbles: sample extract taking a bit longer than expected, will ping you when about to start. Also, it's evening this side of the world.
[15:25:36] sdehaan: yeah, I figured, let me know when you are ready
[15:25:44] once you have a sample, can you control the rate?
[15:46:40] <^d> aude, manybubbles: Soooo, looks like Tuesday's deploy didn't happen.
[15:46:53] ^d: ....?
[15:47:03] <^d> Nothing's on wmf3 other than mw.org and the 3 test wikis
[15:48:01] ^d: it looks like it, yes
[15:48:21] <^d> Well wmf4 probably shouldn't happen today if we haven't even rolled out wmf3 to more than 4 wikis.
[15:48:25] I've rebuilt all the rebrew search indexes on the assumption that that did happen
[15:48:47] push wmf3 to the group1?
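For the load test discussed between 14:55 and 15:25 (try 100 queries/sec first, then roughly 200), the "can you control the rate?" question can be sketched as a fixed-interval replay loop. This is not tsung configuration; it is a plain Python illustration, and `replay`, `send`, `clock` and `sleep` are hypothetical names chosen for the example. How each query is actually sent to the search backend is deliberately left to the caller.

```python
import time


def replay(queries, rate_per_sec, send, clock=time.monotonic, sleep=time.sleep):
    """Call `send(query)` for each query at a fixed target rate.

    Each send is scheduled relative to the start time, so a slow send does
    not push every later send back by the same amount.
    Returns the number of queries sent.
    """
    if rate_per_sec <= 0:
        raise ValueError("rate_per_sec must be positive")
    interval = 1.0 / rate_per_sec
    start = clock()
    sent = 0
    for i, query in enumerate(queries):
        due = start + i * interval      # when this query should go out
        delay = due - clock()
        if delay > 0:
            sleep(delay)                # wait until the slot arrives
        send(query)
        sent += 1
    return sent
```

Running the same sample twice, once with `rate_per_sec=100` and once with `rate_per_sec=200`, would mirror the ramp agreed in the channel while someone watches the query-time graphs.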
[15:49:09] <^d> I'm thinking of doing that today. Wanna talk to Dan since he's replacement Greg today. [15:49:16] yeah [15:49:25] maybe wmf4 to the tests would be ok too [15:49:30] but not wmf3 to group2 [15:49:47] which is annoying because hebrew analyzer is tied to that [15:49:54] matanya: ^^^ [15:50:04] too much backporting to get it into wmf2 [15:59:34] PROBLEM - Puppet freshness on analytics1026 is CRITICAL: Last successful Puppet run was Tue May 6 08:27:45 2014 [16:11:18] (03PS1) 10Springle: raise db106[45] to normal load [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/132205 [16:11:50] (03CR) 10Springle: [C: 032] raise db106[45] to normal load [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/132205 (owner: 10Springle) [16:11:57] (03Merged) 10jenkins-bot: raise db106[45] to normal load [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/132205 (owner: 10Springle) [16:12:04] (03PS1) 10Chad: group1 wikis to 1.24wmf3 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/132206 [16:13:25] (03CR) 10Chad: [C: 032] group1 wikis to 1.24wmf3 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/132206 (owner: 10Chad) [16:13:33] (03Merged) 10jenkins-bot: group1 wikis to 1.24wmf3 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/132206 (owner: 10Chad) [16:13:49] !log springle synchronized wmf-config/db-eqiad.php 'raise db106[45] to normal load' [16:13:55] Logged the message, Master [16:14:13] !log demon rebuilt wikiversions.cdb and synchronized wikiversions files: group1 wikis to wmf3 [16:14:18] Logged the message, Master [16:19:02] (03PS2) 10Reedy: testwiki to 1.24wmf4 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/132153 [16:19:57] (03CR) 10Reedy: [C: 032] testwiki to 1.24wmf4 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/132153 (owner: 10Reedy) [16:20:05] (03Merged) 10jenkins-bot: testwiki to 1.24wmf4 [operations/mediawiki-config] - 
10https://gerrit.wikimedia.org/r/132153 (owner: 10Reedy) [16:20:41] <^d> Reedy: Wait, what? [16:20:45] <^d> wmf4? [16:20:51] <^d> I thought I was doing that today. [16:20:59] <^d> group1 wikis from tuesday weren't on wmf3 yet. [16:21:04] manybubbles|away: that can wait [16:22:01] I had no internet access on Tuesday [16:22:09] And yesterday I had little... [16:22:30] <^d> Now I'm all kinds of confused! [16:22:32] <^d> :) [16:22:40] I did the initial prep for wmf4 earlier today [16:23:40] <^d> I'm a little leery of moving group1 and 2 both to wmf3 on the same day. [16:25:45] Hmmmmmmmmm [16:25:55] I'd agree [16:26:06] And we can't push the rest of group0 to wmf4 [16:26:09] As APC will freak out [16:26:13] Can do testwiki though.. [16:26:26] There's enough staff around over the weekend for the hackathon [16:28:25] Shall I scap for testwiki? [16:31:54] PROBLEM - swift-container-updater on ms-be1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:31:54] PROBLEM - swift-object-auditor on ms-be1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:32:04] PROBLEM - swift-object-updater on ms-be1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:32:04] PROBLEM - RAID on ms-be1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:32:04] PROBLEM - check configured eth on ms-be1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:32:14] can someone look at that? [16:32:18] as I have to leave in a bit? [16:32:22] that = ms-be1006 [16:32:24] PROBLEM - check if dhclient is running on ms-be1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:32:24] PROBLEM - swift-container-auditor on ms-be1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:32:24] PROBLEM - swift-container-server on ms-be1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:32:25] PROBLEM - swift-account-auditor on ms-be1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[16:32:34] PROBLEM - swift-account-reaper on ms-be1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:32:34] PROBLEM - swift-container-replicator on ms-be1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:32:44] PROBLEM - swift-object-server on ms-be1006 is CRITICAL: Timeout while attempting connection [16:32:44] PROBLEM - swift-object-replicator on ms-be1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:32:45] PROBLEM - swift-account-replicator on ms-be1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:32:45] PROBLEM - swift-account-server on ms-be1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:32:45] PROBLEM - DPKG on ms-be1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:32:54] PROBLEM - Disk space on ms-be1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:32:54] PROBLEM - puppet disabled on ms-be1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:37:59] <^d> Reedy: Yeah sure. [16:38:49] opsens? [16:39:22] !log rebuilding enwiki's cirrus index to optimize for new highlighter [16:39:27] Logged the message, Master [16:39:48] ^d: so I started that a while ago - the content half finished - it was like 60% of its original size [16:40:00] <^d> yay :) [16:40:19] !log reedy Started scap: Build l10n cache for 1.24wmf4 and move testwiki [16:40:25] Logged the message, Master [16:40:33] chasemp, RobH, jgage? [16:40:40] <^d> manybubbles: group1 is on wmf3 now, so we can rebuild those...again [16:40:41] yo? [16:41:05] ms-be1006 [16:41:13] ? [16:41:43] can someone have a look/reboot it? 
[16:41:44] it's up poking at it now [16:42:26] !log reindexing the hebrew wikis other then hewikipedia now that they are on wmf3 so they can have hebmorph [16:42:32] Logged the message, Master [16:43:01] chasemp: lemme know what you do, i'd try restarting the swift services [16:43:28] restarting nrpe [16:43:29] thanks manybubbles [16:43:50] matanya: finished hewikibooks [16:43:57] may as well look at it [16:44:06] because this is the first one we _really_ got on hebmorph in prod [16:44:53] so it's up, neon can ping it, nrpe restart seems have done nadda [16:45:21] it has 167 instances of swift-object-server [16:46:48] yeah seems maybe resource problem [16:47:07] I could reboot..? but that's a big stick small brain tactic...as I have no idea how to debug swift [16:47:46] either someone just did or it died on it's own [16:48:05] a ton of i/o wait [16:48:23] i can't get in [16:48:31] yeah it disappeared just now [16:48:54] rebooting through console [16:49:24] PROBLEM - Host ms-be1006 is DOWN: PING CRITICAL - Packet loss = 100% [16:49:34] ok [16:52:01] !log reedy Finished scap: Build l10n cache for 1.24wmf4 and move testwiki (duration: 11m 42s) [16:52:09] Logged the message, Master [16:52:14] RECOVERY - swift-container-auditor on ms-be1006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [16:52:14] RECOVERY - check if dhclient is running on ms-be1006 is OK: PROCS OK: 0 processes with command name dhclient [16:52:14] RECOVERY - DPKG on ms-be1006 is OK: All packages OK [16:52:14] RECOVERY - swift-container-server on ms-be1006 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [16:52:15] RECOVERY - swift-account-auditor on ms-be1006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [16:52:24] RECOVERY - Host ms-be1006 is UP: PING OK - Packet loss = 0%, RTA = 1.40 ms [16:52:24] RECOVERY - swift-account-reaper on ms-be1006 is OK: PROCS OK: 1 
process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [16:52:24] RECOVERY - swift-container-replicator on ms-be1006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [16:52:34] RECOVERY - Disk space on ms-be1006 is OK: DISK OK [16:52:34] RECOVERY - swift-account-replicator on ms-be1006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [16:52:34] RECOVERY - swift-account-server on ms-be1006 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [16:52:34] RECOVERY - swift-object-replicator on ms-be1006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [16:52:34] RECOVERY - swift-object-server on ms-be1006 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [16:52:44] RECOVERY - puppet disabled on ms-be1006 is OK: OK [16:52:44] RECOVERY - swift-container-updater on ms-be1006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [16:52:44] RECOVERY - swift-object-auditor on ms-be1006 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [16:52:52] !log rebooted ms-be1006 since it dropped dead [16:52:52] chasemp: was that your first live cluster repair? [16:52:54] RECOVERY - swift-object-updater on ms-be1006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [16:52:54] RECOVERY - RAID on ms-be1006 is OK: OK: optimal, 14 logical, 14 physical [16:52:54] RECOVERY - check configured eth on ms-be1006 is OK: NRPE: Unable to read output [16:52:59] Logged the message, Master [16:53:11] !log reedy Started scap: Build l10n cache for 1.24wmf4 and move testwiki [16:53:13] RobH: probably? [16:53:15] * RobH hands chasemp a shiny barnstar for his first live cluster repair [16:53:26] achievement unlocked? 
[16:53:27] ;] [16:53:43] git pull helps [16:54:38] so the reboot seems to have stabilized whatever the hold on swift was, it was pegged at 100% across 8 or 9 cores [16:54:44] and now it's humming along...so yeah don't love that [16:54:57] if it doesn't happen twice it didn't happen at all? ;p [16:55:37] chasemp: oh, sometime before the next full moon you have to make a sacrifice to the server gods or ms-be1006 may come after you [16:55:53] I always put a ring of salt around my house at night [16:56:17] these are computers, not undead, you should add rare earth magnets to said ring ;] [16:56:27] just to be safe [16:57:51] who is the go-to for swift other than faidon in case? [16:58:05] I know in theory we all are but...yeah [16:58:05] ariel knows it some [16:58:15] i know a little, but i've not done much with it [16:58:23] and andrew b is gonna get involved in it [16:58:35] cool, got it [16:59:50] (03CR) 10Rush: [C: 031] "seems great" [operations/puppet] - 10https://gerrit.wikimedia.org/r/132131 (owner: 10Dzahn) [17:01:59] anyway, high i/o wait, might have been a problem with a disk? [17:02:56] or maybe just thrashing to death [17:03:46] if I had to guess based on load and cpu and iowait I saw briefly, it got behind and started choking on it's own liver? [17:03:53] no idea the why of eitehr [17:04:28] manybubbles: mark as success [17:04:43] matanya: yay! rebuilding all the remaining ones. hewikisource is the biggest [17:06:04] (03CR) 10Dbrant: [C: 031] add account for Dmitry Brant [operations/puppet] - 10https://gerrit.wikimedia.org/r/132024 (owner: 10Dzahn) [17:06:10] chasemp: if this helps in any way: https://ganglia.wikimedia.org/latest/?c=Swift%20eqiad&h=ms-be1006.eqiad.wmnet&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2 [17:06:30] <_joe_> chasemp: what just happened to that server? 
[17:06:33] (03CR) 10Dbrant: [C: 031] add dbrant to mobile release uploaders [operations/puppet] - 10https://gerrit.wikimedia.org/r/132109 (owner: 10Dzahn) [17:06:34] PROBLEM - NTP on ms-be1006 is CRITICAL: NTP CRITICAL: Offset unknown [17:06:47] <_joe_> mmh this does not sound good [17:06:48] (03CR) 10Dbrant: [C: 031] add dbrant to stat1003 "special users" and bast [operations/puppet] - 10https://gerrit.wikimedia.org/r/132110 (owner: 10Dzahn) [17:07:05] _joe_: honestly not sure was investigating and it died, forced to reboot [17:07:08] and we're back and waiting [17:07:12] https://ganglia.wikimedia.org/latest/?r=4hr&cs=&ce=&c=Swift+eqiad&h=ms-be1006.eqiad.wmnet&tab=m&vn=&hide-hf=false&mc=2&z=medium&metric_group=ALLGROUPS [17:07:21] <_joe_> chasemp: it is dead again? [17:07:28] something started at 14:00 UTC [17:08:03] <_joe_> matanya: i/o wait [17:08:09] yeah [17:08:36] "something" as in root casue [17:08:40] _joe_: nope seems up [17:09:17] May 8 16:45:58 ms-be1006 kernel: [12987936.324094] [] _raw_spin_lock+0xe/0x20 [17:09:17] May 8 16:45:58 ms-be1006 kernel: [12987936.324110] [] xfs_ail_min_lsn+0x24/0x60 [xfs] [17:09:30] <_joe_> ook [17:09:37] <_joe_> I heart XFS [17:09:41] XFS issues for hours [17:10:03] check syslog for more details [17:10:17] !log reedy Finished scap: Build l10n cache for 1.24wmf4 and move testwiki (duration: 17m 05s) [17:10:23] Logged the message, Master [17:10:26] <_joe_> I am [17:11:34] RECOVERY - NTP on ms-be1006 is OK: NTP OK: Offset -0.0007935762405 secs [17:21:07] MOOF [18:14:54] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 1 below the confidence bounds [18:17:01] (03PS1) 10Alexandros Kosiaris: bacula: allow mysqldumps to be kept locally [operations/puppet] - 10https://gerrit.wikimedia.org/r/132214 [18:17:03] (03PS1) 10Alexandros Kosiaris: Backup role::mariadb::dbstore [operations/puppet] - 10https://gerrit.wikimedia.org/r/132215 [18:18:05] (03CR) 10jenkins-bot: [V: 
04-1] bacula: allow mysqldumps to be kept locally [operations/puppet] - 10https://gerrit.wikimedia.org/r/132214 (owner: 10Alexandros Kosiaris) [18:18:25] (03CR) 10jenkins-bot: [V: 04-1] Backup role::mariadb::dbstore [operations/puppet] - 10https://gerrit.wikimedia.org/r/132215 (owner: 10Alexandros Kosiaris) [18:36:34] (03PS1) 10Ori.livneh: Tidy ::applicationserver & ::applicationserver::pybal_check [operations/puppet] - 10https://gerrit.wikimedia.org/r/132217 [18:36:36] (03PS1) 10Ori.livneh: Move diamond::generic to manifests/ and lint [operations/puppet] - 10https://gerrit.wikimedia.org/r/132218 [18:36:57] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 1 below the confidence bounds [18:38:47] (03Abandoned) 10Hoo man: Introduce an admins::release user group [operations/puppet] - 10https://gerrit.wikimedia.org/r/116019 (owner: 10Hoo man) [18:39:06] (03PS1) 10Dr0ptp4kt: Change 470-01 to zerodot only. [operations/puppet] - 10https://gerrit.wikimedia.org/r/132219 [18:41:38] bblack, when you get a minute would you please review and, if appropriate, +2 merge and deploy https://gerrit.wikimedia.org/r/#/c/132219/ ? [18:52:01] (03CR) 10ArielGlenn: [C: 031] "Makes sense to me." [operations/puppet] - 10https://gerrit.wikimedia.org/r/132026 (owner: 10Dzahn) [18:53:33] (03CR) 10Dzahn: [C: 032] allow manybubbles to run icinga commands [operations/puppet] - 10https://gerrit.wikimedia.org/r/132026 (owner: 10Dzahn) [18:59:36] PROBLEM - Puppet freshness on analytics1026 is CRITICAL: Last successful Puppet run was Tue May 6 08:27:45 2014 [19:20:59] * matanya looks for merges volunteer  [19:21:07] dr0ptp4kt: ah, was this what caused the 470-01 weirdness the other day? [19:21:31] (03PS2) 10BBlack: Change 470-01 to zerodot only. [operations/puppet] - 10https://gerrit.wikimedia.org/r/132219 (owner: 10Dr0ptp4kt) [19:21:59] (03CR) 10BBlack: [C: 032 V: 032] Change 470-01 to zerodot only. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/132219 (owner: 10Dr0ptp4kt) [19:22:29] dr0ptp4kt: also, do we still need some cache flush related to 470-01? [19:23:35] (03CR) 10Rush: [C: 04-2] "in favor of this approach: https://gerrit.wikimedia.org/r/#/c/132218/" [operations/puppet] - 10https://gerrit.wikimedia.org/r/132218 (owner: 10Ori.livneh) [19:25:00] (03CR) 10Rush: "in favor of this approach: https://gerrit.wikimedia.org/r/#/c/132131/" [operations/puppet] - 10https://gerrit.wikimedia.org/r/132218 (owner: 10Ori.livneh) [19:25:30] (03CR) 10Rush: "thanks needed to do some version of this" [operations/puppet] - 10https://gerrit.wikimedia.org/r/132131 (owner: 10Dzahn) [19:25:52] (03PS3) 10Dzahn: role class for diamond, move generic into init [operations/puppet] - 10https://gerrit.wikimedia.org/r/132131 [19:27:47] (03PS4) 10Dzahn: role class for diamond, move generic into init [operations/puppet] - 10https://gerrit.wikimedia.org/r/132131 [19:29:24] (03CR) 10Dzahn: [C: 032] role class for diamond, move generic into init [operations/puppet] - 10https://gerrit.wikimedia.org/r/132131 (owner: 10Dzahn) [19:33:30] mutante: when you are done there, i would like some of my patches to get merged, if possible [19:35:21] matanya: away for lunch break first.. 
then let's talk [19:35:42] (03PS2) 10Ori.livneh: Tidy ::applicationserver & ::applicationserver::pybal_check [operations/puppet] - 10https://gerrit.wikimedia.org/r/132217 [19:35:44] sure, bon appetite [19:35:44] or link me in a PM [19:35:49] thx, ttyl [19:36:54] (03CR) 10Dzahn: "watched on carbon, saw no puppet change (good)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/132131 (owner: 10Dzahn) [19:46:24] (03PS2) 10Alexandros Kosiaris: Backup role::mariadb::dbstore [operations/puppet] - 10https://gerrit.wikimedia.org/r/132215 [19:46:26] (03PS2) 10Alexandros Kosiaris: bacula: allow mysqldumps to be kept locally [operations/puppet] - 10https://gerrit.wikimedia.org/r/132214 [19:47:23] (03CR) 10jenkins-bot: [V: 04-1] Backup role::mariadb::dbstore [operations/puppet] - 10https://gerrit.wikimedia.org/r/132215 (owner: 10Alexandros Kosiaris) [19:52:53] ^d: still around? [19:53:16] <^d> Yes yes, just finishing off lunch. [19:53:53] ah good. fabriceflorin just asked me about the MediaViewer rollout to commons, and I'm guessing that may have slipped off the radar [19:54:36] <^d> robla: It wasn't mentioned to me. iirc greg said we weren't going to do it because of people traveling but I might be misremembering. [19:54:54] err...you're right about commons. I misspoke [19:55:12] however, there was a scaled back rollout that was planned [19:55:42] Tgr: what's the patch? [19:55:46] <^d> robla: Ah yes, I see it on the calendar now. [19:56:03] robla: https://gerrit.wikimedia.org/r/#/c/125035/ but needs fixing [19:56:23] <^d> robla: There was a bit of a mix-up this morning with the general train deploy anyway so I'm treading lightly today. [19:56:26] (just spoke to Tgr...he's fixing now) [19:56:27] <^d> We're a little behind. [19:56:28] robla: Thanks for checking on this release. 
tgr is now updating the Gerrit change 125035 so it only includes Japanes, Portuguese, Spanish and Swedish Wikipedias [19:56:41] (03PS3) 10Alexandros Kosiaris: Backup role::mariadb::dbstore [operations/puppet] - 10https://gerrit.wikimedia.org/r/132215 [19:57:50] ^d Thanks for the update. Looks like we were a bit behind ourselves, as the Gerrit ticket included two sites that we don’t want this week (Commons and Telugu, due to launch next week instead). See the updated release plan: https://www.mediawiki.org/wiki/Multimedia/Media_Viewer/Release_Plan#Large_Wikis [19:59:03] (03PS5) 10Gergő Tisza: FUTURE: Fifth batch of pilot sites for Media Viewer [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/125035 (owner: 10MarkTraceur) [20:00:10] (03CR) 10RobLa: [C: 031] FUTURE: Fifth batch of pilot sites for Media Viewer [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/125035 (owner: 10MarkTraceur) [20:00:15] (03CR) 10Gergő Tisza: "Removed Commons (should be deployed separately from all else) and Telugu (community wanted a different date)." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/125035 (owner: 10MarkTraceur) [20:00:16] tgr: Thanks for doing this! And thanks as well to ^d for handling this — Much appreciated :) [20:01:08] (03CR) 10Chad: [C: 032] FUTURE: Fifth batch of pilot sites for Media Viewer [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/125035 (owner: 10MarkTraceur) [20:01:16] (03Merged) 10jenkins-bot: FUTURE: Fifth batch of pilot sites for Media Viewer [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/125035 (owner: 10MarkTraceur) [20:01:23] <^d> Blargh. [20:01:27] <^d> Should've removed "FUTURE" [20:02:44] we now have a sad little paradox entombed in our gerrit history :-) [20:03:26] robla: hehe. 
Will give archeologists a nice puzzle to investigate in the future :) [20:04:11] !log demon synchronized mediaviewer.dblist 'mediaviewer for svwiki, eswiki, jawiki, ptwiki' [20:04:18] Logged the message, Master [20:05:31] !log demon synchronized wmf-config/InitialiseSettings.php 'touch' [20:05:38] Logged the message, Master [20:06:47] <^d> fabriceflorin: Looks like it's working for me on eswiki :) [20:07:26] <^d> Nothing's blowing up. Back in 5. [20:07:34] ^d: Yes, it’s working fine on spanish and swedish, but haven’t been able to get them to work on portuguese and japanese yet. [20:11:44] robla ^d : here are the ganglia and other ops graps for monitoring the impact of Media Viewer: https://www.mediawiki.org/wiki/Multimedia/Metrics#Ops [20:15:43] <^d> It's working for me on jawiki. [20:15:49] <^d> eg: https://ja.wikipedia.org/wiki/%E3%83%A1%E3%82%A4%E3%83%B3%E3%83%9A%E3%83%BC%E3%82%B8#mediaviewer/%E3%83%95%E3%82%A1%E3%82%A4%E3%83%AB:Notocactus_minimus.jpg [20:19:33] ^d : yes, I am now able to see it on jawiki and ptwiki as well. Thanks so much for making this possible! [20:19:41] <^d> you're welcome [20:19:48] :) [20:22:46] (03CR) 10Dzahn: [C: 032] "indeed same as Change-Id: Ifdd89aec64" [operations/puppet] - 10https://gerrit.wikimedia.org/r/130042 (owner: 10Matanya) [20:31:19] (03CR) 10Dzahn: "well that is this key. i looked at the one class mediawiki::users::l10nupdate and that is the same key that is going to be deleted here" [operations/puppet] - 10https://gerrit.wikimedia.org/r/116936 (owner: 10Matanya) [20:31:43] bblack, i don't think that was the source of the bizarre stuff. that said, they were the second one to see strange behavior. i curl'd all operators on mdot and zerodot around noon pacific yesterday for en.(m|zero).wikipedia.org/wiki/Main_Page yesterday and the banners and stuff looked alright. we're going to add some wfDebuglog calls to the interesting state changes for memcache (save, cold read, invalid values, etc.) 
as that's my hu [20:31:44] yet [20:32:48] !log reedy updated /a/common to {{Gerrit|I11e5ca294}}: FUTURE: Fifth batch of pilot sites for Media Viewer [20:32:52] bblack: that particular operator though is going to zerodot stuff, though [20:32:55] Logged the message, Master [20:32:59] (03PS2) 10Reedy: Wikipedias to 1.24wmf3 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/132154 [20:35:41] (03PS2) 10Reedy: group0 to 1.24wmf4 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/132155 [20:35:43] (03PS3) 10Reedy: Wikipedias to 1.24wmf3 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/132154 [20:35:45] (03CR) 10Chad: [C: 032] Wikipedias to 1.24wmf3 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/132154 (owner: 10Reedy) [20:35:50] (03Merged) 10jenkins-bot: Wikipedias to 1.24wmf3 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/132154 (owner: 10Reedy) [20:36:29] (03CR) 10Chad: [C: 032] group0 to 1.24wmf4 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/132155 (owner: 10Reedy) [20:36:36] (03Merged) 10jenkins-bot: group0 to 1.24wmf4 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/132155 (owner: 10Reedy) [20:37:15] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Wikipedias to 1.24wmf3 and group0 to 1.24wmf4 [20:37:21] Logged the message, Master [20:41:43] (03CR) 10Alexandros Kosiaris: "I got the promised patchset here" [operations/puppet] - 10https://gerrit.wikimedia.org/r/131976 (owner: 10Springle) [20:42:11] (03PS1) 10Tim Landscheidt: Tools: Install npm for users' use [operations/puppet] - 10https://gerrit.wikimedia.org/r/132238 [20:46:04] (03CR) 10Petrb: [C: 031] Tools: Install npm for users' use [operations/puppet] - 10https://gerrit.wikimedia.org/r/132238 (owner: 10Tim Landscheidt) [20:49:34] (03CR) 10Calak: [C: 031] Tools: Install npm for users' use [operations/puppet] - 10https://gerrit.wikimedia.org/r/132238 (owner: 10Tim Landscheidt) 
[20:53:33] (03Abandoned) 10Dr0ptp4kt: Prepare log for zeromemcache state change logging. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/132134 (owner: 10Dr0ptp4kt) [20:54:36] (03CR) 10Alexandros Kosiaris: "Great! Nice Daniel, I will amend this and merge then (funny this patch ended up changing just a couple of chars. I will also amend the com" [operations/puppet] - 10https://gerrit.wikimedia.org/r/116936 (owner: 10Matanya) [20:58:30] (03PS6) 10Alexandros Kosiaris: Change comment on autoinstall authorized_key [operations/puppet] - 10https://gerrit.wikimedia.org/r/116936 (owner: 10Matanya) [21:00:39] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Change comment on autoinstall authorized_key [operations/puppet] - 10https://gerrit.wikimedia.org/r/116936 (owner: 10Matanya) [21:00:39] akosiaris: sanity check for ferm changes.. so let's say we have changes for ALL roles on a node, and the change to add base::firewall.. i can first merge all the changes to the role classes , nothing will happen.. and at the very end i add base:firewall.. or? 
[21:02:10] mutante: yes that would be the best way to go [21:02:34] in fact you can merge ferm::rule and ferm::service changes any point in time [21:02:58] akosiaris: great, just wanted to confirm, i'm about to do that with zirconium [21:03:02] it has quite a few roles [21:03:08] but lgtm [21:03:15] unless base::firewall is also included, they will not be realized since there are virtual resources underneath [21:03:43] yea, it's nice that way, if anything goes wrong can just revert a single (the last) change [21:03:55] good [21:04:38] (03PS3) 10Dzahn: wikimania_scholarships: add ferm rule [operations/puppet] - 10https://gerrit.wikimedia.org/r/130379 (owner: 10Matanya) [21:04:58] (03CR) 10Dzahn: [C: 032] planet: add ferm rules [operations/puppet] - 10https://gerrit.wikimedia.org/r/130311 (owner: 10Matanya) [21:06:13] (03CR) 10Dzahn: [C: 032] etherpad: add ferm rule [operations/puppet] - 10https://gerrit.wikimedia.org/r/130377 (owner: 10Matanya) [21:06:47] (03PS2) 10Dzahn: contacts: add ferm rule [operations/puppet] - 10https://gerrit.wikimedia.org/r/130374 (owner: 10Matanya) [21:07:08] (03CR) 10Dzahn: [C: 032] wikimania_scholarships: add ferm rule [operations/puppet] - 10https://gerrit.wikimedia.org/r/130379 (owner: 10Matanya) [21:08:09] (03CR) 10Dzahn: [C: 032] bugzilla: add ferm rules [operations/puppet] - 10https://gerrit.wikimedia.org/r/130334 (owner: 10Matanya) [21:08:39] (03CR) 10Dzahn: [C: 032] contacts: add ferm rule [operations/puppet] - 10https://gerrit.wikimedia.org/r/130374 (owner: 10Matanya) [21:09:29] (03PS2) 10Dzahn: zirconium: add firewall [operations/puppet] - 10https://gerrit.wikimedia.org/r/130380 (owner: 10Matanya) [21:13:06] (03CR) 10Dzahn: [C: 032] zirconium: add firewall [operations/puppet] - 10https://gerrit.wikimedia.org/r/130380 (owner: 10Matanya) [21:14:19] now the interesting one [21:15:35] matanya: applied, lgtm :) [21:15:42] iptables rules in place, bugzilla still reachable etc [21:15:45] akosiaris: [21:15:58] RECOVERY - HTTP error 
ratio anomaly detection on tungsten is OK: OK: No anomaly detected [21:16:41] (03CR) 10Dzahn: "iptables rules got applied without errors. Bugzilla, Etherpad, contacts etc. still reachable. :)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/130380 (owner: 10Matanya) [21:18:35] matanya: -7 on the list?:) [21:20:21] (03CR) 10Dzahn: [C: 032] rt: add ferm rules [operations/puppet] - 10https://gerrit.wikimedia.org/r/130312 (owner: 10Matanya) [21:21:04] (03CR) 10Dzahn: [C: 032] racktables: add ferm rules [operations/puppet] - 10https://gerrit.wikimedia.org/r/130313 (owner: 10Matanya) [21:23:14] (03PS3) 10Dzahn: magnesium: add firewall [operations/puppet] - 10https://gerrit.wikimedia.org/r/130322 (owner: 10Matanya) [21:25:07] (03CR) 10Dzahn: [C: 032] magnesium: add firewall [operations/puppet] - 10https://gerrit.wikimedia.org/r/130322 (owner: 10Matanya) [21:27:01] (03CR) 10Dzahn: "iptables rules got applied. RT and racktables still reachable." [operations/puppet] - 10https://gerrit.wikimedia.org/r/130322 (owner: 10Matanya) [21:28:03] thanks a lot mutante bill me later [21:28:44] !log reedy updated /a/common to {{Gerrit|I44f67444c}}: group0 to 1.24wmf4 [21:28:49] (03PS1) 10Reedy: wmgVectorBetaPersonalBar to true for all wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/132301 [21:28:51] Logged the message, Master [21:29:05] (03CR) 10Dzahn: [C: 04-1] "please also add 443 here" [operations/puppet] - 10https://gerrit.wikimedia.org/r/130569 (owner: 10Matanya) [21:29:57] (03PS1) 10Rush: partmon and dhcp for argon [operations/puppet] - 10https://gerrit.wikimedia.org/r/132302 [21:30:21] sloooooooooooooow [21:30:36] !log reedy synchronized wmf-config/InitialiseSettings.php 'wmgVectorBetaPersonalBar to true for all wikis' [21:30:43] Logged the message, Master [21:31:04] (03CR) 10Reedy: [C: 032] wmgVectorBetaPersonalBar to true for all wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/132301 (owner: 10Reedy) [21:31:12] (03Merged) 
10jenkins-bot: wmgVectorBetaPersonalBar to true for all wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/132301 (owner: 10Reedy) [21:31:56] (03PS3) 10Matanya: releases: add ferm rule [operations/puppet] - 10https://gerrit.wikimedia.org/r/130569 [21:32:12] (03CR) 10Dzahn: "actually, nevermind, it doesn't really listen on 443, i just saw ports.conf: Listen 443" [operations/puppet] - 10https://gerrit.wikimedia.org/r/130569 (owner: 10Matanya) [21:32:47] (03CR) 10Rush: [C: 032] "doesn't affect any existing functionality so ...going for it" [operations/puppet] - 10https://gerrit.wikimedia.org/r/132302 (owner: 10Rush) [21:32:55] (03PS1) 10Manybubbles: Readd replica count for commons' file index [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/132303 [21:33:34] matanya: sorry, i was wrong [21:33:37] (03PS3) 10Ori.livneh: Tidy ::applicationserver & ::applicationserver::pybal_check [operations/puppet] - 10https://gerrit.wikimedia.org/r/132217 [21:33:46] matanya: it's behind misc varnish ,, right [21:34:05] well, ok [21:35:04] (03CR) 10Dzahn: [C: 032] "better be flexible about being behind varnish or not" [operations/puppet] - 10https://gerrit.wikimedia.org/r/130569 (owner: 10Matanya) [21:35:46] mutante: great! 
[21:35:52] (03PS3) 10Dzahn: caesium: add firewall [operations/puppet] - 10https://gerrit.wikimedia.org/r/130581 (owner: 10Matanya) [21:37:51] (03CR) 10Dzahn: [C: 032] "and another host, this has _just_ releases.wm.org" [operations/puppet] - 10https://gerrit.wikimedia.org/r/130581 (owner: 10Matanya) [21:39:57] (03CR) 10Dzahn: "ACCEPT tcp -- anywhere anywhere tcp dpt:http" [operations/puppet] - 10https://gerrit.wikimedia.org/r/130581 (owner: 10Matanya) [21:40:42] regarding releases, that is what ariel said, but better being flexible, ture [21:47:37] (03PS1) 10Robmoen: Enable anonymous editor acquisition experiment across labs [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/132308 [21:48:18] (03CR) 10Dzahn: "not sure if the rules in role/gitblit are enough yet, i also see java listening on 8081 (besides 8080) and on 9418, also would be nice to " [operations/puppet] - 10https://gerrit.wikimedia.org/r/130306 (owner: 10Matanya) [21:49:43] (03CR) 10Dzahn: [C: 032] "ah, those are just on localhost though, that should be fine, nvm" [operations/puppet] - 10https://gerrit.wikimedia.org/r/130306 (owner: 10Matanya) [21:50:02] (03PS3) 10Dzahn: antimony: add firewall [operations/puppet] - 10https://gerrit.wikimedia.org/r/130306 (owner: 10Matanya) [21:53:07] (03CR) 10Dzahn: [C: 032] antimony: add firewall [operations/puppet] - 10https://gerrit.wikimedia.org/r/130306 (owner: 10Matanya) [21:56:21] (03PS1) 10Rush: dns for argon.wikimedia.org [operations/dns] - 10https://gerrit.wikimedia.org/r/132311 [21:57:32] (03CR) 10RobH: [C: 031] dns for argon.wikimedia.org [operations/dns] - 10https://gerrit.wikimedia.org/r/132311 (owner: 10Rush) [21:57:54] (03CR) 10Rush: [C: 032] dns for argon.wikimedia.org [operations/dns] - 10https://gerrit.wikimedia.org/r/132311 (owner: 10Rush) [21:58:04] matanya: that last one, it was already using the rules [21:58:36] matanya: role::gitblit already did include base::firewall before.. 
but [21:59:04] i still think the change is good like that [21:59:38] PROBLEM - Puppet freshness on analytics1026 is CRITICAL: Last successful Puppet run was Tue May 6 08:27:45 2014 [21:59:58] (03CR) 10Dzahn: "effectively this was already on antimony because role::gitblit includes firewall, but not saying this was wrong, since we do it like this," [operations/puppet] - 10https://gerrit.wikimedia.org/r/130306 (owner: 10Matanya) [22:09:18] greg-g, could i push out https://gerrit.wikimedia.org/r/#/c/132132/4 today? [22:09:30] we are having issues with zero [22:10:45] he's probably not around [22:18:22] TemplateData API calls don't look like they're working [22:18:47] example? [22:19:03] Reedy: Canonical example in api.php is http://en.wikipedia.org/w/api.php?action=templatedata&titles=Template:Stub|Template:Example [22:19:26] Which sort of makes sense; neither of those look to have TD. [22:19:47] https://en.wikipedia.org/w/api.php?action=templatedata&titles=Template:Infobox_person [22:19:50] Oh sorry [22:20:03] Reedy: The example just needs updating. Thanks for humouring :) [22:20:06] https://commons.wikimedia.org/w/api.php?action=templatedata&titles=Template:Information also working. [22:21:23] Filing bug about failing example [22:24:25] (03CR) 10Dzahn: "unfortunately the "salt on puppet role" thing won't work currently. but going to try and fix it" [operations/puppet] - 10https://gerrit.wikimedia.org/r/125888 (https://bugzilla.wikimedia.org/36422) (owner: 10BryanDavis) [22:26:46] (03PS2) 10Dzahn: create admins::bastion for _just_ bastion access [operations/puppet] - 10https://gerrit.wikimedia.org/r/131743 [22:54:05] (03CR) 10Dzahn: [C: 032] create admins::bastion for _just_ bastion access [operations/puppet] - 10https://gerrit.wikimedia.org/r/131743 (owner: 10Dzahn) [22:56:45] hey, I'm ready to do SWAT [22:57:34] ARE YOU READY? [22:57:37] ARE YOU READY TO SWAT? [22:58:44] to this tune: http://www.youtube.com/watch?v=qyXGyxxw7dw ? 
[22:59:43] (03CR) 10Dzahn: "no-op on bast1001, just put people in a group that were "manually" on bast1001 before" [operations/puppet] - 10https://gerrit.wikimedia.org/r/131743 (owner: 10Dzahn) [22:59:47] no, this: https://www.youtube.com/watch?v=Pm8JvqFK_3A [23:00:16] :) better [23:01:18] PROBLEM - Disk space on analytics1019 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/k 97131 MB (5% inode=99%): /var/lib/hadoop/data/l 75080 MB (3% inode=99%): /var/lib/hadoop/data/e 97772 MB (5% inode=99%): /var/lib/hadoop/data/g 114669 MB (6% inode=99%): /var/lib/hadoop/data/c 111894 MB (5% inode=99%): /var/lib/hadoop/data/i 79309 MB (4% inode=99%): [23:01:34] oh noes [23:03:25] (03CR) 10Chad: [C: 032] "Oh yeah, forgot _file isn't a default." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/132303 (owner: 10Manybubbles) [23:05:33] (03Merged) 10jenkins-bot: Readd replica count for commons' file index [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/132303 (owner: 10Manybubbles) [23:07:25] !log maxsem synchronized php-1.24wmf4/extensions/MobileFrontend 'https://gerrit.wikimedia.org/r/132299' [23:07:32] Logged the message, Master [23:08:19] !log demon synchronized wmf-config/CirrusSearch-common.php 'Replica count for commonswiki_file -- syncing with whats already live' [23:08:25] Logged the message, Master [23:10:11] time to sleep. mutante remind me to create a ferm role to hold common ports, so can just do: include ferm {http} next time [23:10:48] matanya: good night (http is just 80 though) [23:10:52] ttyl [23:11:02] !log maxsem synchronized php-1.24wmf3/extensions/MobileFrontend 'https://gerrit.wikimedia.org/r/132299' [23:11:09] Logged the message, Master [23:12:26] SWAT done [23:12:58] Reedy, are you aware of lulz happening in fatalmonitor? [23:14:00] ha MaxSem what lulz? 
I might have done something stupid with that [23:14:43] lotsa "header already sent" stuff [23:15:31] and in exceptions log, there's "Please contact a developer' [23:16:09] the latter is from GWT [23:16:13] MaxSem: eh, I am mistaken, I did something stupid elsewhere :-) [23:16:30] don't worry, we all do [23:17:42] MaxSem: I have a small script that (should) monitor the fatal log on beta, I thought that's what you were talking about. It isn't behaving properly, but it's so minor I haven't figured out what's wrong with it. [23:19:06] poke bd808|BUFFER to set up logstash for beta? [23:19:14] (if it's not there already) [23:26:52] git review -d 132110 [23:26:53] ... [23:26:58] Switched to branch "review/dzahn/132024" [23:27:00] why? [23:27:11] that's a different number, git-review [23:27:18] PROBLEM - Disk space on analytics1019 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/k 96982 MB (5% inode=99%): /var/lib/hadoop/data/l 75072 MB (3% inode=99%): /var/lib/hadoop/data/e 97756 MB (5% inode=99%): /var/lib/hadoop/data/g 114140 MB (6% inode=99%): /var/lib/hadoop/data/c 110048 MB (5% inode=99%): /var/lib/hadoop/data/i 79710 MB (4% inode=99%): [23:29:42] because i messed up the topic branch.. 
sigh..ok [23:32:33] (03PS2) 10Dzahn: add account for Dmitry Brant [operations/puppet] - 10https://gerrit.wikimedia.org/r/132024 [23:34:51] (03CR) 10Dzahn: "the above comment makes me think that we should not have bastion hosts do anything else, besides just run ssh and have user accounts basic" [operations/puppet] - 10https://gerrit.wikimedia.org/r/96424 (owner: 10Dzahn) [23:36:58] (03CR) 10Dzahn: [C: 032] "has been sitting for a week, with approval" [operations/puppet] - 10https://gerrit.wikimedia.org/r/132024 (owner: 10Dzahn) [23:37:46] (03PS2) 10Dzahn: add dbrant to mobile release uploaders [operations/puppet] - 10https://gerrit.wikimedia.org/r/132109 [23:44:08] PROBLEM - Disk space on analytics1013 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/l 73067 MB (3% inode=99%): [23:56:22] (03CR) 10MaxSem: [C: 031] add dbrant to mobile release uploaders [operations/puppet] - 10https://gerrit.wikimedia.org/r/132109 (owner: 10Dzahn) [23:57:50] (03PS3) 10Dzahn: add dbrant to mobile release uploaders [operations/puppet] - 10https://gerrit.wikimedia.org/r/132109 [23:58:18] (03CR) 10Dzahn: [C: 032] add dbrant to mobile release uploaders [operations/puppet] - 10https://gerrit.wikimedia.org/r/132109 (owner: 10Dzahn) [23:58:44] (03PS1) 10Tim Landscheidt: Tools: Install jq and pdftk [operations/puppet] - 10https://gerrit.wikimedia.org/r/132322 (https://bugzilla.wikimedia.org/65048)