[00:07:33] (03CR) 10Whym: "Sorry for the confusion. The new idea came up with at the new bug report yesterday. 'wotd' is not bad, but I believe 'featuredwords' is " [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/136316 (https://bugzilla.wikimedia.org/66015) (owner: 10Whym) [00:13:48] (03CR) 10Aaron Schulz: "Discussed IRL; suspected issue is the bash jobs table getting out of sync." [operations/puppet] - 10https://gerrit.wikimedia.org/r/135875 (owner: 10Aaron Schulz) [00:14:08] (03PS1) 10Ori.livneh: sysctl: refresh service if params updated [operations/puppet] - 10https://gerrit.wikimedia.org/r/136948 [00:14:23] ^ bblack [00:15:04] (03CR) 10GWicke: [C: 031] "+1 for automating the work-around for now." [operations/puppet] - 10https://gerrit.wikimedia.org/r/135875 (owner: 10Aaron Schulz) [00:20:38] (03PS2) 10BBlack: role::mediawiki::job_runner: same config for beta & prod [operations/puppet] - 10https://gerrit.wikimedia.org/r/136347 (owner: 10Ori.livneh) [00:21:05] (03CR) 10BBlack: [C: 032 V: 032] role::mediawiki::job_runner: same config for beta & prod [operations/puppet] - 10https://gerrit.wikimedia.org/r/136347 (owner: 10Ori.livneh) [00:21:20] (03PS2) 10BBlack: get rid of {jobrunner,videoscaler}-apache-service-stopped Execs [operations/puppet] - 10https://gerrit.wikimedia.org/r/136349 (owner: 10Ori.livneh) [00:21:47] (03CR) 10BBlack: [C: 032 V: 032] get rid of {jobrunner,videoscaler}-apache-service-stopped Execs [operations/puppet] - 10https://gerrit.wikimedia.org/r/136349 (owner: 10Ori.livneh) [00:22:07] (03PS2) 10BBlack: mediawiki::web: make $maxclients numeric; simplify config [operations/puppet] - 10https://gerrit.wikimedia.org/r/136351 (owner: 10Ori.livneh) [00:22:39] (03CR) 10BBlack: [C: 032 V: 032] mediawiki::web: make $maxclients numeric; simplify config [operations/puppet] - 10https://gerrit.wikimedia.org/r/136351 (owner: 10Ori.livneh) [00:24:47] (03PS2) 10Ori.livneh: replace Service['procps'] with an Exec [operations/puppet] - 
10https://gerrit.wikimedia.org/r/136948 [00:25:23] bblack: <3 <3 <3 ! could you also look at aaron's patch, https://gerrit.wikimedia.org/r/#/c/135875/ ? it got +1s from the relevant folks but no opsen [00:25:40] (03CR) 10MaxSem: "And so translators's time has been wasted on this stuff. Also, now there are no messages on enwiktionary corresponding to this feed." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/136316 (https://bugzilla.wikimedia.org/66015) (owner: 10Whym) [00:27:35] (03PS3) 10BBlack: replace Service['procps'] with an Exec [operations/puppet] - 10https://gerrit.wikimedia.org/r/136948 (owner: 10Ori.livneh) [00:27:47] (03CR) 10BBlack: [C: 032 V: 032] replace Service['procps'] with an Exec [operations/puppet] - 10https://gerrit.wikimedia.org/r/136948 (owner: 10Ori.livneh) [00:29:57] (03PS2) 10BBlack: Periodically restart job runners to avoid pipeline shrinking issue [operations/puppet] - 10https://gerrit.wikimedia.org/r/135875 (owner: 10Aaron Schulz) [00:30:10] (03CR) 10BBlack: [C: 032 V: 032] Periodically restart job runners to avoid pipeline shrinking issue [operations/puppet] - 10https://gerrit.wikimedia.org/r/135875 (owner: 10Aaron Schulz) [00:30:37] \o/ [00:32:27] bblack: patches did the right thing (i.e. in most cases nothing, since they're mostly refactors) on all the nodes i've tested so far [00:32:57] * ori does a little dance [00:33:26] cool [00:34:31] (03CR) 10Whym: "Ok, then I'd revert this to "wotd". I cannot ignore 50+ translations on TWN. 
"featuredwords" may be revisited as an addition when other ty" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/136316 (https://bugzilla.wikimedia.org/66015) (owner: 10Whym) [00:35:13] (03PS2) 10BBlack: Set rx ring params for bnx2x on 10GbE LVS [operations/puppet] - 10https://gerrit.wikimedia.org/r/136944 [00:36:22] (03CR) 10BBlack: [C: 032 V: 032] Set rx ring params for bnx2x on 10GbE LVS [operations/puppet] - 10https://gerrit.wikimedia.org/r/136944 (owner: 10BBlack) [00:37:32] Does anybody know why there's no #central (for CentralAuth) on irc.wikimedia.org? [00:37:58] There's no wiki in the centralauth database [00:38:04] The closest equivalent would be loginwiki, I guess? [00:38:50] no, it used to be that that channel was used for CentralAuth logging [00:39:11] and not loginwiki [00:39:52] (03PS1) 10BBlack: fix s/ring/setting/ in interface::ring usage [operations/puppet] - 10https://gerrit.wikimedia.org/r/136955 [00:40:18] (03CR) 10BBlack: [C: 032 V: 032] fix s/ring/setting/ in interface::ring usage [operations/puppet] - 10https://gerrit.wikimedia.org/r/136955 (owner: 10BBlack) [00:40:30] Did someone mess with the RC feed today? [00:40:54] It stopped at about 12:07 [00:41:06] PM CST ^ [00:41:08] (03PS1) 10Ori.livneh: remove pmtpa app server monitor_groups [operations/puppet] - 10https://gerrit.wikimedia.org/r/136956 [00:41:13] Bsadowski1: the irc server migrated [00:41:20] hm [00:41:23] chasemp: ^ ? [00:43:17] (03PS1) 10BBlack: fix ethtool command for ring as well [operations/puppet] - 10https://gerrit.wikimedia.org/r/136957 [00:43:30] (03CR) 10BBlack: [C: 032 V: 032] fix ethtool command for ring as well [operations/puppet] - 10https://gerrit.wikimedia.org/r/136957 (owner: 10BBlack) [00:43:54] Just confirmed its working fine for me, maybe you have the old server hardcoded or stale DNS? [00:44:08] is it a known issue that redis causes errors on beta-hhvm? [00:44:08] IRC.Wikimedia.org that is [00:44:11] What? 
[00:44:16] I'm referring to the channel [00:44:27] jackmcbarn: no; what's the issue? [00:44:32] There was a channel on irc.wikimedia.org for CentralAuth logging [00:45:23] ori: on occasion, when i create/import a page, i get an internal error that redis failed something about search indexes [00:45:26] let me see if i can repro it now [00:45:36] OK, my guess is no messages logged to it so no channel yet. They are not created until there is action I believe, but what is the channel name [00:45:44] jackmcbarn: that's awesome to know you're testing things on hhvm [00:46:30] ori: https://dpaste.de/Hy1p/raw [00:46:38] oddly, the page did get created [00:47:30] jackmcbarn: thanks! i think i know what this is about (there were several bugs in the redis client in hhvm that have been fixed in master, but not in the version deployed in labs) [00:47:36] i'll test to see if that's the issue [00:50:01] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Fri 30 May 2014 18:25:33 UTC [00:54:01] PROBLEM - Puppet freshness on analytics1011 is CRITICAL: Last successful Puppet run was Mon 02 Jun 2014 18:51:58 UTC [00:59:29] (03PS3) 10BryanDavis: beta: New script to restart apaches [operations/puppet] - 10https://gerrit.wikimedia.org/r/125888 (https://bugzilla.wikimedia.org/36422) [01:00:00] (03PS7) 10BryanDavis: Labs: Fix beta to work with role::mediawiki [operations/puppet] - 10https://gerrit.wikimedia.org/r/134519 [01:00:33] (03PS3) 10BryanDavis: Move OCG default port to 8000 [operations/puppet] - 10https://gerrit.wikimedia.org/r/134975 (owner: 10Mwalker) [01:00:55] (03PS6) 10BryanDavis: beta: bring in mediawiki/skins.git [operations/puppet] - 10https://gerrit.wikimedia.org/r/136325 (https://bugzilla.wikimedia.org/65868) (owner: 10Hashar) [01:01:01] PROBLEM - Puppet freshness on analytics1013 is CRITICAL: Last successful Puppet run was Mon 02 Jun 2014 18:59:02 UTC [01:01:13] (03PS5) 10BryanDavis: Make GeoIP lookup code safer [operations/puppet] - 
10https://gerrit.wikimedia.org/r/136655 (https://bugzilla.wikimedia.org/64582) (owner: 10Ori.livneh) [01:01:49] (03PS2) 10Krinkle: Add custom Diamond collector for RCStream [operations/puppet] - 10https://gerrit.wikimedia.org/r/136621 (owner: 10Ori.livneh) [01:02:04] (03CR) 10Krinkle: "Rebased to resolve outdated dependency" [operations/puppet] - 10https://gerrit.wikimedia.org/r/136621 (owner: 10Ori.livneh) [01:05:01] PROBLEM - Puppet freshness on analytics1027 is CRITICAL: Last successful Puppet run was Mon 02 Jun 2014 19:03:36 UTC [01:08:01] PROBLEM - Puppet freshness on analytics1012 is CRITICAL: Last successful Puppet run was Mon 02 Jun 2014 19:06:50 UTC [01:11:28] (03PS1) 10BryanDavis: beta: Remove File['/usr/local/apache/common'] from ::beta::common [operations/puppet] - 10https://gerrit.wikimedia.org/r/136963 [01:12:06] PROBLEM - Puppet freshness on analytics1015 is CRITICAL: Last successful Puppet run was Mon 02 Jun 2014 19:10:41 UTC [01:13:46] (03CR) 10BryanDavis: "Cherry-picked on deployment-salt and applied on deployment-apache01 to verify." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/136963 (owner: 10BryanDavis) [01:14:01] PROBLEM - Puppet freshness on analytics1014 is CRITICAL: Last successful Puppet run was Mon 02 Jun 2014 19:12:17 UTC [01:16:22] PROBLEM - Disk space on labstore1001 is CRITICAL: DISK CRITICAL - free space: / 707 MB (3% inode=92%): [01:18:01] PROBLEM - Puppet freshness on analytics1009 is CRITICAL: Last successful Puppet run was Mon 02 Jun 2014 19:16:05 UTC [01:21:01] PROBLEM - Puppet freshness on analytics1019 is CRITICAL: Last successful Puppet run was Mon 02 Jun 2014 19:19:33 UTC [01:24:01] PROBLEM - Puppet freshness on analytics1017 is CRITICAL: Last successful Puppet run was Mon 02 Jun 2014 19:22:40 UTC [01:25:01] PROBLEM - Puppet freshness on analytics1020 is CRITICAL: Last successful Puppet run was Mon 02 Jun 2014 19:23:15 UTC [01:26:01] PROBLEM - Puppet freshness on analytics1018 is CRITICAL: Last successful Puppet run was Mon 02 Jun 2014 19:24:16 UTC [01:30:01] PROBLEM - Puppet freshness on analytics1016 is CRITICAL: Last successful Puppet run was Mon 02 Jun 2014 19:28:14 UTC [01:31:22] (03PS1) 10Reedy: wgCentralAuthRC to EQIAD rc ircd [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/136964 [01:32:00] !log reedy Synchronized wmf-config/CommonSettings.php: wgCentralAuthRC to EQIAD rc ircd (duration: 00m 14s) [01:32:07] Logged the message, Master [01:32:15] (03CR) 10Reedy: [C: 032] wgCentralAuthRC to EQIAD rc ircd [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/136964 (owner: 10Reedy) [01:32:21] (03Merged) 10jenkins-bot: wgCentralAuthRC to EQIAD rc ircd [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/136964 (owner: 10Reedy) [01:41:31] (03PS1) 10Reedy: Revert "keep rc-pmtpa name for now" [operations/puppet] - 10https://gerrit.wikimedia.org/r/136965 [01:41:47] (03CR) 10Reedy: [C: 04-1] "Needs announcing and stuff" [operations/puppet] - 10https://gerrit.wikimedia.org/r/136965 (owner: 10Reedy) [02:11:12] PROBLEM - Disk 
space on searchidx1001 is CRITICAL: DISK CRITICAL - free space: / 1 MB (0% inode=56%): [02:15:15] !log LocalisationUpdate completed (1.24wmf6) at 2014-06-03 02:14:12+00:00 [02:15:23] Logged the message, Master [02:26:52] !log LocalisationUpdate completed (1.24wmf7) at 2014-06-03 02:25:48+00:00 [02:26:57] Logged the message, Master [02:27:26] (03PS3) 10Springle: Use m2-master CNAME to make DB rotations neater. This allows a master switch to be a DNS change plus a simple port 3306 tcp redirect with socat until TTL. Should also help if we switch to a haproxy configuration in the future. [operations/puppet] - 10https://gerrit.wikimedia.org/r/131419 [02:33:21] (03CR) 10Springle: [C: 032] "A code review suggests gwtorm would need a restart or have the jvm dns cache stuff tweaked, which doesn't sound fun." [operations/puppet] - 10https://gerrit.wikimedia.org/r/131419 (owner: 10Springle) [02:35:24] (03Abandoned) 10Springle: Weekly logical backup for dbstore100[12] all-shards boxes [operations/puppet] - 10https://gerrit.wikimedia.org/r/131976 (owner: 10Springle) [02:36:14] (03CR) 10Ori.livneh: [C: 031] "awesome, many thanks" [operations/puppet] - 10https://gerrit.wikimedia.org/r/136963 (owner: 10BryanDavis) [02:36:24] <^d> For the love of all that is holy. [02:37:34] <^d> public static String getVariable(String mediawiki, String var) throws IOException{ [02:37:34] <^d> return Command.exec(new String[] { [02:37:34] <^d> "/bin/bash", [02:37:36] <^d> "-c", [02:37:38] <^d> "cd "+mediawiki+" && (echo \"return \\$"+var+"\" | php maintenance/eval.php --conf "+mediawiki+"/LocalSettings.php | sed -e 's/^> // ; /^$/d')"}).trim(); [02:37:40] <^d> } [02:37:42] <^d> Why did we ever allow this? [02:37:44] <^d> :p [02:38:07] heh [02:38:25] what could possibly go wrong [02:40:58] <^d> Surprise surprise, nothing calls this class. [02:41:17] <^d> Oh, it's a main() entry. 
[02:41:21] * ^d headdesks [02:52:01] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 20.00% of data exceeded the critical threshold [500.0] [02:54:53] (03PS1) 10Chad: Remove searchidx1001 from scap targets [operations/puppet] - 10https://gerrit.wikimedia.org/r/136968 [03:07:01] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% data above the threshold [250.0] [03:17:28] !log LocalisationUpdate ResourceLoader cache refresh completed at Tue Jun 3 03:16:22 UTC 2014 (duration 16m 21s) [03:17:33] Logged the message, Master [03:30:08] (03PS3) 10Whym: FeaturedFeeds for Wiktionary [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/136316 (https://bugzilla.wikimedia.org/66015) [03:42:00] !log revert to lvm snapshot on db1046, xfs being crotchety [03:42:05] Logged the message, Master [03:51:01] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Fri 30 May 2014 18:25:33 UTC [03:55:01] PROBLEM - Puppet freshness on analytics1011 is CRITICAL: Last successful Puppet run was Mon 02 Jun 2014 18:51:58 UTC [04:02:01] PROBLEM - Puppet freshness on analytics1013 is CRITICAL: Last successful Puppet run was Mon 02 Jun 2014 18:59:02 UTC [04:06:01] PROBLEM - Puppet freshness on analytics1027 is CRITICAL: Last successful Puppet run was Mon 02 Jun 2014 19:03:36 UTC [04:09:01] PROBLEM - Puppet freshness on analytics1012 is CRITICAL: Last successful Puppet run was Mon 02 Jun 2014 19:06:50 UTC [04:13:01] PROBLEM - Puppet freshness on analytics1015 is CRITICAL: Last successful Puppet run was Mon 02 Jun 2014 19:10:41 UTC [04:15:01] PROBLEM - Puppet freshness on analytics1014 is CRITICAL: Last successful Puppet run was Mon 02 Jun 2014 19:12:17 UTC [04:19:01] PROBLEM - Puppet freshness on analytics1009 is CRITICAL: Last successful Puppet run was Mon 02 Jun 2014 19:16:05 UTC [04:22:01] PROBLEM - Puppet freshness on analytics1019 is CRITICAL: Last successful Puppet run was Mon 02 Jun 2014 19:19:33 UTC [04:25:01] PROBLEM 
- Puppet freshness on analytics1017 is CRITICAL: Last successful Puppet run was Mon 02 Jun 2014 19:22:40 UTC [04:26:01] PROBLEM - Puppet freshness on analytics1020 is CRITICAL: Last successful Puppet run was Mon 02 Jun 2014 19:23:15 UTC [04:27:01] PROBLEM - Puppet freshness on analytics1018 is CRITICAL: Last successful Puppet run was Mon 02 Jun 2014 19:24:16 UTC [04:30:43] (03PS1) 10Springle: use m2 CNAME for exim [operations/puppet] - 10https://gerrit.wikimedia.org/r/136972 [04:31:01] PROBLEM - Puppet freshness on analytics1016 is CRITICAL: Last successful Puppet run was Mon 02 Jun 2014 19:28:14 UTC [04:33:12] (03CR) 10Springle: "Not sure if exim would pick up a DNS change after TTL or if it caches and would need a restart regardless, but all the other M2 services n" [operations/puppet] - 10https://gerrit.wikimedia.org/r/136972 (owner: 10Springle) [04:35:11] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:54:01] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.130 second response time [05:18:01] PROBLEM - Puppet freshness on db1009 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 02:17:43 UTC [05:18:41] PROBLEM - MySQL Replication Heartbeat on db1046 is CRITICAL: CRIT replication delay 10239 seconds [05:19:11] PROBLEM - MySQL Slave Delay on db1046 is CRITICAL: CRIT replication delay 10167 seconds [05:22:29] (03CR) 10Legoktm: [C: 04-1] "The current username is technically wrong, but I don't think it's worth the breaking change to make it correct, especially since we're goi" [operations/puppet] - 10https://gerrit.wikimedia.org/r/136965 (owner: 10Reedy) [05:23:58] ACKNOWLEDGEMENT - MySQL Replication Heartbeat on db1046 is CRITICAL: CRIT replication delay 9623 seconds Sean Pringle catching up - The acknowledgement expires at: 2014-06-04 08:23:31. 
[05:23:58] ACKNOWLEDGEMENT - MySQL Slave Delay on db1046 is CRITICAL: CRIT replication delay 9692 seconds Sean Pringle catching up - The acknowledgement expires at: 2014-06-04 08:23:31. [05:31:29] PROBLEM - Puppet freshness on mw1202 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 05:28:24 UTC [05:33:29] PROBLEM - Puppet freshness on mw1202 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 05:28:24 UTC [05:35:29] PROBLEM - Puppet freshness on mw1202 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 05:28:24 UTC [05:36:58] (03CR) 10BryanDavis: "Sounds good to me. Should I be afraid that there are apparently two implementations of sync-apache?" [operations/puppet] - 10https://gerrit.wikimedia.org/r/136968 (owner: 10Chad) [05:37:29] PROBLEM - Puppet freshness on mw1202 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 05:28:24 UTC [05:39:29] PROBLEM - Puppet freshness on mw1202 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 05:28:24 UTC [05:41:29] PROBLEM - Puppet freshness on mw1202 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 05:28:24 UTC [05:43:29] PROBLEM - Puppet freshness on mw1202 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 05:28:24 UTC [05:45:29] PROBLEM - Puppet freshness on mw1202 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 05:28:24 UTC [05:47:09] RECOVERY - Puppet freshness on db1009 is OK: puppet ran at Tue Jun 3 05:47:05 UTC 2014 [05:47:29] PROBLEM - Puppet freshness on mw1202 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 05:28:24 UTC [05:49:29] PROBLEM - Puppet freshness on mw1202 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 05:28:24 UTC [05:51:29] PROBLEM - Puppet freshness on mw1202 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 05:28:24 UTC [05:53:29] PROBLEM - Puppet freshness on mw1202 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 05:28:24 UTC [05:53:51] (03CR) 10Chad: "Probably." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/136968 (owner: 10Chad) [05:55:29] PROBLEM - Puppet freshness on mw1202 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 05:28:24 UTC [05:57:29] PROBLEM - Puppet freshness on mw1202 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 05:28:24 UTC [05:58:29] RECOVERY - Puppet freshness on mw1202 is OK: puppet ran at Tue Jun 3 05:58:26 UTC 2014 [05:58:55] <_joe_> oh god [05:59:00] <_joe_> who broke puppet? [05:59:07] <_joe_> good day everyone :/ [06:10:20] mw1202 seems fine [06:10:42] puppet, i mean [06:10:45] the machine itself, though.. [06:12:26] ..is also not in some unusual state [06:13:57] load on neon spiked, but not even to the last 24h maximum [06:21:07] RECOVERY - MySQL Slave Delay on db1046 is OK: OK replication delay 150 seconds [06:21:37] RECOVERY - MySQL Replication Heartbeat on db1046 is OK: OK replication delay 55 seconds [06:32:09] (03PS1) 10Giuseppe Lavagetto: mediawiki: make jobrunners restart work. [operations/puppet] - 10https://gerrit.wikimedia.org/r/136977 [06:32:18] <_joe_> ori: are you still here? [06:32:25] _joe_: hello [06:32:33] <_joe_> if so, don't look at that commit I just made [06:32:35] <_joe_> :) [06:32:40] ok:) [06:33:26] <_joe_> It's one of those cases in which two very brilliant people don't catch a very stupid error [06:33:36] <_joe_> (that would be you and brandon) [06:33:56] <_joe_> oh, also, hard tabs! [06:34:57] (03PS2) 10Giuseppe Lavagetto: mediawiki: make jobrunners restart work. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/136977 [06:36:41] _joe_: ah, there's http://docs.puppetlabs.com/references/latest/function.html#fqdnrand [06:37:07] <_joe_> which is notoriously lame if you have very similar names [06:37:20] * ori is also notoriously lame [06:37:39] <_joe_> no you're not :) [06:37:46] there's a seed param [06:37:58] <_joe_> but I got bitten by that function [06:38:12] fair enough [06:38:14] <_joe_> so I reproduced what we do in other places in our code [06:38:37] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: make jobrunners restart work. [operations/puppet] - 10https://gerrit.wikimedia.org/r/136977 (owner: 10Giuseppe Lavagetto) [06:48:27] PROBLEM - MySQL Slave Running on db1046 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error Duplicate entry 113904793 for key PRIMARY on query. Defau [06:48:52] <_joe_> springle: I guess you are handling that. [06:51:16] yeah [06:51:25] i broke it a bit [06:51:37] PROBLEM - MySQL Replication Heartbeat on db1046 is CRITICAL: CRIT replication delay 328 seconds [06:51:57] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Fri 30 May 2014 18:25:33 UTC [06:52:37] RECOVERY - MySQL Replication Heartbeat on db1046 is OK: OK replication delay 0 seconds [06:53:27] RECOVERY - MySQL Slave Running on db1046 is OK: OK replication [06:55:57] PROBLEM - Puppet freshness on analytics1011 is CRITICAL: Last successful Puppet run was Mon 02 Jun 2014 18:51:58 UTC [07:02:57] PROBLEM - Puppet freshness on analytics1013 is CRITICAL: Last successful Puppet run was Mon 02 Jun 2014 18:59:02 UTC [07:06:57] PROBLEM - Puppet freshness on analytics1027 is CRITICAL: Last successful Puppet run was Mon 02 Jun 2014 19:03:36 UTC [07:08:01] (03PS3) 10Ori.livneh: Add custom Diamond collector for RCStream [operations/puppet] - 10https://gerrit.wikimedia.org/r/136621 [07:09:57] PROBLEM - Puppet freshness on analytics1012 is CRITICAL: Last successful 
Puppet run was Mon 02 Jun 2014 19:06:50 UTC [07:13:57] PROBLEM - Puppet freshness on analytics1015 is CRITICAL: Last successful Puppet run was Mon 02 Jun 2014 19:10:41 UTC [07:15:57] PROBLEM - Puppet freshness on analytics1014 is CRITICAL: Last successful Puppet run was Mon 02 Jun 2014 19:12:17 UTC [07:19:57] PROBLEM - Puppet freshness on analytics1009 is CRITICAL: Last successful Puppet run was Mon 02 Jun 2014 19:16:05 UTC [07:22:19] (03PS8) 10Ori.livneh: Add rsyslog module and port existing usage [operations/puppet] - 10https://gerrit.wikimedia.org/r/135447 [07:22:57] PROBLEM - Puppet freshness on analytics1019 is CRITICAL: Last successful Puppet run was Mon 02 Jun 2014 19:19:33 UTC [07:25:57] PROBLEM - Puppet freshness on analytics1017 is CRITICAL: Last successful Puppet run was Mon 02 Jun 2014 19:22:40 UTC [07:26:57] PROBLEM - Puppet freshness on analytics1020 is CRITICAL: Last successful Puppet run was Mon 02 Jun 2014 19:23:15 UTC [07:27:44] (to the tune of "amadeus, amadeus") [07:27:48] analytics, analytics! [07:27:57] PROBLEM - Puppet freshness on analytics1018 is CRITICAL: Last successful Puppet run was Mon 02 Jun 2014 19:24:16 UTC [07:29:06] <_joe_> lol [07:29:16] <_joe_> should I take a look? 
yeah I should [07:30:17] <_joe_> ah, admins.pp [07:30:57] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data exceeded the critical threshold [500.0] [07:31:57] PROBLEM - Puppet freshness on analytics1016 is CRITICAL: Last successful Puppet run was Mon 02 Jun 2014 19:28:14 UTC [07:43:57] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% data above the threshold [250.0] [07:45:51] (03PS1) 10Giuseppe Lavagetto: rcstream: add DNS records for stream.w.o [operations/dns] - 10https://gerrit.wikimedia.org/r/136983 [07:49:22] (03PS2) 10Giuseppe Lavagetto: rcstream: add DNS records for stream.w.o [operations/dns] - 10https://gerrit.wikimedia.org/r/136983 [08:07:10] (03PS3) 10Giuseppe Lavagetto: rcstream: add DNS records for stream.w.o [operations/dns] - 10https://gerrit.wikimedia.org/r/136983 [08:30:56] PROBLEM - Puppet freshness on mc1014 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 08:28:09 UTC [08:32:56] PROBLEM - Puppet freshness on mc1014 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 08:28:09 UTC [08:34:56] PROBLEM - Puppet freshness on mc1014 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 08:28:09 UTC [08:36:56] PROBLEM - Puppet freshness on mc1014 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 08:28:09 UTC [08:38:56] PROBLEM - Puppet freshness on mc1014 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 08:28:09 UTC [08:40:56] PROBLEM - Puppet freshness on mc1014 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 08:28:09 UTC [08:42:56] PROBLEM - Puppet freshness on mc1014 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 08:28:09 UTC [08:43:17] (03PS2) 10Faidon Liambotis: use m2 CNAME for exim [operations/puppet] - 10https://gerrit.wikimedia.org/r/136972 (owner: 10Springle) [08:43:23] (03CR) 10Faidon Liambotis: [C: 032 V: 032] use m2 CNAME for exim [operations/puppet] - 10https://gerrit.wikimedia.org/r/136972 (owner: 10Springle) [08:43:39] 
curious duplicates (?) [08:44:56] PROBLEM - Puppet freshness on mc1014 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 08:28:09 UTC [08:46:56] PROBLEM - Puppet freshness on mc1014 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 08:28:09 UTC [08:48:56] PROBLEM - Puppet freshness on mc1014 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 08:28:09 UTC [08:50:19] <_joe_> that alarm is bogus [08:50:21] <_joe_> btw [08:50:56] PROBLEM - Puppet freshness on mc1014 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 08:28:09 UTC [08:52:28] (03CR) 10Filippo Giunchedi: "ouch, looks like there are different settings that affect reporting warnings to stderr/stdout, e.g. http://www.php.net/manual/en/errorfunc" [operations/puppet] - 10https://gerrit.wikimedia.org/r/135133 (owner: 10Aaron Schulz) [08:52:56] PROBLEM - Puppet freshness on mc1014 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 08:28:09 UTC [08:54:56] PROBLEM - Puppet freshness on mc1014 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 08:28:09 UTC [08:56:05] (03CR) 10Filippo Giunchedi: [C: 031] Removed maxvirtualmemory stuff from jobs-loop [operations/puppet] - 10https://gerrit.wikimedia.org/r/135136 (owner: 10Aaron Schulz) [08:56:56] PROBLEM - Puppet freshness on mc1014 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 08:28:09 UTC [08:57:36] RECOVERY - Puppet freshness on mc1014 is OK: puppet ran at Tue Jun 3 08:57:34 UTC 2014 [09:15:04] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:18:53] (03PS1) 10Giuseppe Lavagetto: rcstream: add lvs, modify mountpoint [operations/puppet] - 10https://gerrit.wikimedia.org/r/136990 [09:20:02] (03CR) 10jenkins-bot: [V: 04-1] rcstream: add lvs, modify mountpoint [operations/puppet] - 10https://gerrit.wikimedia.org/r/136990 (owner: 10Giuseppe Lavagetto) [09:21:54] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 
bytes in 0.024 second response time [09:26:35] (03PS2) 10Giuseppe Lavagetto: rcstream: add lvs, modify mountpoint [operations/puppet] - 10https://gerrit.wikimedia.org/r/136990 [09:26:42] (03PS2) 10Filippo Giunchedi: add mini-dinstall to releases.wikimedia.org [operations/puppet] - 10https://gerrit.wikimedia.org/r/136128 [09:28:37] (03CR) 10Filippo Giunchedi: "indeed mini-dinstall doesn't support splitting into components, so no while keeping mini-dinstall the sources.list will have to be in the " [operations/puppet] - 10https://gerrit.wikimedia.org/r/136128 (owner: 10Filippo Giunchedi) [09:31:23] (03CR) 10Faidon Liambotis: [C: 04-1] Icinga: new command "check_dispatch" for Wikidata [operations/puppet] - 10https://gerrit.wikimedia.org/r/136095 (owner: 10Christopher Johnson (WMDE)) [09:34:50] (03CR) 10Filippo Giunchedi: [C: 031] rcstream: add DNS records for stream.w.o (031 comment) [operations/dns] - 10https://gerrit.wikimedia.org/r/136983 (owner: 10Giuseppe Lavagetto) [09:35:10] any clue why it would take several minutes to `ls` 40k directories and 40k symbolic links? [09:35:32] I thought a directory in the FS had some kind of index that would easily list them [09:36:13] <_joe_> hashar: you thought wrong :) [09:36:34] <_joe_> you have a stat cache for the FS that can help [09:37:09] <_joe_> but I don't remember ls's internal workings atm [09:37:25] on first access, one still has to stat all those files so :( [09:39:43] (03CR) 10Filippo Giunchedi: [C: 031] Add custom Diamond collector for RCStream (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/136621 (owner: 10Ori.livneh) [09:40:37] ls or ls -l ? [09:40:45] ls doesn't. ls -l does [09:41:21] ahhh [09:42:12] I have to do a path after anyway. At least the cache will be filled up by ls :) [09:43:56] hashar: also modern fs would have indexed directories alright vs e.g. 
ext2 [09:44:36] <_joe_> godog: not "indexed" in the sense he needed it [09:45:49] <_joe_> (as in, list all files and their attrs) [09:48:32] <_joe_> hashar: what FS is that? [09:48:33] true, not for listing [09:48:56] _joe_: /dev/md0 on / type ext3 (rw,noatime,errors=remount-ro) [09:48:57] <_joe_> godog: he needed a 'db-like' index I'd say [09:49:15] that is just a routine cleanup task anyway [09:49:27] I have left the rm calls in the background and they will eventually complete [09:49:29] <_joe_> so what is the command? [09:49:38] rm -fR 2013-??-??_??-??-?? [09:49:47] to delete all directories from 2013 [09:49:53] <_joe_> maybe a find -delete could be faster? [09:50:00] yeah probably [09:50:46] <_joe_> not really, no [09:50:46] + that is on gallium which has terrible disk access (there are like 10M inodes) [09:51:37] <_joe_> ok so, is that remove operation very slow? [09:51:55] <_joe_> and if so, how did you determine the problem was listing files? [09:52:21] doing /bin/ls I have to wait ? :D [09:52:44] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Fri 30 May 2014 18:25:33 UTC [09:52:46] <_joe_> oh so the command was not rm -rF :P [09:52:55] then I do a rm -rF [09:52:59] and that is slow as hell as well [09:53:07] I guess because each dir has a ton of tiny files [09:53:22] <_joe_> which could be expected [09:53:28] <_joe_> did you use the same globbing in the ls? 
[09:53:45] nop [09:53:54] <_joe_> so ls -d may be faster [09:54:18] <_joe_> oh ok, then nevermind [09:55:59] (03CR) 10Filippo Giunchedi: [C: 031] Labs: Fix beta to work with role::mediawiki [operations/puppet] - 10https://gerrit.wikimedia.org/r/134519 (owner: 10BryanDavis) [09:56:44] PROBLEM - Puppet freshness on analytics1011 is CRITICAL: Last successful Puppet run was Mon 02 Jun 2014 18:51:58 UTC [09:57:08] (03CR) 10Filippo Giunchedi: [C: 031] contint: fix resource ordering for labs slave [operations/puppet] - 10https://gerrit.wikimedia.org/r/136310 (owner: 10Hashar) [10:01:24] thoughts on https://gerrit.wikimedia.org/r/#/c/136317/ ? [10:03:44] PROBLEM - Puppet freshness on analytics1013 is CRITICAL: Last successful Puppet run was Mon 02 Jun 2014 18:59:02 UTC [10:07:44] PROBLEM - Puppet freshness on analytics1027 is CRITICAL: Last successful Puppet run was Mon 02 Jun 2014 19:03:36 UTC [10:10:44] PROBLEM - Puppet freshness on analytics1012 is CRITICAL: Last successful Puppet run was Mon 02 Jun 2014 19:06:50 UTC [10:14:44] PROBLEM - Puppet freshness on analytics1015 is CRITICAL: Last successful Puppet run was Mon 02 Jun 2014 19:10:41 UTC [10:16:44] PROBLEM - Puppet freshness on analytics1014 is CRITICAL: Last successful Puppet run was Mon 02 Jun 2014 19:12:17 UTC [10:20:44] PROBLEM - Puppet freshness on analytics1009 is CRITICAL: Last successful Puppet run was Mon 02 Jun 2014 19:16:05 UTC [10:23:44] PROBLEM - Puppet freshness on analytics1019 is CRITICAL: Last successful Puppet run was Mon 02 Jun 2014 19:19:33 UTC [10:26:44] PROBLEM - Puppet freshness on analytics1017 is CRITICAL: Last successful Puppet run was Mon 02 Jun 2014 19:22:40 UTC [10:27:44] PROBLEM - Puppet freshness on analytics1020 is CRITICAL: Last successful Puppet run was Mon 02 Jun 2014 19:23:15 UTC [10:28:44] PROBLEM - Puppet freshness on analytics1018 is CRITICAL: Last successful Puppet run was Mon 02 Jun 2014 19:24:16 UTC [10:32:44] PROBLEM - Puppet freshness on analytics1016 is CRITICAL: Last 
successful Puppet run was Mon 02 Jun 2014 19:28:14 UTC [10:46:44] morning #ops - it looks like jobs are stopped [10:48:13] ugh ugly graphs [10:49:04] https://gdash.wikimedia.org/dashboards/jobq/ [10:49:12] yeah, the green line goes away [10:50:52] ugh, let's see [10:51:10] why that jump in system cpu https://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&m=cpu_report&s=by+name&c=Jobrunners+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4 [10:52:04] Nemo_bis: useful data [10:52:35] <_joe_> Nemo_bis: I think the reason is the fact we are now restarting the jobrunner job hourly [10:53:04] <_joe_> what preoccupies me is the network traffic just disappeared, kinda [10:53:37] <_joe_> but that happened well before I merged my change fixing the restart of the jobrunners [10:53:53] _joe_: it hasn't run anything since 8:00 utc, I believe, accoring to fluorine:/a/mw-log/runJobs.log [10:54:23] <_joe_> manybubbles: lemme check [10:54:30] graphs on gdash agree [10:55:33] indeed the jobs-loop is up but seemingly not picking up jobs, or perhaps died [10:56:08] <_joe_> the jobs-loop is up so I don't get why this happened [10:56:30] <_joe_> I just fixed a change ori did yesterday night [10:57:14] I think PATH=/sbin is too strict, should be more paths in there [10:57:19] <_joe_> https://gerrit.wikimedia.org/r/#/c/136977/ at 8:30 AM [10:57:35] <_joe_> I guess my time, so around 6:30 UTC [10:57:59] <_joe_> yes [10:58:16] !log try restarting mw-job-runner on mw1012 [10:58:22] Logged the message, Master [10:58:51] yep that works btw, manually via service mw-job-runner restart [10:58:58] I see some life :) [10:59:17] <_joe_> godog: so the problem is, we need more env variables, not just PATH to make that work [10:59:46] <_joe_> godog: I'll try to restart it on all jobrunners [11:00:00] <_joe_> but then we need to understand what is needed here [11:00:13] <_joe_> (and the cronjob is exiting correctly) [11:00:43] <_joe_> so, wait until I strace some 'stuck' 
jobrunners [11:01:43] _joe_: it could be also that PATH=/sbin is too restricted, I don't think there's anything setting it explicitly in jobs-loop.sh [11:02:04] <_joe_> godog: maybe that is the reason [11:02:15] <_joe_> so let me modify by hand the cron on one server [11:04:39] <_joe_> godog: I assumed everything but the init script would have been written with full paths, which is not the case [11:04:56] <_joe_> (I assumed that because the daemon started without complaining) [11:06:12] heh, it doesn't set -e tho [11:08:03] what's the impact of no jobs processed btw? [11:08:11] Nemo_bis manybubbles ^ [11:08:48] (03PS1) 10Giuseppe Lavagetto: jobrunner: restart with full path. [operations/puppet] - 10https://gerrit.wikimedia.org/r/137002 [11:08:50] <_joe_> godog: :/ [11:09:04] <_joe_> godog: we should send an outage report, but later [11:09:53] godog: no search updates, email notifications etc. [11:11:15] <_joe_> godog: we should add -e [11:12:42] IIRC there are plans to un-break jobs-loop as a script too, e.g. rewrite it [11:13:37] <_joe_> !log restarted jobrunners as they were blocked by restarting via cron [11:13:42] Logged the message, Master [11:16:12] (03PS2) 10Giuseppe Lavagetto: jobrunner: restart with full path. [operations/puppet] - 10https://gerrit.wikimedia.org/r/137002 [11:17:46] <_joe_> Nemo_bis: thanks, good catch [11:22:57] (03CR) 10Filippo Giunchedi: [C: 032] jobrunner: restart with full path. [operations/puppet] - 10https://gerrit.wikimedia.org/r/137002 (owner: 10Giuseppe Lavagetto) [11:27:45] we're back btw [11:31:04] PROBLEM - ElasticSearch health check on elastic1004 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. 
status: red: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 2148: active_shards: 6443: relocating_shards: 2: initializing_shards: 1: unassigned_shards: 1 [11:31:04] PROBLEM - ElasticSearch health check on elastic1006 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 2148: active_shards: 6443: relocating_shards: 2: initializing_shards: 1: unassigned_shards: 1 [11:32:04] RECOVERY - ElasticSearch health check on elastic1006 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 2148: active_shards: 6443: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [11:32:04] RECOVERY - ElasticSearch health check on elastic1004 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 2148: active_shards: 6443: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [11:48:19] stupid health check [12:26:07] (03CR) 10Manybubbles: [C: 031] Remove searchidx1001 from scap targets [operations/puppet] - 10https://gerrit.wikimedia.org/r/136968 (owner: 10Chad) [12:33:36] (03PS14) 10Yuvipanda: toollabs: Add MongoDB role [operations/puppet] - 10https://gerrit.wikimedia.org/r/135442 [12:34:32] (03PS1) 10Manybubbles: Allow Elasticsearch java version to float again [operations/puppet] - 10https://gerrit.wikimedia.org/r/137008 [12:53:48] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Fri 30 May 2014 18:25:33 UTC [12:57:44] PROBLEM - Puppet freshness on analytics1011 is CRITICAL: Last successful Puppet run was Mon 02 Jun 2014 18:51:58 UTC [13:00:31] _joe_ let me know if you have a minute to talk about graphite, thresholds and monitoring 
[13:00:53] <_joe_> nuria: I do [13:00:58] <_joe_> in ~ 3 minutes [13:01:06] k [13:01:20] <_joe_> I'll ping you back asap [13:04:49] PROBLEM - Puppet freshness on analytics1013 is CRITICAL: Last successful Puppet run was Mon 02 Jun 2014 18:59:02 UTC [13:05:21] !log salt * start procps [13:05:26] Logged the message, Master [13:05:28] bblack: ^ [13:08:44] PROBLEM - Puppet freshness on analytics1027 is CRITICAL: Last successful Puppet run was Mon 02 Jun 2014 19:03:36 UTC [13:09:13] Deskana|Away: Did you intend to add those hovercard fixes to Thursday's SWAT rather than today's? [13:10:37] <_joe_> nuria: here I am [13:11:13] ok, I have a question about teh monitor_graphite_threshold [13:11:30] handy wrapper thta you wrote [13:11:34] *that [13:11:44] PROBLEM - Puppet freshness on analytics1012 is CRITICAL: Last successful Puppet run was Mon 02 Jun 2014 19:06:50 UTC [13:12:09] would you advise including it on the role definition or elsewhere? [13:12:47] like in the module of the 'feature' in question [13:13:02] i was about to enter the threshold on eventlogging.pp role [13:13:59] <_joe_> nuria: I'd say in the roles [13:14:19] <_joe_> oh, wait, look at what I did with mediawiki [13:14:50] <_joe_> I created a subclass of the mediawiki role [13:15:03] <_joe_> which is then included by the graphite role [13:15:06] I have only seen this one: https://github.com/wikimedia/operations-puppet/blob/production/manifests/role/graphite.pp#L201 [13:15:17] let me see [13:15:45] PROBLEM - Puppet freshness on analytics1015 is CRITICAL: Last successful Puppet run was Mon 02 Jun 2014 19:10:41 UTC [13:16:52] _joe_: where is teh meadiwiki one defined? 
[13:16:55] *the [13:17:00] <_joe_> no, sorry, it's in the mediawiki module [13:17:03] i do not find it in teh puppet repo [13:17:30] let me see, i do not work with mediawiki code but i must have it somewhere [13:17:44] PROBLEM - Puppet freshness on analytics1014 is CRITICAL: Last successful Puppet run was Mon 02 Jun 2014 19:12:17 UTC [13:18:32] <_joe_> nuria: not merged yet :P https://gerrit.wikimedia.org/r/#/c/136292/ [13:18:37] ahhh [13:18:41] let me see [13:19:37] aha [13:19:48] * _joe_ d'oh [13:20:03] <_joe_> ok *that* is the way to import checks [13:20:03] <_joe_> *if* your metric is not a global one but runs per-server [13:20:03] <_joe_> then put the check in the relevant role [13:20:31] (03PS2) 10Giuseppe Lavagetto: monitoring: monitor mediawiki jobs [operations/puppet] - 10https://gerrit.wikimedia.org/r/136292 [13:20:55] ok, will do and add you to CR if that is OK [13:21:03] <_joe_> btw, going to modify and merge it [13:21:18] ok, i will keep an eye on it [13:21:34] you did not have an equivalent commit for vagrant right? [13:21:39] it was purely production [13:21:44] PROBLEM - Puppet freshness on analytics1009 is CRITICAL: Last successful Puppet run was Mon 02 Jun 2014 19:16:05 UTC [13:23:10] <_joe_> nope sorry [13:24:44] PROBLEM - Puppet freshness on analytics1019 is CRITICAL: Last successful Puppet run was Mon 02 Jun 2014 19:19:33 UTC [13:26:23] _joe_: the presence of the class in teh role "enables" the checks right? [13:26:33] there is no other code needed, correct? [13:27:44] PROBLEM - Puppet freshness on analytics1017 is CRITICAL: Last successful Puppet run was Mon 02 Jun 2014 19:22:40 UTC [13:28:18] <_joe_> nuria: ? 
[13:28:30] <_joe_> not sure I got what you meant [13:28:44] PROBLEM - Puppet freshness on analytics1020 is CRITICAL: Last successful Puppet run was Mon 02 Jun 2014 19:23:15 UTC [13:29:43] _joe_: that checks become effective once deployed [13:29:44] PROBLEM - Puppet freshness on analytics1018 is CRITICAL: Last successful Puppet run was Mon 02 Jun 2014 19:24:16 UTC [13:29:54] (03PS3) 10Giuseppe Lavagetto: monitoring: monitor mediawiki jobs [operations/puppet] - 10https://gerrit.wikimedia.org/r/136292 [13:30:19] <_joe_> nuria: checks are effective as soon as you include the class including them in some node [13:30:35] ok, gracias...... [13:31:44] (03CR) 10Giuseppe Lavagetto: [C: 032] monitoring: monitor mediawiki jobs [operations/puppet] - 10https://gerrit.wikimedia.org/r/136292 (owner: 10Giuseppe Lavagetto) [13:33:44] PROBLEM - Puppet freshness on analytics1016 is CRITICAL: Last successful Puppet run was Mon 02 Jun 2014 19:28:14 UTC [13:46:29] (03CR) 10Ottomata: [C: 032] use analytics-users group vs. stats group [operations/puppet] - 10https://gerrit.wikimedia.org/r/136919 (owner: 10Dzahn) [13:47:46] (03PS2) 10Ottomata: Allow Elasticsearch java version to float again [operations/puppet] - 10https://gerrit.wikimedia.org/r/137008 (owner: 10Manybubbles) [13:47:53] (03CR) 10Ottomata: [C: 032 V: 032] Allow Elasticsearch java version to float again [operations/puppet] - 10https://gerrit.wikimedia.org/r/137008 (owner: 10Manybubbles) [13:56:08] (03CR) 10Andrew Bogott: [C: 031] "This looks fine, but I can also just change the l10nupdate_gid in ldap so that prod and labs match -- would that be helpful?" [operations/puppet] - 10https://gerrit.wikimedia.org/r/134519 (owner: 10BryanDavis) [14:00:05] <_joe_> andrewbogott: I'm breaking things on testlabs-puppet2-* :) [14:00:20] _joe_: they're all yours! 
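Editor's note on the jobrunner discussion above (10:57 to 11:08): the suspected cause was that the cron-driven restart ran with PATH=/sbin, while the scripts it started did not use full paths. A minimal sketch of how one might check which commands resolve under such a restricted PATH follows. The command list and the helper name `missing_commands` are hypothetical, chosen for illustration; they are not taken from jobs-loop.sh or the puppet change.

```python
import shutil


def missing_commands(commands, path):
    """Return the commands that cannot be resolved using only `path`.

    `path` is a PATH-style string, e.g. "/sbin" or "/usr/bin:/bin".
    """
    return [cmd for cmd in commands if shutil.which(cmd, path=path) is None]


# Hypothetical list of tools a wrapper script might call without full paths.
needed = ["sh", "ls", "nice"]

# The restricted PATH mentioned in the log; results depend on the host layout.
print(missing_commands(needed, "/sbin"))
```

Any command reported here would fail to start from a cron entry with that PATH unless it is called by its full path, which is the approach the "restart with full path" change took.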
[14:03:20] (03PS4) 10Ori.livneh: Add custom Diamond collector for RCStream [operations/puppet] - 10https://gerrit.wikimedia.org/r/136621 [14:04:07] (03CR) 10Ori.livneh: [C: 032 V: 032] Add custom Diamond collector for RCStream [operations/puppet] - 10https://gerrit.wikimedia.org/r/136621 (owner: 10Ori.livneh) [14:08:47] (03CR) 10Ori.livneh: [C: 031] rcstream: add lvs, modify mountpoint [operations/puppet] - 10https://gerrit.wikimedia.org/r/136990 (owner: 10Giuseppe Lavagetto) [14:12:18] (03PS2) 10Rush: use analytics-users group vs. stats group [operations/puppet] - 10https://gerrit.wikimedia.org/r/136919 (owner: 10Dzahn) [14:12:42] (03CR) 10Rush: [C: 032 V: 032] "daniel, I'm merging this just because puppet is actually broken in places and this is the fix. Hope that's cool" [operations/puppet] - 10https://gerrit.wikimedia.org/r/136919 (owner: 10Dzahn) [14:13:42] (03Abandoned) 10Rush: (WIP) Completely overhaul admins.pp & modularize [operations/puppet] - 10https://gerrit.wikimedia.org/r/107848 (owner: 10Faidon Liambotis) [14:14:34] PROBLEM - Disk space on analytics1012 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/c 83955 MB (4% inode=99%): /var/lib/hadoop/data/d 109375 MB (5% inode=99%): /var/lib/hadoop/data/e 123273 MB (6% inode=99%): /var/lib/hadoop/data/f 74729 MB (3% inode=99%): /var/lib/hadoop/data/h 89954 MB (4% inode=99%): /var/lib/hadoop/data/i 115407 MB (6% inode=99%): /var/lib/hadoop/data/j 109946 MB (5% inode=99%): /var/lib/hadoop/da [14:16:44] RECOVERY - Puppet freshness on analytics1009 is OK: puppet ran at Tue Jun 3 14:16:34 UTC 2014 [14:18:34] PROBLEM - Disk space on analytics1012 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/c 83877 MB (4% inode=99%): /var/lib/hadoop/data/d 109007 MB (5% inode=99%): /var/lib/hadoop/data/e 122960 MB (6% inode=99%): /var/lib/hadoop/data/f 74792 MB (3% inode=99%): /var/lib/hadoop/data/h 89397 MB (4% inode=99%): /var/lib/hadoop/data/i 113085 MB (6% inode=99%): /var/lib/hadoop/data/j 
107785 MB (5% inode=99%): /var/lib/hadoop/da [14:19:24] RECOVERY - Puppet freshness on analytics1019 is OK: puppet ran at Tue Jun 3 14:19:17 UTC 2014 [14:21:54] RECOVERY - Puppet freshness on analytics1011 is OK: puppet ran at Tue Jun 3 14:21:51 UTC 2014 [14:22:14] RECOVERY - Puppet freshness on analytics1017 is OK: puppet ran at Tue Jun 3 14:22:11 UTC 2014 [14:23:34] RECOVERY - Puppet freshness on analytics1020 is OK: puppet ran at Tue Jun 3 14:23:28 UTC 2014 [14:24:34] RECOVERY - Puppet freshness on analytics1018 is OK: puppet ran at Tue Jun 3 14:24:24 UTC 2014 [14:26:32] anomie: I did. Wanted to let it sit for a few days just in case. [14:26:46] (03PS2) 10Faidon Liambotis: mediawiki: add python-imaging to required packages [operations/puppet] - 10https://gerrit.wikimedia.org/r/136007 (owner: 10Filippo Giunchedi) [14:26:50] (03PS3) 10Faidon Liambotis: mediawiki: add python-imaging to required packages [operations/puppet] - 10https://gerrit.wikimedia.org/r/136007 (owner: 10Filippo Giunchedi) [14:26:51] Deskana: Ok, just checking. [14:26:56] (03CR) 10Faidon Liambotis: [C: 032 V: 032] mediawiki: add python-imaging to required packages [operations/puppet] - 10https://gerrit.wikimedia.org/r/136007 (owner: 10Filippo Giunchedi) [14:26:57] anomie: Thanks. 
:) [14:27:04] RECOVERY - Puppet freshness on analytics1016 is OK: puppet ran at Tue Jun 3 14:27:00 UTC 2014 [14:28:44] RECOVERY - Puppet freshness on analytics1013 is OK: puppet ran at Tue Jun 3 14:28:37 UTC 2014 [14:33:54] RECOVERY - Puppet freshness on analytics1027 is OK: puppet ran at Tue Jun 3 14:33:49 UTC 2014 [14:36:44] RECOVERY - Puppet freshness on analytics1012 is OK: puppet ran at Tue Jun 3 14:36:36 UTC 2014 [14:41:14] RECOVERY - Puppet freshness on analytics1015 is OK: puppet ran at Tue Jun 3 14:41:06 UTC 2014 [14:41:44] RECOVERY - Puppet freshness on analytics1014 is OK: puppet ran at Tue Jun 3 14:41:41 UTC 2014 [14:41:52] (03CR) 10Dzahn: "@20after4: You said you've been testing phab with vagrant and worked on puppetizing. I've been wondering..should we merge the initial comm" [operations/puppet] - 10https://gerrit.wikimedia.org/r/132505 (owner: 10Dzahn) [14:45:04] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:45:38] (03CR) 10Rush: "@dzahn I think let's not for now? I'm not sure what the end product here will look like. Gonig to try to sync up with mukunda this week " [operations/puppet] - 10https://gerrit.wikimedia.org/r/132505 (owner: 10Dzahn) [14:45:54] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.017 second response time [14:47:12] (03CR) 10Dzahn: "@20after4: re database questions: for the actual production phab we have to request them from springle (dba). 
for labs, either a db instan" [operations/puppet] - 10https://gerrit.wikimedia.org/r/132505 (owner: 10Dzahn) [14:48:21] (03CR) 10Dzahn: initial commit for a phabricator module (WIP) (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/132505 (owner: 10Dzahn) [14:49:16] <_joe_> twentyafterfour, mutante : please test the phabricator module with puppet 3 :) [14:54:19] (03PS1) 10Hashar: admin: files for /home/hashar [operations/puppet] - 10https://gerrit.wikimedia.org/r/137019 [14:55:34] chasemp: hey! would puppet manage my home now? I have created a lame change to populate some files in my /home/hashar but not sure that is how it works ( https://gerrit.wikimedia.org/r/#/c/137019/ ) [14:55:59] hashar: it should, details here http://git.wikimedia.org/blob/operations%2Fpuppet.git/65d82b3f8d06bd8087e5f083e7ccf75612748591/modules%2Fadmin%2FREADME [14:56:46] <_joe_> oh so I can add cyberwarfare.pl on all servers, neat! [14:56:55] gi11es: Ping, SWAT in about 3.5 minutes [14:57:01] <_joe_> I'm sure this will trigger the NSA alarms [14:57:04] anomie: pong [14:57:33] <_joe_> anomie: you are swatting? if so, can you verify swat includes mw1053? [14:57:42] _joe_: Sure [14:57:54] <_joe_> anomie: and that it works correctly, of course :) [14:58:05] <_joe_> I'd like to put that server back in rotation ASAP [14:58:18] chasemp: yeah that is smart :D [14:58:53] chasemp: I guess folks will complain when I start putting in my whole git repo in there [14:58:53] _joe_: Can I assume that the SWAT will be including mw1053 unless there's an error message about it when scapping, or do I need to manually do that server? [14:59:08] hashar: assuming it's all reasonable stuff I don't think so ? 
I have pretty much the same amount of stuff I bet :) [14:59:25] I want to pull all that from a fileshare (puppet) thing and export it, etc [14:59:28] but this is ok for now I think [14:59:38] <_joe_> anomie: it should be automatically included I think [14:59:44] _joe_: ok [15:00:04] manybubbles, anomie: The time is nigh to deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140603T1500) [15:00:17] * anomie begins SWAT [15:00:51] chasemp: I can survive with just a few dot files :D that is already going to be a huge improvement [15:00:55] chasemp: thanks! [15:01:26] gi11es: I'm going to do the config change first, while I look up whether there's a decent alternative to scap for the i18n change [15:01:33] (03CR) 10Anomie: [C: 032] Lower sampling for enwiki and dewiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/136717 (owner: 10Gilles) [15:01:52] (03Merged) 10jenkins-bot: Lower sampling for enwiki and dewiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/136717 (owner: 10Gilles) [15:02:12] anomie: enwiki folks have told us the i18n change is a blocker for launch. but it can happen in the launch window later today if you prefer [15:02:23] (03CR) 10Rush: "HI! this can be added to the existing admin module now. 
Details are here:" [operations/puppet] - 10https://gerrit.wikimedia.org/r/76678 (owner: 10Tim Starling) [15:03:04] (03CR) 10Hashar: [C: 031] beta: Remove File['/usr/local/apache/common'] from ::beta::common [operations/puppet] - 10https://gerrit.wikimedia.org/r/136963 (owner: 10BryanDavis) [15:03:23] (03CR) 10Rush: [C: 032 V: 032] admin: files for /home/hashar [operations/puppet] - 10https://gerrit.wikimedia.org/r/137019 (owner: 10Hashar) [15:03:34] !log anomie Synchronized wmf-config/InitialiseSettings.php: SWAT: Lower MediaViewer sampling for enwiki and dewiki [[gerrit:136717]] (duration: 00m 14s) [15:03:39] Logged the message, Master [15:03:46] gi11es: ^ Test please [15:03:51] hashar: assumed you needed someone to merge that? all good [15:04:05] chasemp: indeed. I have no root :D [15:04:48] that will definitely please tim [15:05:01] hashar: let me know how taht works out [15:05:29] notice: /File[/home/hashar/.gitconfig]/content: [15:05:31] works fine! [15:06:30] anomie: I don't see the effect on enwiki [15:06:40] anomie: as in, the new value isn't there, still the old [15:06:46] I'll check dewiki [15:06:52] (03PS2) 10Rush: rm old admins::mortals class, replaced by yaml [operations/puppet] - 10https://gerrit.wikimedia.org/r/136913 (owner: 10Matanya) [15:07:00] gi11es: That's the sampling rate change, not the i18n [15:07:06] anomie: yes [15:07:09] (03CR) 10Rush: [C: 032 V: 032] "gtg thanks matanya" [operations/puppet] - 10https://gerrit.wikimedia.org/r/136913 (owner: 10Matanya) [15:07:15] anomie: sampling rate is still the old value [15:08:56] anomie: I'm seeing the new sampling factor on dewiki [15:09:39] gi11es: But still not enwiki? [15:09:50] anomie: seems like it's working now on enwiki, I'm double checking [15:10:35] anomie: yep, confirmed, works on both enwiki and dewiki. all clear on that changeset [15:10:38] _joe_: It seems I can't log into mw1053 to check it myself. 
If you want to check, line 10869 of /usr/local/apache/common-local/wmf-config/InitialiseSettings.php (I think that's the right path) should be "'dewiki' => array(" [15:10:55] (03PS6) 10Alexandros Kosiaris: dns recurses: add firewll [operations/puppet] - 10https://gerrit.wikimedia.org/r/133515 (owner: 10Matanya) [15:11:39] <_joe_> anomie: ok I'll do in a few [15:11:56] gi11es: Doing the other one now [15:13:20] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] dns recurses: add firewll [operations/puppet] - 10https://gerrit.wikimedia.org/r/133515 (owner: 10Matanya) [15:16:42] (03PS1) 10Rush: admins.pp deprecation warning [operations/puppet] - 10https://gerrit.wikimedia.org/r/137023 [15:17:04] PROBLEM - Swift HTTP backend on ms-fe1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:17:14] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 21.43% of data exceeded the critical threshold [500.0] [15:17:24] PROBLEM - Apache HTTP on mw1153 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:17:24] PROBLEM - Swift HTTP backend on ms-fe1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:17:24] PROBLEM - Swift HTTP frontend on ms-fe1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:17:51] 500s went way up https://gdash.wikimedia.org/dashboards/reqerror/ [15:17:57] anomie: ^ [15:17:59] gi11es: ^ [15:18:15] RECOVERY - Swift HTTP backend on ms-fe1003 is OK: HTTP OK: HTTP/1.1 200 OK - 343 bytes in 1.008 second response time [15:18:15] RECOVERY - Swift HTTP frontend on ms-fe1001 is OK: HTTP OK: HTTP/1.1 200 OK - 137 bytes in 0.004 second response time [15:18:27] greg-g: https://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&title=pdns+questions&vl=&x=&n=&hreg[]=%28chromium%7Chydrogen%29.wikimedia.org&mreg[]=pdns_questions&gtype=stack&glegend=show&aggregate=1&embed=1&_=1401808138308 [15:18:34] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:18:36] that is
the reason, my fault. already fixed [15:18:40] whew [15:18:40] <_joe_> hey, what was that? [15:18:44] PROBLEM - Apache HTTP on mw1154 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:18:50] greg-g: Seems unrelated [15:18:54] RECOVERY - Swift HTTP backend on ms-fe1001 is OK: HTTP OK: HTTP/1.1 200 OK - 343 bytes in 0.025 second response time [15:19:01] anomie: yeah, sorry, go on [15:19:11] akosiaris: what are chromium/hydrogen? [15:19:14] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.069 second response time [15:19:21] recursive dns servers [15:19:24] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 68872 bytes in 0.379 second response time [15:19:24] ah [15:19:34] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.083 second response time [15:20:09] <_joe_> akosiaris: how do you explain that? [15:20:29] I just merged 43557e4 [15:20:45] which had a typo I introduced and did not set the correct rules [15:21:06] <_joe_> oh ok [15:21:11] I immediately reverted to a non firewall state but icinga was quicker [15:21:43] <_joe_> eh, play with dns and firewalls [15:21:47] <_joe_> the two most delicate things we manage :P [15:22:17] Missing semicolon before "}" [15:22:18] <_joe_> (meaning even a small error results in catastrophic outages) [15:22:18] :-( [15:22:30] a crappy semicolon... 
sigh [15:22:30] <_joe_> akosiaris: shit happens [15:23:51] !log anomie Started scap: SWAT: Update i18n for MultimediaViewer [[gerrit:136718]] [15:23:55] Logged the message, Master [15:24:01] (03CR) 10Dzahn: [C: 032] admins.pp deprecation warning [operations/puppet] - 10https://gerrit.wikimedia.org/r/137023 (owner: 10Rush) [15:24:34] RECOVERY - Disk space on analytics1012 is OK: DISK OK [15:24:37] (03PS1) 10Giuseppe Lavagetto: puppet3: make puppet::self::master work in puppet 3 [operations/puppet] - 10https://gerrit.wikimedia.org/r/137025 [15:25:19] (03PS1) 10Reedy: Stop sending IRC RC to PMTPA [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/137026 [15:25:21] (03PS1) 10Alexandros Kosiaris: Fix missing semicolon in ferm in role::dns::recursor [operations/puppet] - 10https://gerrit.wikimedia.org/r/137027 [15:25:25] (03PS1) 10Rush: adding user notes to admin yaml [operations/puppet] - 10https://gerrit.wikimedia.org/r/137028 [15:25:36] (03PS2) 10Rush: admins.pp deprecation warning [operations/puppet] - 10https://gerrit.wikimedia.org/r/137023 [15:25:44] (03CR) 10Rush: [C: 032 V: 032] admins.pp deprecation warning [operations/puppet] - 10https://gerrit.wikimedia.org/r/137023 (owner: 10Rush) [15:25:56] (03PS2) 10Rush: adding user notes to admin yaml [operations/puppet] - 10https://gerrit.wikimedia.org/r/137028 [15:26:01] (03CR) 10Rush: [C: 032 V: 032] adding user notes to admin yaml [operations/puppet] - 10https://gerrit.wikimedia.org/r/137028 (owner: 10Rush) [15:26:28] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Fix missing semicolon in ferm in role::dns::recursor [operations/puppet] - 10https://gerrit.wikimedia.org/r/137027 (owner: 10Alexandros Kosiaris) [15:31:42] there is a rainman shell account, can't find on contact list not sure who that is on irc? rainman are you out there? 
[15:31:53] He's not really about anymore [15:32:01] He used to do the old search infrastructure [15:33:14] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% data above the threshold [250.0] [15:33:30] bd808, ^d, _joe_: scap for the SWAT tells me that sync-common on searchidx1001 complained about no space left. [15:33:40] thanks Reedy I'm going to purge them and they can be readded down teh road [15:33:58] anomie: Known issue. Chad has a patch pending to drop it from the scap entirely [15:34:37] (03PS1) 10Alexandros Kosiaris: Specify table and chain for role::dns::recursor notrack rules [operations/puppet] - 10https://gerrit.wikimedia.org/r/137029 [15:35:19] bd808: I also got an unhandled error, command appears to have been "sudo -u mwdeploy -n -- /usr/bin/rsync --archive --delete-delay --delay-updates --compress --delete --exclude=**/.svn/lock --exclude=**/.git/objects --exclude=**/.git/**/objects --exclude=**/cache/l10n/*.cdb --no-perms mw1161.eqiad.wmnet::common /usr/local/apache/common-local", returned non-zero exit status 12 [15:36:16] (03CR) 10Alexandros Kosiaris: [C: 032] Specify table and chain for role::dns::recursor notrack rules [operations/puppet] - 10https://gerrit.wikimedia.org/r/137029 (owner: 10Alexandros Kosiaris) [15:36:37] anomie: "12 Error in rsync protocol data stream" so the rsync glitched talking to the server. Can you tell which host that was on? [15:37:36] bd808: Only that I see "mw1161" in the middle of the command there. [15:38:26] I'll look it up on fluorine. 
That hostname is the rsync server that was being talked to [15:38:56] bd808: https://dpaste.de/dJc2 is the raw text from the console so far [15:39:51] (03PS1) 10Rush: rainman account absented [operations/puppet] - 10https://gerrit.wikimedia.org/r/137030 [15:40:39] anomie: I think all of those error messages are about searchidx1001.eqiad.wmnet being out of disk space [15:40:57] (03CR) 10jenkins-bot: [V: 04-1] rainman account absented [operations/puppet] - 10https://gerrit.wikimedia.org/r/137030 (owner: 10Rush) [15:40:58] Reedy: thanks on https://gerrit.wikimedia.org/r/#/c/136964/ [15:41:47] !log anomie Finished scap: SWAT: Update i18n for MultimediaViewer [[gerrit:136718]] (duration: 17m 56s) [15:41:52] Logged the message, Master [15:41:56] gi11es: ^ Check please [15:42:06] (03CR) 10Rush: [C: 031] "looks good to me." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/137026 (owner: 10Reedy) [15:42:50] (03CR) 10Dzahn: [C: 031] rainman account absented [operations/puppet] - 10https://gerrit.wikimedia.org/r/137030 (owner: 10Rush) [15:42:56] (03PS2) 10Dzahn: rainman account absented [operations/puppet] - 10https://gerrit.wikimedia.org/r/137030 (owner: 10Rush) [15:42:59] anomie: works, verified on commons [15:43:09] (03CR) 10Rush: [C: 032 V: 032] rainman account absented [operations/puppet] - 10https://gerrit.wikimedia.org/r/137030 (owner: 10Rush) [15:43:18] * anomie is done with SWAT [15:43:46] (03CR) 10BryanDavis: "@Andrew Bogott Changing the gid to match in ldap and prod would be very helpful. See bug 65588." [operations/puppet] - 10https://gerrit.wikimedia.org/r/134519 (owner: 10BryanDavis) [15:44:55] anomie: thanks! 
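Editor's note on the rsync failure above: the "12 Error in rsync protocol data stream" text quoted at 15:36:37 matches the exit-value table in the rsync manual. A small lookup like the following sketch can make such failures easier to read in deployment output. The helper name `describe_rsync_exit` is hypothetical, the table is a partial copy of the manual's list, and nothing here is part of scap.

```python
# Partial table of rsync exit values, as listed in the rsync manual page.
RSYNC_EXIT_CODES = {
    0: "Success",
    1: "Syntax or usage error",
    2: "Protocol incompatibility",
    10: "Error in socket I/O",
    11: "Error in file I/O",
    12: "Error in rsync protocol data stream",
    23: "Partial transfer due to error",
    24: "Partial transfer due to vanished source files",
    30: "Timeout in data send/receive",
}


def describe_rsync_exit(code):
    """Map an rsync exit status to its documented meaning, if known."""
    return RSYNC_EXIT_CODES.get(code, f"Unknown rsync exit status {code}")


print(describe_rsync_exit(12))
```

The code alone does not identify the cause; in this log the working theory was that the receiving host (searchidx1001) had run out of disk space.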
[15:44:59] !log merged https://gerrit.wikimedia.org/r/#/c/133515/ which enabled ferm on hydrogen/chromium [15:45:04] Logged the message, Master [15:47:40] (03PS1) 10Rush: syntax error in admins.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/137031 [15:49:00] anomie: thank you [15:51:11] (03PS2) 10Dzahn: syntax error in admins.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/137031 (owner: 10Rush) [15:52:23] (03CR) 10jenkins-bot: [V: 04-1] syntax error in admins.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/137031 (owner: 10Rush) [15:52:25] (03PS3) 10Dzahn: syntax error in admins.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/137031 (owner: 10Rush) [15:54:23] (03CR) 10Dzahn: "recheck" [operations/puppet] - 10https://gerrit.wikimedia.org/r/137031 (owner: 10Rush) [15:54:44] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Fri 30 May 2014 18:25:33 UTC [15:56:20] (03CR) 10Dzahn: [C: 032] syntax error in admins.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/137031 (owner: 10Rush) [15:56:38] (03CR) 10Rush: [C: 032] syntax error in admins.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/137031 (owner: 10Rush) [16:09:04] (03CR) 10Giuseppe Lavagetto: [C: 031] contint: fix resource ordering for labs slave [operations/puppet] - 10https://gerrit.wikimedia.org/r/136310 (owner: 10Hashar) [16:12:21] i think the localization cache on enwiki beta-hhvm is broken [16:12:45] (03CR) 10Chad: [C: 031] "lgtm, merge when you're ready." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/135088 (owner: 10MaxSem) [16:14:35] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "You're sadly right, hence -1 on this patch. But!" 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/136820 (owner: 10Ori.livneh) [16:15:57] (03PS2) 10Giuseppe Lavagetto: [gdash] Add yearly graphs for frontend performance [operations/puppet] - 10https://gerrit.wikimedia.org/r/136631 (owner: 10Nemo bis) [16:16:19] (03CR) 10Giuseppe Lavagetto: [C: 032] [gdash] Add yearly graphs for frontend performance [operations/puppet] - 10https://gerrit.wikimedia.org/r/136631 (owner: 10Nemo bis) [16:18:01] (03PS15) 10Yuvipanda: toollabs: Add MongoDB role [operations/puppet] - 10https://gerrit.wikimedia.org/r/135442 [16:19:44] PROBLEM - Puppet freshness on tungsten is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 13:19:16 UTC [16:21:54] RECOVERY - Puppet freshness on labstore1001 is OK: puppet ran at Tue Jun 3 16:21:49 UTC 2014 [17:00:04] PROBLEM - Host labstore1001 is DOWN: PING CRITICAL - Packet loss = 100% [17:00:04] jackmcbarn: I wouldn't be too surprised if that was the case. The beta-hhvm instances haven't gotten much attention for the last month. I'll take a look in a bit when I'm out of meetings. [17:00:07] Nemo_bis: there's a bug affecting puppet on tungsten so your change won't be visible for a bit [17:00:07] just fyi [17:00:07] .seen grrrit-wm [17:00:07] mutante: labs has network issues [17:00:07] correction: labs has labs issues :P [17:00:07] YuviPanda: oh, i should have known it's related. 
thx [17:00:07] paravoid: :D [17:00:07] mutante: I'll make sure it comes back up once labs is back up [17:00:07] YuviPanda: thank you [17:00:07] RECOVERY - Host labstore1001 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [17:03:43] (03CR) 10Andrew Bogott: [C: 032] toollabs: Add MongoDB role [operations/puppet] - 10https://gerrit.wikimedia.org/r/135442 (owner: 10Yuvipanda) [17:04:56] (03PS2) 10Withoutaname: Remove flaggedrevs-specific user groups from mediawiki.org [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/134969 [17:04:58] (03PS4) 10Withoutaname: Create 'noratelimit' user group on dewiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/130809 (https://bugzilla.wikimedia.org/57819) [17:05:00] (03CR) 10Dzahn: [C: 032] Remove searchidx1001 from scap targets [operations/puppet] - 10https://gerrit.wikimedia.org/r/136968 (owner: 10Chad) [17:05:08] (03PS1) 10Ori.livneh: Fix typo in 2b16da9b4137 [operations/puppet] - 10https://gerrit.wikimedia.org/r/137038 [17:05:16] (03CR) 10Rush: [C: 031] "thanks man, you are quick" [operations/puppet] - 10https://gerrit.wikimedia.org/r/137038 (owner: 10Ori.livneh) [17:05:18] (03CR) 10Ori.livneh: [C: 032 V: 032] Fix typo in 2b16da9b4137 [operations/puppet] - 10https://gerrit.wikimedia.org/r/137038 (owner: 10Ori.livneh) [17:05:26] * ori pats grrrit-wm [17:07:10] ori: ah ok; no worries, I want to add two more (sets of) graphs before putting that into use anyway :) [17:08:11] ori: woo, didn't lose any changes during the outage! :) [17:08:20] * YuviPanda pats grrrit-wm [17:08:37] (03CR) 10Manybubbles: [C: 031] "Same as Chad." 
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/135088 (owner: 10MaxSem) [17:12:00] YuviPanda: :) and it did not miss anything:) [17:12:05] (03PS1) 10Ori.livneh: mediawiki::monitor::graphite: monitor thresholds over 1hr interval [operations/puppet] - 10https://gerrit.wikimedia.org/r/137043 [17:12:15] chasemp, _joe_ ^ [17:12:37] some minor linting in that change too [17:16:09] (03PS2) 10Dzahn: run all maintenance crons as apache user [operations/puppet] - 10https://gerrit.wikimedia.org/r/136118 [17:20:11] (03PS1) 10Ori.livneh: mediawiki::sync: Drop File['/apache'] [operations/puppet] - 10https://gerrit.wikimedia.org/r/137045 [17:29:17] (03PS1) 10Ori.livneh: Move Nrpe::Monitor_service[twemproxy] to twemproxy::monitoring [operations/puppet] - 10https://gerrit.wikimedia.org/r/137047 [17:31:38] ori, I'm back to looking at txstatsd; we're currently configured to use the ConfigurableMessageProcessor as you pointed out last week; do you know why? That processor is expressly not StatsD compliant? [17:32:18] (03PS1) 10Dzahn: add newline to researchdb pw file, easier to read [operations/puppet] - 10https://gerrit.wikimedia.org/r/137049 [17:32:19] e.g. 
if we set statsd-compliance=1 or processor=MessageProcessor we'd get much closer behaviour to the original protocol [17:32:48] (03CR) 10Dzahn: [C: 032] add newline to researchdb pw file, easier to read [operations/puppet] - 10https://gerrit.wikimedia.org/r/137049 (owner: 10Dzahn) [17:34:24] (03PS1) 10Rush: admin yam to aluminium.wikimedia.org [operations/puppet] - 10https://gerrit.wikimedia.org/r/137050 [17:34:41] (03CR) 10Rush: [C: 032 V: 032] admin yam to aluminium.wikimedia.org [operations/puppet] - 10https://gerrit.wikimedia.org/r/137050 (owner: 10Rush) [17:35:29] (03PS9) 10Ori.livneh: Add rsyslog module and port existing usage [operations/puppet] - 10https://gerrit.wikimedia.org/r/135447 [17:35:59] (03PS1) 10Dzahn: delete duplicate sync-apache, now in module [operations/puppet] - 10https://gerrit.wikimedia.org/r/137051 [17:36:01] (03CR) 10Nemo bis: [C: 031] Remove flaggedrevs-specific user groups from mediawiki.org [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/134969 (owner: 10Withoutaname) [17:36:13] mwalker: i think my reasoning was frivolous, something about the statsd one forcing a 'count' suffix on all count metrics or something like that [17:36:32] mwalker: but you should ask chasemp if he intends to replace it with another statsd daemon, i think he was looking to do that [17:37:21] err: Could not request certificate: The certificate retrieved from the master does not match the agent's private key. [17:37:24] wtf [17:37:32] not root? [17:37:32] ori, ah ya; it has all sorts of fun prefixes :p [17:37:46] paravoid: the rsyslog module update above declares kafkatee's rsyslog conf too. this leaves /etc/rsyslog.d/postfix on dataset2 which isn't puppetized and should be clobbered [17:37:51] ori: oooh, of course:) not used to the non-root login yet,, haha [17:40:09] (03CR) 10Ori.livneh: "PS8/9 declares varnishkafka and kafkatee's rsyslog confs too. 
this leaves /etc/rsyslog.d/postfix on dataset2 which isn't puppetized and s" [operations/puppet] - 10https://gerrit.wikimedia.org/r/135447 (owner: 10Ori.livneh) [17:43:41] mutante: could you look at https://gerrit.wikimedia.org/r/#/c/136963/ and https://gerrit.wikimedia.org/r/#/c/136830/ possibly? [17:47:10] (03CR) 10Dzahn: [C: 032] "yes, defined in sync.pp as the same link" [operations/puppet] - 10https://gerrit.wikimedia.org/r/136963 (owner: 10BryanDavis) [17:47:25] (03PS1) 10Krinkle: Set $wgIncludejQueryMigrate = true; for all wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/137053 (https://bugzilla.wikimedia.org/44740) [17:47:41] (03CR) 10jenkins-bot: [V: 04-1] Set $wgIncludejQueryMigrate = true; for all wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/137053 (https://bugzilla.wikimedia.org/44740) (owner: 10Krinkle) [17:48:15] ori: one yes, one no :) [17:48:44] (03CR) 10Reedy: [C: 04-1] Set $wgIncludejQueryMigrate = true; for all wikis (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/137053 (https://bugzilla.wikimedia.org/44740) (owner: 10Krinkle) [17:48:46] (03PS2) 10Krinkle: Set $wgIncludejQueryMigrate = true; for all wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/137053 (https://bugzilla.wikimedia.org/44740) [17:48:56] mutante: the lines changed in the diff are misleading, it's just pep8 fixes. but no worries, thanks for the review on bd808|BUFFER's patch [17:49:08] (03CR) 10Krinkle: "Thx Jenkins." 
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/137053 (https://bugzilla.wikimedia.org/44740) (owner: 10Krinkle) [17:49:54] (03CR) 10Dzahn: [C: 032] run all maintenance crons as apache user [operations/puppet] - 10https://gerrit.wikimedia.org/r/136118 (owner: 10Dzahn) [17:51:37] (03PS1) 10Rush: admin on aluminum [operations/puppet] - 10https://gerrit.wikimedia.org/r/137055 [17:54:18] chasemp: https://gerrit.wikimedia.org/r/#/c/137043/ will unbreak puppet on tungsten (fixes _joe_'s change) [17:57:19] (03PS1) 10Dzahn: Merge "run all maintenance crons as apache user" into production [operations/puppet] - 10https://gerrit.wikimedia.org/r/137057 [17:57:25] crap [17:57:46] (03Abandoned) 10Dzahn: Merge "run all maintenance crons as apache user" into production [operations/puppet] - 10https://gerrit.wikimedia.org/r/137057 (owner: 10Dzahn) [17:58:15] Krinkle: when's that want deploying? [17:59:06] (03Abandoned) 10Rush: admin on aluminum [operations/puppet] - 10https://gerrit.wikimedia.org/r/137055 (owner: 10Rush) [17:59:40] (03PS15) 10Dzahn: Move logs to /var/log/mediawiki [operations/puppet] - 10https://gerrit.wikimedia.org/r/83574 (owner: 10Reedy) [18:00:00] (03PS1) 10Rush: clean up fr account stuff in prod for admin [operations/puppet] - 10https://gerrit.wikimedia.org/r/137058 [18:00:04] Reedy, greg-g: The time is nigh to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140603T1800) [18:00:45] (03CR) 10Rush: [C: 032 V: 032] "cleared with jeff green" [operations/puppet] - 10https://gerrit.wikimedia.org/r/137058 (owner: 10Rush) [18:01:47] (03PS1) 10Reedy: All non wikipedias to 1.24wmf7 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/137060 [18:02:08] (03CR) 10Dzahn: [C: 032] Move logs to /var/log/mediawiki [operations/puppet] - 10https://gerrit.wikimedia.org/r/83574 (owner: 10Reedy) [18:03:33] (03CR) 10Reedy: [C: 032] All non wikipedias to 1.24wmf7 [operations/mediawiki-config] - 
10https://gerrit.wikimedia.org/r/137060 (owner: 10Reedy) [18:03:41] (03Merged) 10jenkins-bot: All non wikipedias to 1.24wmf7 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/137060 (owner: 10Reedy) [18:04:29] should the mediawiki logdir be defined in mediawiki/manifests/mwlogdir or in manifests/misc/maintenance .. grmbl [18:05:07] we only really need it on terbium for the crons, but every mw install having it also sounds ok [18:05:38] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: All non wikipedias to 1.24wmf7 [18:05:42] Logged the message, Master [18:06:15] (03PS1) 10Ori.livneh: More refactoring for role::mediawiki [operations/puppet] - 10https://gerrit.wikimedia.org/r/137061 [18:06:59] mutante: mediawiki/manifests imo [18:07:54] !log reedy Synchronized docroot and w: (no message) (duration: 00m 14s) [18:07:58] Logged the message, Master [18:08:21] 1 PHP Warning: inet_pton() [function.inet-pton]: Unrecognized address unknown in /usr/local/apache/common-local/php-1.24wmf7/includes/libs/IPSet.php on line 171 [18:08:23] ori: fair, i just also want a subdir inside it for some crons [18:10:34] (03CR) 10Aaron Schulz: [C: 031] Move Nrpe::Monitor_service[twemproxy] to twemproxy::monitoring [operations/puppet] - 10https://gerrit.wikimedia.org/r/137047 (owner: 10Ori.livneh) [18:10:43] (03PS2) 10Ori.livneh: More refactoring for role::mediawiki [operations/puppet] - 10https://gerrit.wikimedia.org/r/137061 [18:13:12] (03PS2) 10Reedy: Stop sending IRC RC to PMTPA [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/137026 [18:13:24] (03CR) 10Reedy: [C: 032] "[19:11:32] * Connecting to 208.80.152.178:6667..." 
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/137026 (owner: 10Reedy) [18:13:30] (03Merged) 10jenkins-bot: Stop sending IRC RC to PMTPA [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/137026 (owner: 10Reedy) [18:13:56] (03PS1) 10Ori.livneh: role::mediawiki: drop redundant role descriptions [operations/puppet] - 10https://gerrit.wikimedia.org/r/137065 [18:14:38] !log reedy Synchronized wmf-config/: Stop sending IRC RC to PMTPA (duration: 00m 17s) [18:14:42] Logged the message, Master [18:15:04] ori: when is the WS RC stream scheduled to turn on? [18:16:06] Hmmm [18:16:06] That isn't good [18:16:20] Why are we getting APC spam today? [18:16:44] (03PS1) 10Ori.livneh: role::mediawiki: drop $run_jobs_enabled param from role classes [operations/puppet] - 10https://gerrit.wikimedia.org/r/137068 [18:16:52] YuviPanda: when https://gerrit.wikimedia.org/r/#/c/136990/ is merged [18:17:42] (03PS1) 10Dzahn: don't define /var/log/mediawiki in maintenance [operations/puppet] - 10https://gerrit.wikimedia.org/r/137069 [18:17:42] YuviPanda: the servers are up and receiving changes from the mediawikis, just no public ip yet [18:18:16] ori: any plans? [18:18:36] "soon" [18:18:42] plans to do what? merge it? i think _joe_ will do it when he's around. he just went to sleep a couple of hours ago [18:18:57] ori: oh, so I guess it'll be up in, say, a week at worst? [18:19:15] YuviPanda: yeah [18:19:44] soon is soon come http://www.thingsjamaicanslove.com/ramblings/soon_come_what_does_it_really_mean.html [18:19:52] YuviPanda: > now [18:20:31] (03CR) 10Ori.livneh: [C: 031] don't define /var/log/mediawiki in maintenance [operations/puppet] - 10https://gerrit.wikimedia.org/r/137069 (owner: 10Dzahn) [18:21:12] (03CR) 10Dzahn: [C: 032] "thx for review" [operations/puppet] - 10https://gerrit.wikimedia.org/r/137069 (owner: 10Dzahn) [18:21:25] (03CR) 10Krinkle: "A more elaborate commit summary would've been useful. 
Neither the message nor the diff mention anything related to what this is actually a" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/136717 (owner: 10Gilles) [18:21:46] (03PS2) 10Ori.livneh: Removed maxvirtualmemory stuff from jobs-loop [operations/puppet] - 10https://gerrit.wikimedia.org/r/135136 (owner: 10Aaron Schulz) [18:22:40] (03CR) 10Aaron Schulz: [C: 031] More refactoring for role::mediawiki [operations/puppet] - 10https://gerrit.wikimedia.org/r/137061 (owner: 10Ori.livneh) [18:23:16] AaronSchulz: thanks for the CRs [18:23:44] PROBLEM - Puppet freshness on search1006 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 15:22:54 UTC [18:24:40] mutante: filippo already +1'd https://gerrit.wikimedia.org/r/#/c/135136/ , but i am only permitted to merge if it's my change, and this one happens to be aaron's. do you think you could +2 it? [18:24:55] chasemp: Could not find class groups::search for search1006.eqiad.wmnet :/ [18:25:13] mutante: k [18:25:33] should be a straight removal it was only rainman [18:25:34] and he's gone [18:26:24] chasemp: found it.. 
fixing [18:26:38] mutante: sweet thanks [18:26:49] ori: i'll take a look after this [18:27:24] (03CR) 10Ori.livneh: [C: 031] Removed maxvirtualmemory stuff from jobs-loop [operations/puppet] - 10https://gerrit.wikimedia.org/r/135136 (owner: 10Aaron Schulz) [18:30:20] (03PS1) 10Dzahn: remove deleted search admin group from search [operations/puppet] - 10https://gerrit.wikimedia.org/r/137073 [18:31:10] (03CR) 10Dzahn: [C: 032] remove deleted search admin group from search [operations/puppet] - 10https://gerrit.wikimedia.org/r/137073 (owner: 10Dzahn) [18:34:44] PROBLEM - Puppet freshness on search1010 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 15:34:08 UTC [18:34:51] (03CR) 10Krinkle: [C: 031] rcstream: add lvs, modify mountpoint [operations/puppet] - 10https://gerrit.wikimedia.org/r/136990 (owner: 10Giuseppe Lavagetto) [18:35:25] (03CR) 10Krinkle: "ping" [operations/puppet] - 10https://gerrit.wikimedia.org/r/108484 (https://bugzilla.wikimedia.org/54291) (owner: 10Ori.livneh) [18:36:34] chasemp: arrg.. there's also that lsearch user and it wants the group [18:36:59] chasemp: human and system user were in one group..grrrr [18:37:13] ah [18:37:23] so now dependency Group[search] for User[lsearch] [18:38:08] generic::systemuser should create the default group...? [18:38:31] generic::systemuser { 'lsearch': [18:38:37] .. 
[18:38:40] default_group => 'search', [18:38:45] ah [18:38:46] it's [18:38:52] if they are named the same yeah [18:39:46] ideally lsearch has PUG of lsearch [18:39:57] owner => 'lsearch', [18:40:04] group => 'search', [18:40:34] makes another change [18:40:44] PROBLEM - Puppet freshness on search1024 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 15:40:07 UTC [18:40:59] search group is used for nothing else [18:41:01] that I can see [18:41:03] there [18:44:13] (03PS1) 10Dzahn: use lsearch as group for generic system user [operations/puppet] - 10https://gerrit.wikimedia.org/r/137078 [18:44:44] PROBLEM - Puppet freshness on search1012 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 15:44:00 UTC [18:45:07] (03CR) 10Rush: [C: 031] "cool" [operations/puppet] - 10https://gerrit.wikimedia.org/r/137078 (owner: 10Dzahn) [18:45:43] (03CR) 10Dzahn: [C: 032] use lsearch as group for generic system user [operations/puppet] - 10https://gerrit.wikimedia.org/r/137078 (owner: 10Dzahn) [18:45:44] PROBLEM - Puppet freshness on search1001 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 15:44:36 UTC [18:46:16] ^ that should fix the freshness checks soon.. waiting for jenkins [18:46:28] <^d> Thanks for the merge on searchidx* from the scap groups. [18:46:44] PROBLEM - Puppet freshness on search1007 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 15:45:42 UTC [18:46:44] PROBLEM - Puppet freshness on search1003 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 15:45:37 UTC [18:47:22] mutante: still no go I think :) [18:47:23] crap [18:47:48] ^d: ^ fyi.. we removed the old "search" group that just had rainman in it.. just some follow-up fixes now because it used "search" as group for user "lsearch" [18:48:09] (03PS1) 10Yuvipanda: toollabs: Add role for mongodb [operations/puppet] - 10https://gerrit.wikimedia.org/r/137079 [18:48:10] chasemp: no wait.. 
[18:48:11] Coren: ^ minor [18:48:34] RECOVERY - Puppet freshness on search1006 is OK: puppet ran at Tue Jun 3 18:48:31 UTC 2014 [18:48:44] PROBLEM - Puppet freshness on search1005 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 15:47:38 UTC [18:48:44] RECOVERY - Puppet freshness on search1005 is OK: puppet ran at Tue Jun 3 18:48:41 UTC 2014 [18:48:48] <^d> mutante: Yeah I saw the patches, sounds good. [18:49:02] (03CR) 10coren: [C: 032] "Trivial enough." [operations/puppet] - 10https://gerrit.wikimedia.org/r/137079 (owner: 10Yuvipanda) [18:49:13] Coren: ty [18:49:36] * Coren "patiently" waits for Jenkins. [18:50:24] RECOVERY - Disk space on searchidx1001 is OK: DISK OK [18:50:30] <^d> mutante: Fixed for good now ^ [18:50:45] <^d> searchidx now has 36% free space on / [18:50:53] <^d> And we won't fill it up with silly scaps anymore. [18:50:59] ^d: great:) [18:53:44] PROBLEM - Puppet freshness on search1011 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 15:52:51 UTC [18:53:44] PROBLEM - Puppet freshness on search1016 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 15:52:46 UTC [18:53:44] RECOVERY - Puppet freshness on search1011 is OK: puppet ran at Tue Jun 3 18:53:39 UTC 2014 [18:53:54] RECOVERY - Puppet freshness on search1016 is OK: puppet ran at Tue Jun 3 18:53:49 UTC 2014 [18:54:34] RECOVERY - Puppet freshness on search1001 is OK: puppet ran at Tue Jun 3 18:54:24 UTC 2014 [18:54:44] PROBLEM - Puppet freshness on search1004 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 15:53:47 UTC [18:54:44] RECOVERY - Puppet freshness on search1004 is OK: puppet ran at Tue Jun 3 18:54:39 UTC 2014 [18:54:54] RECOVERY - Puppet freshness on search1003 is OK: puppet ran at Tue Jun 3 18:54:44 UTC 2014 [18:55:04] RECOVERY - Puppet freshness on search1007 is OK: puppet ran at Tue Jun 3 18:54:59 UTC 2014 [18:56:19] (03CR) 10Dzahn: [C: 032] Removed maxvirtualmemory stuff from jobs-loop [operations/puppet] - 
10https://gerrit.wikimedia.org/r/135136 (owner: 10Aaron Schulz) [18:56:24] RECOVERY - Puppet freshness on search1010 is OK: puppet ran at Tue Jun 3 18:56:15 UTC 2014 [18:56:24] RECOVERY - Puppet freshness on search1012 is OK: puppet ran at Tue Jun 3 18:56:15 UTC 2014 [18:56:59] ori: there, per existing reviews [18:57:06] gotta get lunch [18:58:21] ah, i'll check the cron jobs on terbium [18:58:44] PROBLEM - Puppet freshness on search1002 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 15:57:58 UTC [18:58:44] PROBLEM - Puppet freshness on search1009 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 15:58:18 UTC [18:58:44] RECOVERY - Puppet freshness on search1002 is OK: puppet ran at Tue Jun 3 18:58:37 UTC 2014 [18:58:44] RECOVERY - Puppet freshness on search1009 is OK: puppet ran at Tue Jun 3 18:58:42 UTC 2014 [18:59:56] (03CR) 10Dzahn: [C: 04-2] "solved via admin yaml researchers group. this can be abandoned. see latest ticket updates" [operations/puppet] - 10https://gerrit.wikimedia.org/r/122401 (owner: 10Ottomata) [19:00:03] (03PS1) 10Rush: file_mover user to generic::systemuser [operations/puppet] - 10https://gerrit.wikimedia.org/r/137087 [19:01:21] (03CR) 10jenkins-bot: [V: 04-1] file_mover user to generic::systemuser [operations/puppet] - 10https://gerrit.wikimedia.org/r/137087 (owner: 10Rush) [19:01:44] PROBLEM - Puppet freshness on search1015 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 16:00:47 UTC [19:01:44] RECOVERY - Puppet freshness on search1015 is OK: puppet ran at Tue Jun 3 19:01:41 UTC 2014 [19:01:44] RECOVERY - Puppet freshness on search1024 is OK: puppet ran at Tue Jun 3 19:01:41 UTC 2014 [19:03:04] (03PS1) 10Yuvipanda: toollabs: Remove unused and empty webproxy role [operations/puppet] - 10https://gerrit.wikimedia.org/r/137088 [19:03:05] Coren: ^ [19:05:44] PROBLEM - Puppet freshness on search1018 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 16:04:34 UTC [19:05:44] RECOVERY - Puppet freshness 
on search1018 is OK: puppet ran at Tue Jun 3 19:05:36 UTC 2014 [19:07:15] (03PS2) 10Rush: file_mover user to generic::systemuser [operations/puppet] - 10https://gerrit.wikimedia.org/r/137087 [19:07:44] PROBLEM - Puppet freshness on search1019 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 16:07:00 UTC [19:07:44] RECOVERY - Puppet freshness on search1019 is OK: puppet ran at Tue Jun 3 19:07:38 UTC 2014 [19:09:44] PROBLEM - Puppet freshness on search1014 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 16:09:08 UTC [19:10:04] RECOVERY - Puppet freshness on search1014 is OK: puppet ran at Tue Jun 3 19:09:55 UTC 2014 [19:11:09] (03CR) 10coren: [C: 032] "Ex-term-mi-nate!" [operations/puppet] - 10https://gerrit.wikimedia.org/r/137088 (owner: 10Yuvipanda) [19:11:17] Coren: haha! :) [19:11:43] commonswiki: [c27683fc] /w/index.php?title=File:UnderwoodKeyboard.jpg&action=submit Exception from line 161 of /usr/local/apache/common-local/php-1.24wmf6/extensions/ConfirmEdit/FancyCaptcha.class.php: Ran out of captcha images [19:11:45] hrm, odd [19:13:41] Uhhhh [19:13:47] So tin is rejecting my public key [19:13:57] are you sure you are marktraceur, marktraceur? [19:13:57] bast1001 seems happy with it [19:14:10] YuviPanda: I'm definitely a bit wonky today but I'm absolutely marktraceur [19:15:07] I have a deploy window in less than 2 hours and I can't SSH to the deploy bastion >.< [19:16:00] marktraceur: Maybe it is a sign? 
:p [19:16:01] (03PS3) 10Krinkle: Set $wgIncludejQueryMigrate = true; for all wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/137053 (https://bugzilla.wikimedia.org/44740) [19:16:22] (03PS4) 10Krinkle: Set $wgIncludejQueryMigrate = true; for all wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/137053 (https://bugzilla.wikimedia.org/44740) [19:16:24] JohnLewis: Possibly, maybe the deployment gods are telling me that I'm doing the wrong thing [19:17:25] After the deploy window is over, your key will start working again :D [19:17:29] mutante: danke [19:17:35] https://dpaste.de/obVC is the relevant config [19:17:53] Copied it from my laptop backup which was working before it died [19:18:11] The laptop died, that is [19:18:58] And I've had no trouble SSHing other places (and as above, bast1001 seems happy) [19:19:45] I will try a different machine, I guess [19:20:44] PROBLEM - Puppet freshness on tungsten is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 13:19:16 UTC [19:21:57] tungsten puppet freshness is due to _joe_'s job queue monitoring patch. i fixed it in https://gerrit.wikimedia.org/r/#/c/137043/ , if someone wants to review [19:22:44] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 16:21:49 UTC [19:23:32] (03CR) 10Ori.livneh: "I think that perhaps we should simply have the script write to a nonstandard FD for the time being" [operations/puppet] - 10https://gerrit.wikimedia.org/r/135133 (owner: 10Aaron Schulz) [19:24:01] (03CR) 10coren: [C: 031] "Not insane." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/137068 (owner: 10Ori.livneh) [19:24:14] (03PS2) 10Ori.livneh: role::mediawiki: drop $run_jobs_enabled param from role classes [operations/puppet] - 10https://gerrit.wikimedia.org/r/137068 [19:24:19] (03CR) 10Ori.livneh: [C: 032 V: 032] role::mediawiki: drop $run_jobs_enabled param from role classes [operations/puppet] - 10https://gerrit.wikimedia.org/r/137068 (owner: 10Ori.livneh) [19:25:52] (03CR) 10coren: "Why redundant? The class names are fairly transparent, arguably, but I'm not sure what is gained by removing the human-readable descripti" [operations/puppet] - 10https://gerrit.wikimedia.org/r/137065 (owner: 10Ori.livneh) [19:27:15] (03CR) 10Ori.livneh: "What is gained by keeping them? As you say, the role names are transparent. All code is cruft until proven otherwise. :)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/137065 (owner: 10Ori.livneh) [19:29:41] ori: I don't feel strongly about it, but given that not all class names are that transparent I'm pretty sure that "have a description meant for humans" is good practice; and that it's better to be consistent. 
[19:30:11] (03CR) 10coren: [C: 031] "Simple > complicated" [operations/puppet] - 10https://gerrit.wikimedia.org/r/137061 (owner: 10Ori.livneh) [19:30:27] (03CR) 10Ori.livneh: "(As for what is gained: I prefer concision, and the ability to apprehend all relevant details in a screenful of code!)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/137065 (owner: 10Ori.livneh) [19:30:35] Coren: np, i understand your thinking and it makes sense [19:30:59] (03PS3) 10Ori.livneh: More refactoring for role::mediawiki [operations/puppet] - 10https://gerrit.wikimedia.org/r/137061 [19:31:45] (03CR) 10Ori.livneh: [C: 032 V: 032] More refactoring for role::mediawiki [operations/puppet] - 10https://gerrit.wikimedia.org/r/137061 (owner: 10Ori.livneh) [19:32:25] (03CR) 10coren: [C: 031] "I don't feel strongly about it either way; and there is certainly no technical obstacle to doing so, hence +1 even though I wouldn't have " [operations/puppet] - 10https://gerrit.wikimedia.org/r/137065 (owner: 10Ori.livneh) [19:33:12] (03CR) 10coren: [C: 031] "Factoring FTW" [operations/puppet] - 10https://gerrit.wikimedia.org/r/137047 (owner: 10Ori.livneh) [19:33:51] (03CR) 10coren: [C: 031] "Trivial." [operations/puppet] - 10https://gerrit.wikimedia.org/r/137045 (owner: 10Ori.livneh) [19:34:05] Coren: <3 <3 [19:34:24] OK, new laptop SSHes without issue [19:35:21] (03CR) 10coren: "The puppet seems good, but my deployment-fu is too weak to judge the impact." [operations/puppet] - 10https://gerrit.wikimedia.org/r/136920 (owner: 10Ori.livneh) [19:37:28] (03CR) 10coren: [C: 031] "Moar factoringz!" [operations/puppet] - 10https://gerrit.wikimedia.org/r/136830 (owner: 10Ori.livneh) [19:39:55] Coren: hugely appreciated, thank you [19:40:22] ori: Sorry I didn't +1 the deployment one, but I don't trust my evaluation of its impact. 
[19:40:51] Coren: no problem at all [19:49:15] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 14.29% of data exceeded the critical threshold [500.0] [20:00:58] PROBLEM - Puppet freshness on mw1051 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 19:58:14 UTC [20:02:58] PROBLEM - Puppet freshness on mw1051 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 19:58:14 UTC [20:03:18] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% data above the threshold [250.0] [20:04:58] PROBLEM - Puppet freshness on mw1051 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 19:58:14 UTC [20:05:10] (03CR) 10Jforrester: [C: 031] Set $wgIncludejQueryMigrate = true; for all wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/137053 (https://bugzilla.wikimedia.org/44740) (owner: 10Krinkle) [20:06:58] PROBLEM - Puppet freshness on mw1051 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 19:58:14 UTC [20:08:58] PROBLEM - Puppet freshness on mw1051 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 19:58:14 UTC [20:09:40] (03PS3) 10Ori.livneh: Move Apache gmond module to ::apache::monitoring; pep8 fixes [operations/puppet] - 10https://gerrit.wikimedia.org/r/136830 [20:09:49] (03CR) 10Ori.livneh: [C: 032 V: 032] Move Apache gmond module to ::apache::monitoring; pep8 fixes [operations/puppet] - 10https://gerrit.wikimedia.org/r/136830 (owner: 10Ori.livneh) [20:10:08] RECOVERY - Puppet freshness on mw1051 is OK: puppet ran at Tue Jun 3 20:10:00 UTC 2014 [20:10:41] osmium warwiki: Could not unserialize cirrusSearchLinksUpdatePrioritized job. [20:10:45] lots of that kind of spam [20:11:11] (03CR) 10Ori.livneh: [C: 031] rcstream: add DNS records for stream.w.o [operations/dns] - 10https://gerrit.wikimedia.org/r/136983 (owner: 10Giuseppe Lavagetto) [20:11:59] ori: are you running anything there? 
[20:12:18] AaronSchulz: no, but there are things running, wtf [20:12:18] will stop [20:12:58] PROBLEM - Puppet freshness on mw1051 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 20:11:10 UTC [20:14:58] PROBLEM - Puppet freshness on mw1051 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 20:11:10 UTC [20:14:58] RECOVERY - Puppet freshness on mw1051 is OK: puppet ran at Tue Jun 3 20:14:51 UTC 2014 [20:15:25] (03PS2) 10Ori.livneh: Move Nrpe::Monitor_service[twemproxy] to twemproxy::monitoring [operations/puppet] - 10https://gerrit.wikimedia.org/r/137047 [20:16:00] (03CR) 10Ori.livneh: [C: 032 V: 032] Move Nrpe::Monitor_service[twemproxy] to twemproxy::monitoring [operations/puppet] - 10https://gerrit.wikimedia.org/r/137047 (owner: 10Ori.livneh) [20:16:58] PROBLEM - Puppet freshness on mw1051 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 20:14:51 UTC [20:18:58] PROBLEM - Puppet freshness on mw1051 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 20:14:51 UTC [20:19:03] (03CR) 10Dzahn: [C: 032] "so we don't have 2 copies of it. this is the one for pmtpa, but on fenari it's not managed by puppet" [operations/puppet] - 10https://gerrit.wikimedia.org/r/137051 (owner: 10Dzahn) [20:20:27] (03PS1) 10Gilles: Make enwiki sampling more conservative [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/137155 [20:20:58] PROBLEM - Puppet freshness on mw1051 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 20:14:51 UTC [20:22:58] PROBLEM - Puppet freshness on mw1051 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 20:14:51 UTC [20:23:43] (03PS2) 10Gilles: Make enwiki MediaViewer EventLogging sampling more conservative [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/137155 [20:23:44] i'm not sure what is up with 1051 [20:23:50] puppet runs are succeeding [20:24:04] paravoid: ping, since you're on RT.. [20:24:42] ori: how close is twemproxy in osmium? 
[20:24:58] PROBLEM - Puppet freshness on mw1051 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 20:14:51 UTC [20:25:45] AaronSchulz: next on p-void's agenda, afaik. he said so in the weekly ops mtng yesterday [20:25:49] he's a busy man :) [20:25:57] (03CR) 10Dzahn: [C: 031] Move OCG default port to 8000 [operations/puppet] - 10https://gerrit.wikimedia.org/r/134975 (owner: 10Mwalker) [20:26:06] "Notice: File could not be loaded: ../../../../../a/common/multiversion/MWScript.php" [20:26:08] gah [20:26:58] PROBLEM - Puppet freshness on mw1051 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 20:14:51 UTC [20:27:41] AaronSchulz: where? [20:27:54] osmium, running mwscript [20:27:58] RECOVERY - Puppet freshness on mw1051 is OK: puppet ran at Tue Jun 3 20:27:57 UTC 2014 [20:30:00] (03CR) 10Dzahn: [C: 032] Typofix [operations/puppet] - 10https://gerrit.wikimedia.org/r/135582 (owner: 10Nemo bis) [20:30:42] (03PS2) 10Dzahn: Completely remove misc::maintenance::geodata [operations/puppet] - 10https://gerrit.wikimedia.org/r/135623 (owner: 10MaxSem) [20:31:00] (03CR) 10jenkins-bot: [V: 04-1] Completely remove misc::maintenance::geodata [operations/puppet] - 10https://gerrit.wikimedia.org/r/135623 (owner: 10MaxSem) [20:31:34] mutante, you need to deploy that patch's dependency [20:31:40] on terbium [20:34:00] (03CR) 10Dzahn: Kill GeoData Solr, decom servers (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/133886 (owner: 10MaxSem) [20:35:54] (03CR) 10Hashar: "recheck" [operations/puppet] - 10https://gerrit.wikimedia.org/r/135623 (owner: 10MaxSem) [20:37:55] MaxSem: the netboot.cfg line should just be removed entirely [20:38:43] it might break the bash case in partman [20:38:55] can amend though [20:39:30] (03PS1) 10Ori.livneh: Clear up duplicate /var/log/mediawiki/wikidata dir [operations/puppet] - 10https://gerrit.wikimedia.org/r/137165 [20:39:53] mutante: followup to your/reedy's changes ^ [20:40:13] (03PS2) 10Ori.livneh: 
mediawiki::sync: Drop File['/apache'] [operations/puppet] - 10https://gerrit.wikimedia.org/r/137045 [20:40:22] (03CR) 10Ori.livneh: [C: 032 V: 032] mediawiki::sync: Drop File['/apache'] [operations/puppet] - 10https://gerrit.wikimedia.org/r/137045 (owner: 10Ori.livneh) [20:40:47] (03CR) 10Dzahn: Clear up duplicate /var/log/mediawiki/wikidata dir (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/137165 (owner: 10Ori.livneh) [20:41:20] mutante: owner => apache then? [20:41:47] !log Jenkins repacking gerritslave replicas on gallium and lanthanum. Running in screen as hashar -> gerritslave [20:41:51] Logged the message, Master [20:41:59] !log repack command: find /srv/ssd/gerrit/ -type d -name '*.git' -print -exec git --git-dir="{}" repack -afd \; -exec git --git-dir="{}" pack-refs --all \; [20:42:03] Logged the message, Master [20:42:06] ori: i think so, because now apache runs all the crons [20:42:54] ori: apache vs. wikidev etc. is per https://wikitech.wikimedia.org/wiki/UID [20:43:10] the "permission/security hierarchy" section [20:43:37] "scripts owned by mwdeploy can only be run by apache" hmmm [20:43:57] mutante: makes sense [20:44:07] hmm @ who owns the directory [20:44:14] and owner vs group.. [20:44:23] for the logfiles of those scripts [20:44:45] (03PS3) 10Rush: file_mover user to generic::systemuser [operations/puppet] - 10https://gerrit.wikimedia.org/r/137087 [20:44:50] (03CR) 10Rush: [C: 032 V: 032] file_mover user to generic::systemuser [operations/puppet] - 10https://gerrit.wikimedia.org/r/137087 (owner: 10Rush) [20:45:51] (03PS2) 10Ori.livneh: Clear up duplicate /var/log/mediawiki/wikidata dir [operations/puppet] - 10https://gerrit.wikimedia.org/r/137165 [20:46:18] (03PS6) 10Gage: initial debianization [operations/debs/python-statsd] - 10https://gerrit.wikimedia.org/r/131449 [20:46:36] Attention, attention: dewiki is going to get media viewer turned on by default in about 15 minutes. Shortly thereafter, enwiki will get the same. 
Thank you for observing all safety precautions. [20:46:53] jgage: \o/ :) [20:46:54] * MatmaRex protects WP:VPT [20:47:10] (just kidding, i actually don't have the rights to do that :( ) [20:47:17] mutante: amended [20:47:40] (03PS1) 10Rush: file_mover uid/gid fix [operations/puppet] - 10https://gerrit.wikimedia.org/r/137167 [20:47:47] (03PS2) 10Rush: file_mover uid/gid fix [operations/puppet] - 10https://gerrit.wikimedia.org/r/137167 [20:47:49] (03PS5) 10MaxSem: Kill GeoData Solr, decom servers [operations/puppet] - 10https://gerrit.wikimedia.org/r/133886 [20:47:51] (03CR) 10Rush: [C: 032 V: 032] file_mover uid/gid fix [operations/puppet] - 10https://gerrit.wikimedia.org/r/137167 (owner: 10Rush) [20:48:05] mutante, ^^^ [20:48:31] (03CR) 10Dzahn: Clear up duplicate /var/log/mediawiki/wikidata dir (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/137165 (owner: 10Ori.livneh) [20:48:50] ori: 0644 instead i think.. commented [20:49:01] mutante: Are you guys doing things that might interfere with a config sync? 
It looks like no but I don't want to tempt fate [20:49:03] MaxSem: thanks, i _just_ wanted to upload that and rebasing [20:49:13] (03CR) 10Gage: "* Renamed package, binary, and user to pystatsd" [operations/debs/python-statsd] - 10https://gerrit.wikimedia.org/r/131449 (owner: 10Gage) [20:49:55] mutante: no, nothing [20:49:57] marktraceur: if you mean the logdir, no that should not influence a config sync [20:50:03] that was at marktraceur [20:50:13] 'kay thanks both [20:50:45] (03PS3) 10Ori.livneh: Clear up duplicate /var/log/mediawiki/wikidata dir [operations/puppet] - 10https://gerrit.wikimedia.org/r/137165 [20:50:55] mutante: amended :) thanks for reviewing [20:51:22] (03CR) 10Dzahn: [C: 031] Clear up duplicate /var/log/mediawiki/wikidata dir [operations/puppet] - 10https://gerrit.wikimedia.org/r/137165 (owner: 10Ori.livneh) [20:51:42] (03CR) 10Ori.livneh: [C: 032 V: 032] Clear up duplicate /var/log/mediawiki/wikidata dir [operations/puppet] - 10https://gerrit.wikimedia.org/r/137165 (owner: 10Ori.livneh) [20:51:50] !log Disabled GeoData updates on terbium [20:51:55] Logged the message, Master [20:52:29] (03PS6) 10Dzahn: Kill GeoData Solr, decom servers [operations/puppet] - 10https://gerrit.wikimedia.org/r/133886 (owner: 10MaxSem) [20:53:10] (03CR) 10Dzahn: [C: 032] Kill GeoData Solr, decom servers [operations/puppet] - 10https://gerrit.wikimedia.org/r/133886 (owner: 10MaxSem) [20:53:29] Just TTM to go! [20:54:18] (03PS2) 10Ori.livneh: role::mediawiki: drop redundant role descriptions [operations/puppet] - 10https://gerrit.wikimedia.org/r/137065 [20:54:35] Reedy: did you notice the log change was merged after being revived from the dead?:) [20:55:40] (03CR) 10Ori.livneh: [C: 032 V: 032] role::mediawiki: drop redundant role descriptions [operations/puppet] - 10https://gerrit.wikimedia.org/r/137065 (owner: 10Ori.livneh) [20:55:43] Yeah! [20:55:47] :) [20:57:02] greg-g: Confirming that we'll be hitting some big red buttons in about 3 minutes here. Just FYI. 
[20:57:11] (03CR) 10Dzahn: "recheck" [operations/puppet] - 10https://gerrit.wikimedia.org/r/133886 (owner: 10MaxSem) [20:57:21] marktraceur: k, enjoy [21:00:01] * marktraceur goes [21:00:04] marktraceur: The time is nigh to deploy Media Viewer (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140603T2100) [21:00:04] bsitu: The time is nigh to deploy Flow (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140603T2100) [21:00:09] (03CR) 10MarkTraceur: [C: 032] Launch Media Viewer for all users on German wikipedia [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/134811 (owner: 10Gilles) [21:00:49] simudeploy [21:00:52] So...bsitu isn't here [21:02:31] er [21:02:49] they don't need it [21:03:00] where 'they' == flow [21:03:24] Doesn't jenkins merge stuff in mediawiki-config yet? [21:03:39] he should [21:03:44] (03PS1) 10Rush: admins.pp remove unused groupings and service vol 1 [operations/puppet] - 10https://gerrit.wikimedia.org/r/137172 [21:03:52] (and has for ages) [21:04:11] * marktraceur waits then [21:04:12] (03PS2) 10Rush: admins.pp remove unused groupings and service vol 1 [operations/puppet] - 10https://gerrit.wikimedia.org/r/137172 [21:05:05] Queue lengths: 33 events, 111 results. [21:05:10] marktraceur: You might be waiting a while [21:05:23] (03PS1) 10QChris: Make dbstore1002 handle s2 analytics queries [operations/dns] - 10https://gerrit.wikimedia.org/r/137174 (https://bugzilla.wikimedia.org/66068) [21:05:32] Go figure [21:05:50] (03PS4) 10Reedy: Launch Media Viewer for all users on German wikipedia [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/134811 (owner: 10Gilles) [21:05:58] (03CR) 10Reedy: [C: 032 V: 032] "Jenkins is too busy" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/134811 (owner: 10Gilles) [21:06:29] Heh [21:06:32] Reedy > Jenkins [21:06:35] it's busy with [21:06:51] https://gerrit.wikimedia.org/r/#/c/137065/ for some reason [21:07:03] and tssk tssk.. 
it was merged before jenkins was done [21:07:32] yeah it receives a bunch of patches [21:07:40] i will probably move zuul to another host [21:07:54] hashar just pointed me to https://integration.wikimedia.org/zuul/ again , rightfully [21:08:02] !log marktraceur updated /a/common to {{Gerrit|Ie237b0ae1}}: Launch Media Viewer for all users on German wikipedia [21:08:03] Yeah that's what I was watching [21:08:07] Logged the message, Master [21:08:19] PROBLEM - Disk space on gallium is CRITICAL: DISK CRITICAL - free space: /var/lib/jenkins-slave/tmpfs 16 MB (3% inode=99%): [21:08:22] :p [21:08:24] ahh [21:08:33] I want to sleep [21:08:49] !log marktraceur Synchronized mediaviewer.dblist: Add dewiki to the on-by-default list for Media Viewer (duration: 00m 06s) [21:08:54] Logged the message, Master [21:08:55] Oooh, the sync-file output got prettier [21:09:01] I assume a "thanks bd808" is in order [21:09:14] yw! [21:09:42] !log marktraceur Synchronized wmf-config/InitialiseSettings.php: Touch InitialiseSettings.php because that's what we do (duration: 00m 06s) [21:09:43] all praise the benevolent bd808 for the many deployment related things [21:09:44] #helpfullogentries [21:09:46] (03CR) 10Rush: [C: 032] admins.pp remove unused groupings and service vol 1 [operations/puppet] - 10https://gerrit.wikimedia.org/r/137172 (owner: 10Rush) [21:09:47] Logged the message, Master [21:09:48] (03PS1) 10Jgreen: add dash.frdev.wm.o cname [operations/dns] - 10https://gerrit.wikimedia.org/r/137175 [21:10:13] may we be grateful for the bounty he has bestowed upon us [21:10:19] RECOVERY - Disk space on gallium is OK: DISK OK [21:10:22] hashar: how can we help? delete oldes files? [21:10:24] ah [21:10:40] yeah anything older than a day can be deleted [21:10:42] marktraceur: FYI InitialiseSettings.php is automatically touched by sync-file, sync-dir and scap now [21:10:47] or even older than an hour [21:10:54] Attention, attention, dewiki now has Media Viewer switched on by default. 
Get ready for the hordes. enwiki happens in 15 minutes. [21:10:55] hashar: i'll do it if it happens again, enjoy sleep now [21:11:02] thanks [21:11:07] bd808: I will hug you when I am done deploying [21:11:57] (03CR) 10MarkTraceur: [C: 032] Launch Media Viewer for all users on English wikipedia [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/134812 (owner: 10Gilles) [21:11:59] (03CR) 10MarkTraceur: [C: 032] Make enwiki MediaViewer EventLogging sampling more conservative [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/137155 (owner: 10Gilles) [21:12:01] (03CR) 10Jgreen: [C: 032 V: 031] add dash.frdev.wm.o cname [operations/dns] - 10https://gerrit.wikimedia.org/r/137175 (owner: 10Jgreen) [21:12:08] marktraceur: he's local, so you can! [21:12:10] I don't need these to be fast, happily [21:12:14] greg-g: Exactly! [21:12:17] (03CR) 10jenkins-bot: [V: 04-1] Launch Media Viewer for all users on English wikipedia [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/134812 (owner: 10Gilles) [21:12:18] There shall be no escaping [21:12:22] ....wat [21:12:30] (03PS4) 10MarkTraceur: Launch Media Viewer for all users on English wikipedia [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/134812 (owner: 10Gilles) [21:12:44] Trivial bloody rebase [21:13:06] (03CR) 10MarkTraceur: Launch Media Viewer for all users on English wikipedia [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/134812 (owner: 10Gilles) [21:13:19] (03CR) 10MarkTraceur: [C: 032] "Stupid robots" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/134812 (owner: 10Gilles) [21:13:20] MaxSem: now it's actually merged [21:13:21] ok sleeping for now [21:13:30] (03Merged) 10jenkins-bot: Make enwiki MediaViewer EventLogging sampling more conservative [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/137155 (owner: 10Gilles) [21:13:50] silly jenkins [21:14:25] Error 400 on SERVER: Could not find template 
'misc/update-geodata.erb [21:14:28] hrmm [21:14:40] at /etc/puppet/manifests/misc/maintenance.pp:308 [21:15:38] (03Merged) 10jenkins-bot: Launch Media Viewer for all users on English wikipedia [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/134812 (owner: 10Gilles) [21:17:13] !log marktraceur updated /a/common to {{Gerrit|I549906510}}: Launch Media Viewer for all users on English wikipedia [21:17:17] Logged the message, Master [21:17:44] here. we. go! [21:17:55] mutante, let me make a temporary plug [21:18:13] marktraceur: congrats! :) [21:18:19] Not synced yet [21:18:37] marktraceur: oh, right [21:18:47] Syncing the throttling change first [21:18:55] !log marktraceur Synchronized wmf-config/InitialiseSettings.php: Throttle the MMV event logging a bit more for the launch today (duration: 00m 06s) [21:19:00] Logged the message, Master [21:19:51] (03PS1) 10Ori.livneh: Migrate beta appserver configs to role::beta::appserver [operations/puppet] - 10https://gerrit.wikimedia.org/r/137177 [21:20:44] just browsed some picture pages on dewiki. Looks like this has been a great job, lads. (though progressive loading could be a little faster) [21:21:33] (03PS1) 10MaxSem: Fix maintenance::geodata [operations/puppet] - 10https://gerrit.wikimedia.org/r/137178 [21:21:38] mutante, ^^ [21:22:23] hedonil: Thanks :) [21:22:36] (03CR) 10Dzahn: [C: 032] Fix maintenance::geodata [operations/puppet] - 10https://gerrit.wikimedia.org/r/137178 (owner: 10MaxSem) [21:22:43] MaxSem: thanks, way simpler as well [21:22:55] marktraceur: yw. looks great, feels good. [21:23:27] hedonil: Standby for the enwiki launch :) [21:24:58] OK, here we go :) [21:25:08] !log marktraceur Synchronized mediaviewer.dblist: Enable media viewer by default on enwiki (duration: 00m 06s) [21:25:13] Logged the message, Master [21:26:20] Looks like no issues [21:26:28] PROBLEM - ElasticSearch health check on elastic1012 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. 
status: red: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 2154: active_shards: 6461: relocating_shards: 0: initializing_shards: 1: unassigned_shards: 1 [21:27:15] MaxSem: crons are removed now [21:27:28] mutante, whee [21:27:29] RECOVERY - ElasticSearch health check on elastic1012 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 2154: active_shards: 6461: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [21:27:46] rebasing the other change... [21:27:57] MaxSem: cool! [21:29:13] (03CR) 10Aaron Schulz: [C: 031] remove pmtpa app server monitor_groups [operations/puppet] - 10https://gerrit.wikimedia.org/r/136956 (owner: 10Ori.livneh) [21:29:59] (03PS1) 10Rush: generate list of _would_ be removed accounts [operations/puppet] - 10https://gerrit.wikimedia.org/r/137179 [21:30:16] (03CR) 10Dzahn: [C: 032] "yes, please "really dont want to give NodeJS root" and nothing hits the actual server right now" [operations/puppet] - 10https://gerrit.wikimedia.org/r/134975 (owner: 10Mwalker) [21:31:00] (03PS2) 10Rush: generate list of _would_ be removed accounts [operations/puppet] - 10https://gerrit.wikimedia.org/r/137179 [21:31:16] PROBLEM - Puppet freshness on db1007 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 18:30:11 UTC [21:31:23] (03CR) 10Rush: [C: 032 V: 032] "no invasive action" [operations/puppet] - 10https://gerrit.wikimedia.org/r/137179 (owner: 10Rush) [21:31:25] (03CR) 10Dzahn: [C: 031] remove pmtpa app server monitor_groups [operations/puppet] - 10https://gerrit.wikimedia.org/r/136956 (owner: 10Ori.livneh) [21:31:52] (03PS2) 10Ori.livneh: remove pmtpa app server monitor_groups [operations/puppet] - 10https://gerrit.wikimedia.org/r/136956 [21:32:00] (03CR) 10Ori.livneh: [C: 032 V: 032] remove pmtpa app server monitor_groups [operations/puppet] - 
10https://gerrit.wikimedia.org/r/136956 (owner: 10Ori.livneh) [21:32:46] paravoid: are our swift backends in different zones, or all in one? (Docs imply that zones should be in different rows or racks or dcs) [21:34:51] (03CR) 10Dzahn: [C: 031] scap: ensure=>absent /usr/local/bin/sync-common-file [operations/puppet] - 10https://gerrit.wikimedia.org/r/135924 (owner: 10BryanDavis) [21:36:46] mediaviewer on wikitech? [21:37:13] mutante: Is it? [21:37:15] lol [21:37:25] (03PS2) 10Ori.livneh: Migrate beta appserver configs to role::beta::appserver [operations/puppet] - 10https://gerrit.wikimedia.org/r/137177 [21:37:51] no, i'm wondering if you want to add it. [21:38:03] then we have galleries for the dc pictures [21:38:13] 1.23wmf22 [21:38:15] tut tut [21:38:22] you'd better upgrade it first ;) [21:38:31] :p [21:39:33] (03PS1) 10MaxSem: Completely remove misc::maintenance::geodata [operations/puppet] - 10https://gerrit.wikimedia.org/r/137192 [21:39:44] I wouldn't object [21:40:04] mutante, ^ [21:40:19] somehow managed to do it in a new change [21:40:37] (03Abandoned) 10MaxSem: Completely remove misc::maintenance::geodata [operations/puppet] - 10https://gerrit.wikimedia.org/r/135623 (owner: 10MaxSem) [21:41:30] andrewbogott: sanity-check for https://gerrit.wikimedia.org/r/#/c/136920/ ? (hint: it's sane ;)) [21:41:44] (03PS1) 10Ori.livneh: rename role::mediawiki::job_runner -> role::mediawiki::jobrunner [operations/puppet] - 10https://gerrit.wikimedia.org/r/137193 [21:42:59] (03CR) 10BryanDavis: [C: 031] "Cherry-picked and applied in beta. THings still work! :)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/137177 (owner: 10Ori.livneh) [21:43:16] ori: probably best to get a review from someone who has ever looked at the deployment system :) [21:43:34] andrewbogott: that's me and bryan [21:43:40] i wrote the patch, bryan already +1'd [21:43:52] ok... 
[21:44:30] I'm going to quick update the config for betalabs so se4598 can have mmv on betadewiki [21:45:18] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 14.29% of data exceeded the critical threshold [500.0] [21:45:19] andrewbogott: change https://gerrit.wikimedia.org/r/#/c/137177/ is also beta in scope and bryan already cherry-picked/applied it in labs and confirmed that it works [21:47:03] ori: When you say 'not in use' do you mean not applied? Or applied but not, um… used? [21:47:17] PROBLEM - Puppet freshness on db1009 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 18:46:58 UTC [21:47:24] (03PS1) 10Bartosz Dziewoński: Replace the Nostalgia extension with the Nostalgia skin [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/137200 (https://bugzilla.wikimedia.org/61256) [21:47:36] (03CR) 10Dzahn: [C: 032] "yep, crons already ensured absent on terbium" [operations/puppet] - 10https://gerrit.wikimedia.org/r/137192 (owner: 10MaxSem) [21:47:39] andrewbogott: ryan applied it in prod in his last few days as a way of demoing how trebuchet would work, but it's not used by anything. it is half a gig of files that a new mw needs to fetch [21:47:56] andrewbogott: so /srv/deployment/mediawiki/ is on all the app servers, but it's not doing anything; nothing is accessing it [21:48:01] MaxSem: done. bbl [21:48:03] andrewbogott: and the deployment scripts would break prod if someone were to run them [21:48:09] andrewbogott: since there have been many changes since then [21:48:09] (03PS1) 10MarkTraceur: Enable MMV by default on dewiki beta. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/137201 [21:48:12] ok [21:48:14] mutante, thanks a bunch! :) [21:48:27] (03CR) 10MarkTraceur: [C: 032] Enable MMV by default on dewiki beta. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/137201 (owner: 10MarkTraceur) [21:48:33] (03Merged) 10jenkins-bot: Enable MMV by default on dewiki beta. 
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/137201 (owner: 10MarkTraceur) [21:48:51] Updated config so nobody's confused. [21:49:10] (03CR) 10Andrew Bogott: [C: 031] "This seems to do what it says it does :)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/136920 (owner: 10Ori.livneh) [21:49:44] !log marktraceur updated /a/common to {{Gerrit|I409703a11}}: Enable MMV by default on dewiki beta. [21:49:50] Logged the message, Master [21:50:00] andrewbogott: and https://gerrit.wikimedia.org/r/#/c/137177/ (which is labs only and which bryan already tested in labs) pretty please, then i'll stop harassing you [21:50:03] No sync, syncing is silly [21:52:37] (03PS3) 10Ori.livneh: remove Deployment::Target['mediawiki'] [operations/puppet] - 10https://gerrit.wikimedia.org/r/136920 [21:52:47] (03CR) 10Ori.livneh: [C: 032 V: 032] remove Deployment::Target['mediawiki'] [operations/puppet] - 10https://gerrit.wikimedia.org/r/136920 (owner: 10Ori.livneh) [21:55:27] (03CR) 10Andrew Bogott: "The fiddly bits of this seem fine, but what's the rationale behind making a beta-specific class? Weren't we better off running shared cod" [operations/puppet] - 10https://gerrit.wikimedia.org/r/137177 (owner: 10Ori.livneh) [21:56:29] (03PS1) 10Rush: service accounts get 'systemuser' group by default [operations/puppet] - 10https://gerrit.wikimedia.org/r/137204 [21:56:43] (03PS2) 10Rush: service accounts get 'systemuser' group by default [operations/puppet] - 10https://gerrit.wikimedia.org/r/137204 [21:57:17] (03CR) 10Ori.livneh: "So, two points: 1) notice that the beta role was already including ::beta, which included a ::beta::config, etc. 
2) this isn't the end-sta" [operations/puppet] - 10https://gerrit.wikimedia.org/r/137177 (owner: 10Ori.livneh) [21:58:17] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% data above the threshold [250.0] [21:58:25] andrewbogott: btw, the deployment::target['mediawiki'] patch did the right thing; thanks [22:00:08] OK last thing (knock on more wood) we're going to do is back-back-port the preference name change from this morning [22:00:15] So....it's an i18n change and I'm going to scap. [22:00:21] Hurray [22:00:25] RECOVERY - Puppet freshness on db1007 is OK: puppet ran at Tue Jun 3 22:00:21 UTC 2014 [22:00:40] scap! All the cool kids do it [22:00:59] I've heard that [22:01:05] (03CR) 10Andrew Bogott: [C: 031] Migrate beta appserver configs to role::beta::appserver [operations/puppet] - 10https://gerrit.wikimedia.org/r/137177 (owner: 10Ori.livneh) [22:01:12] Man it's been a while since I deployed a code change [22:03:27] (03PS3) 10Rush: service accounts get 'systemuser' group by default [operations/puppet] - 10https://gerrit.wikimedia.org/r/137204 [22:04:24] (03CR) 10Rush: "WIP" [operations/puppet] - 10https://gerrit.wikimedia.org/r/137204 (owner: 10Rush) [22:04:29] andrewbogott: thanks very much [22:04:29] (03PS3) 10Ori.livneh: Migrate beta appserver configs to role::beta::appserver [operations/puppet] - 10https://gerrit.wikimedia.org/r/137177 [22:06:11] (03CR) 10Ori.livneh: [C: 032] Migrate beta appserver configs to role::beta::appserver [operations/puppet] - 10https://gerrit.wikimedia.org/r/137177 (owner: 10Ori.livneh) [22:08:10] andrewbogott: that one also applied correctly; thanks again [22:15:08] (03PS10) 10Ori.livneh: Add rsyslog module and port existing usage [operations/puppet] - 10https://gerrit.wikimedia.org/r/135447 [22:21:05] PROBLEM - Puppet freshness on tungsten is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 13:19:16 UTC [22:23:05] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Tue 
03 Jun 2014 16:21:49 UTC [22:24:10] (03PS2) 10Ori.livneh: rename role::mediawiki::job_runner -> role::mediawiki::jobrunner [operations/puppet] - 10https://gerrit.wikimedia.org/r/137193 [22:26:42] Scapping now [22:27:55] !log marktraceur Started scap: Update Media Viewer preference string for wmf7 - already backported to wmf6 [22:28:00] Logged the message, Master [22:38:10] !log git-deploy: Deploying integration/slave-scripts If2e2e675802f [22:38:15] Logged the message, Master [22:41:14] !log marktraceur Finished scap: Update Media Viewer preference string for wmf7 - already backported to wmf6 (duration: 13m 19s) [22:41:19] Logged the message, Master [22:41:53] Thanks guys, SWAT is good to go in 18 minutes [22:41:57] We don't appear to have any fires [22:47:36] RECOVERY - Puppet freshness on db1009 is OK: puppet ran at Tue Jun 3 22:47:30 UTC 2014 [23:00:05] mwalker, ori, MaxSem: The time is nigh to deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140603T2300) [23:01:00] ori: mwalker I'm trying to get a few patches swatted in this window, but might've fucked up the calendar. 
[23:01:02] looking [23:02:02] I'll swat [23:02:27] ori: https://wikitech.wikimedia.org/w/index.php?title=Deployments&diff=115149&oldid=115136, I'm trying to fix that fuckup now [23:02:27] hokay [23:04:05] oo, fixed now https://wikitech.wikimedia.org/wiki/Deployments [23:04:58] (03CR) 10Ori.livneh: [C: 032] Set $wgIncludejQueryMigrate = true; for all wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/137053 (https://bugzilla.wikimedia.org/44740) (owner: 10Krinkle) [23:05:23] (03Merged) 10jenkins-bot: Set $wgIncludejQueryMigrate = true; for all wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/137053 (https://bugzilla.wikimedia.org/44740) (owner: 10Krinkle) [23:06:22] !log ori Synchronized wmf-config/InitialiseSettings.php: I9dac0dc6a80: Set $wgIncludejQueryMigrate = true; for all wikis (duration: 00m 03s) [23:06:27] Logged the message, Master [23:10:25] !log ori Synchronized php-1.24wmf7/extensions/MobileApp: SWAT cherry-picks for MobileApp (duration: 00m 03s) [23:10:30] Logged the message, Master [23:11:05] !log ori Synchronized php-1.24wmf6/extensions/MobileApp: SWAT cherry-picks for MobileApp (duration: 00m 04s) [23:11:10] Logged the message, Master [23:11:46] ori: how long does it take for RL cache to clear again? http://test.wikipedia.org/w/load.php?debug=true&lang=en&modules=mobile.app.pagestyles.android&only=styles&skin=vector&* is still showing pre-deploy responses [23:14:38] !log ori Synchronized php-1.24wmf7/extensions/MobileApp: SWAT cherry-picks for MobileApp (with patch) (duration: 00m 04s) [23:14:42] YuviPanda: i synced it before jenkins merged [23:14:42] Logged the message, Master [23:14:44] try now [23:14:54] ori: woo, wfm! [23:15:14] ori: and on enwiki as well. [23:15:15] cool [23:15:15] ori: thank you! [23:15:17] [23:15:19] np [23:22:34] mutante: https://gerrit.wikimedia.org/r/#/c/137193/ is trivial [23:23:24] ori: Hm.. GeoIP cookie thing, what's the status of that? 
[23:23:50] Currently on en.wikipedia.org I'm getting 3 cookies, and in the end it is GeoIP="::::v6" [23:24:02] .. set from the same HTTP request that also sets [23:24:13] Geo = {"city":"(null)","country":"NL","lat":"5xx","lon":"5xx","IP":"83.161.x"} [23:24:24] so that request knows it, and yet still sets it wrong [23:24:31] or is that intentional? [23:24:32] it's normal [23:24:42] actually, no [23:24:49] it should not launch a request for /geoiplookup [23:24:53] that's probably a centralnotice bug [23:25:11] GeoIP="::::v6" <-- this is normal [23:25:58] Cookie 1: Value: GeoIP="::::v6", Domain: .wikipedia.org [23:25:58] Cookie 2: Value: GeoIP="NL%3A_null)%3A52.500000%3A5.750000%3Av4", Domain: en.wikipedia.org [23:25:58] Cookie 3: Value: GeoIP="::::v6", Domain: .wikimedia.org [23:26:23] and HTTP 200 URL:https://geoiplookup.wikimedia.org/ [23:27:02] that one is requested with Cookie: GeoIP=::::v6, no Cookie in response [23:27:07] and body of Geo = {"city":"(null)","country":"NL","lat":"5xx","lon":"5xx","IP":"83.161.x"} [23:27:37] There shouldn't be .wikimedia.org wide cookies [23:27:41] that affects bits [23:27:50] what's the output of: mw.loader.using( 'mediawiki.inspect', function () { mw.inspect.grep('geoiplookup'); } ); [23:27:54] ori: interesting, what is that ::v6 cookie for? 
[23:28:21] Krinkle: actually, everything is working properly [23:28:23] here's the flow: [23:28:26] ["ext.centralNotice.bannerController"] [23:28:35] you request en.wikipedia.org [23:28:42] https://github.com/wikimedia/mediawiki-extensions-CentralNotice/blob/master/modules/ext.centralNotice.bannerController/bannerController.js#L226-L234 [23:28:51] you're dual-stack, preferring ipv6 [23:29:03] we can't/don't do geoip lookups of ipv6 addresses [23:29:11] but we have https://geoiplookup.wikimedia.org/ which is ipv4 only [23:29:22] so centralnotice sees GeoIP="::::v6" [23:29:25] it doesn't know that you're dual-stack [23:29:36] so it launches a surrogate request for https://geoiplookup.wikimedia.org/ [23:29:43] it works, you connect via ipv4, lookup is successful [23:30:22] OK [23:30:30] And in case of strict ipv6? [23:30:39] geoiplookup.wikimedia.org fails to resolve [23:30:47] but we live with that [23:30:51] and strict ipv4? [23:30:59] we never launch a surrogate request [23:31:16] either ip lookup succeeds or fails, but if it fails, you get GeoIP="::::v4" [23:31:24] and centralnotice knows not to bother with geoiplookup.wikimedia.org [23:31:29] because the issue is not ipv6 [23:31:45] ori: in case of strict ipv4 there is no surrogate req, but still a normal one, right? [23:31:59] where does the Geo come from with ipv4? [23:32:00] there's no request apart from the request to get the page [23:32:25] $ curl -Is http://en.wikipedia.org/wiki/Main_page | grep Set-Cookie [23:32:26] Set-Cookie: GeoIP=US:San_Francisco:37.7749:-122.4194:v4; Path=/; Domain=.wikipedia.org [23:32:52] ori: ah, I see. [23:33:13] that's done directly in varnish, see templates/varnish/geoip.inc.vcl.erb in operations/puppet [23:33:19] ori: yeah [23:33:47] and I suppose even though geoiplookup is wmf as well (indirectly anyway), we can't use that because it's not on *.wikipedia.org [23:34:06] can't use it for what? 
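The flow described above (a `GeoIP` cookie of the form `country:city:lat:lon:ipversion`, with a surrogate request to geoiplookup only when the location is empty and the connection was IPv6) can be sketched roughly as follows. This is a minimal illustration and not CentralNotice's actual code: the names `GeoIPCookie`, `parse_geoip_cookie` and `needs_surrogate_lookup` are hypothetical, and the five-field layout is inferred only from the cookie values quoted in the log.

```python
from typing import NamedTuple


class GeoIPCookie(NamedTuple):
    """Fields inferred from the values in the log, e.g. US:San_Francisco:37.7749:-122.4194:v4."""
    country: str
    city: str
    lat: str
    lon: str
    ip_version: str  # "v4" or "v6" in the samples from the log


def parse_geoip_cookie(value: str) -> GeoIPCookie:
    """Split a GeoIP cookie value into its five colon-separated fields.

    Empty fields (as in '::::v6') are kept as empty strings.
    """
    parts = value.strip('"').split(":")
    if len(parts) != 5:
        raise ValueError(f"unexpected GeoIP cookie value: {value!r}")
    return GeoIPCookie(*parts)


def needs_surrogate_lookup(value: str) -> bool:
    """Hypothetical decision rule: fall back to the IPv4-only lookup host
    only when no location is known and the request arrived over IPv6."""
    cookie = parse_geoip_cookie(value)
    return cookie.country == "" and cookie.ip_version == "v6"


if __name__ == "__main__":
    for sample in ('"::::v6"', '"::::v4"', "US:San_Francisco:37.7749:-122.4194:v4"):
        print(sample, "->", needs_surrogate_lookup(sample))
```

Under these assumptions only the `::::v6` value triggers the fallback; a failed IPv4 lookup (`::::v4`) does not, matching the "because the issue is not ipv6" remark.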
[23:34:09] and we can't do it from within the regular request because we need the connection to switch to ipv4 [23:34:13] to set it from the main page request [23:34:33] right now it seems any dual stack users never ever make use of the cookie to avoid re-requesting it from geoiplookup on every page view [23:35:07] All the cookie is set to is ::::v6, and centralnotice requests from geoiplookup.wikimedia.org every time [23:35:08] if so, that's a bug [23:35:22] hm [23:35:34] I guess the idea is to use geoiplookup.wikimedia.org to request it client-side and push a cookie back to the browser to cache it, right [23:35:51] that's what I was going to suggest [23:35:59] I guess that is failing because varnish is overriding it again [23:36:07] (if CentralNotice is doing that at all, that is) [23:36:24] ea2a45d81d7 (Ori Livneh 2014-02-18 02:21:19 -0800 134) /* Perform GeoIP look-up and send the result as a session cookie */ [23:36:24] ea2a45d81d7 (Ori Livneh 2014-02-18 02:21:19 -0800 135) if (req.http.Cookie !~ "(^|;\s*)GeoIP=[^;]") { [23:36:24] ea2a45d81d7 (Ori Livneh 2014-02-18 02:21:19 -0800 136) call geoip_cookie; [23:36:26] ea2a45d81d7 (Ori Livneh 2014-02-18 02:21:19 -0800 137) } [23:36:30] is the regexp wrong? hm [23:37:48] i wonder if it's a race condition [23:38:06] i think it may be [23:38:28] that is: [23:38:44] 1) you request http://en.wikipedia.org/wiki/Main_Page , you get cookied with GeoIP=::::v6 [23:39:22] 2) CentralNotice loads, sends a request to //geoiplookup.wikimedia.org [23:39:23] Is that regex supposed to skip geoip_cookie if centralauth.js (or varnish itself in a previous request) was able to set a useful cookie with location? 
[23:40:33] Krinkle: yes [23:40:37] greg-g: http://lists.wikimedia.org/pipermail/wikitech-ambassadors/2014-June/000719.html http://lists.wikimedia.org/pipermail/wikitech-l/2014-June/076842.html http://lists.wikimedia.org/pipermail/mediawiki-l/2014-June/042916.html [23:40:44] 3) request is made to some other resource on en.wikipedia.org [23:41:04] 4) geoiplookup req succeeds, centralnotice calls $.cookie to set GeoIP=NL:... [23:41:27] 5) other request comes back and sets GeoIP=::::v6 [23:41:30] yeah [23:41:33] right [23:41:38] I forgot that last part [23:41:53] I was going to say, it's quite deterministic right now [23:42:34] Krinkle: yep, forwarded to engineering with a comment [23:42:43] Krinkle: thanks muchly [23:43:01] also kudos to James_F who (re)wrote most of it. [23:43:08] :) [23:43:26] yeah, sorry I couldn't help today, was in a long meeting with a guy from FB release team [23:43:39] $ curl -Is http://en.wikipedia.org/wiki/Main_page -H "Cookie: GeoIP=US:San_Francisco:37.7749:-122.4194:v4;" | grep Set-Cookie [23:43:39] greg-g: How was that? [23:43:52] greg-g: Progress-ful? [23:43:53] 19:20 < ori> greg-g: it was *very* useful! [23:43:55] ori: > Set-Cookie: GeoIP=NL::52.5000:5.7500:v4; Path=/; Domain=.wikipedia.org [23:43:59] Aha. Good. [23:44:00] Krinkle: yeah [23:44:01] :) [23:44:05] regexp must be wrong [23:44:08] I'll let others assess ;) [23:44:14] i just came to the same realization [23:46:15] ori: Hm.. but also it shouldn't set .wikimedia.org cookies, whatever thing is doing that. [23:46:39] my browser is sending them to all load.php requests now [23:46:41] :( [23:46:55] Coming from https://meta.wikimedia.org/w/index.php?title=User:Krinkle/RTRC-dev.js&action=raw&ctype=text/javascript [23:46:57] Set-Cookie:GeoIP=::::v6; Path=/; Domain=.wikimedia.org [23:47:10] Referer:https://en.wikipedia.org/wiki/Main_Page [23:47:45] ori: I'll file bugs if you want [23:47:57] please do, i'm investigating [23:48:11] thanks for poking about this! 
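One way to sanity-check the pattern from the quoted VCL line is to run it against the Cookie headers that appear in the log. The sketch below uses Python's `re` module, which is an assumption: it is not the regex engine Varnish uses, and VCL string handling may treat `\s` differently, so this only shows how the pattern behaves as an ordinary regular expression and does not settle where the bug actually was. The helper name `has_geoip_cookie` is hypothetical.

```python
import re

# Pattern copied verbatim from the VCL condition quoted in the log.
GEOIP_COOKIE_RE = re.compile(r"(^|;\s*)GeoIP=[^;]")


def has_geoip_cookie(cookie_header: str) -> bool:
    """Return True if the header carries a non-empty GeoIP cookie,
    according to Python's interpretation of the pattern."""
    return GEOIP_COOKIE_RE.search(cookie_header) is not None


if __name__ == "__main__":
    samples = [
        "GeoIP=US:San_Francisco:37.7749:-122.4194:v4;",  # header from the curl test
        "GeoIP=::::v6",                                   # placeholder value for IPv6 clients
        "foo=bar; GeoIP=NL::52.5000:5.7500:v4",           # GeoIP after another cookie
        "foo=bar",                                        # no GeoIP cookie at all
        "GeoIP=",                                         # empty value
    ]
    for header in samples:
        print(repr(header), "->", has_geoip_cookie(header))
```

Read this way, the pattern matches both a populated value and the `::::v6` placeholder, so a condition written with `!~` would skip `geoip_cookie` for either of them.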
[23:48:52] ori: also, can you tell me again where the code is for the script you have that runs periodically server-side making phantomjs requests to various urls and asserts things? [23:48:56] (and how it is run) [23:49:10] I'd like to add a bunch more stuff to it. [23:49:22] modules/webperf/files i think? [23:49:24] in operations-puppet [23:50:22] Ah, nice. [23:50:31] I knew that's where the eventlogging handlers were, didn't know about asset-check [23:50:34] cool