[00:00:22] "Your 131072x1 screen size is bogus. Expect trouble.
[00:00:26] hah, right
[00:01:58] !log maxsem synchronized php-1.24wmf4/extensions/MobileFrontend/ 'bug 65042'
[00:02:05] Logged the message, Master
[00:03:10] (CR) Dzahn: "Dmitry, Yuvi: on caesium: Unixaccount[Dmitry Brant]/User[dbrant]/ensure: created" [operations/puppet] - https://gerrit.wikimedia.org/r/132109 (owner: Dzahn)
[00:03:31] !log maxsem synchronized php-1.24wmf3/extensions/MobileFrontend/ 'bug 65042'
[00:03:36] Logged the message, Master
[00:22:06] (PS2) Dzahn: add dbrant to stat1003 "special users" and bast [operations/puppet] - https://gerrit.wikimedia.org/r/132110
[00:28:56] (PS3) Dzahn: add dbrant to stat1003 "special users" and bast [operations/puppet] - https://gerrit.wikimedia.org/r/132110
[00:30:37] (PS4) Dzahn: add dbrant to stat1003 "special users" and bast [operations/puppet] - https://gerrit.wikimedia.org/r/132110
[00:33:35] (CR) Dzahn: [C: 2] add dbrant to stat1003 "special users" and bast [operations/puppet] - https://gerrit.wikimedia.org/r/132110 (owner: Dzahn)
[00:33:38] (CR) Gergő Tisza: [C: -1] "Waiting for core dependency." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/132112 (owner: Gergő Tisza)
[00:39:06] (PS1) Ottomata: Copying glam_nara logs from erbium to stat1002 [operations/puppet] - https://gerrit.wikimedia.org/r/132326
[00:39:55] (CR) Ottomata: [C: 2 V: 2] Copying glam_nara logs from erbium to stat1002 [operations/puppet] - https://gerrit.wikimedia.org/r/132326 (owner: Ottomata)
[00:40:03] (Restored) Dzahn: Introduce an admins::release user group [operations/puppet] - https://gerrit.wikimedia.org/r/116019 (owner: Hoo man)
[00:41:25] (CR) Dzahn: [C: 2] "http://lists.wikimedia.org/pipermail/labs-l/2014-May/002433.html , http://lists.wikimedia.org/pipermail/labs-l/2014-May/002435.html etc" [operations/puppet] - https://gerrit.wikimedia.org/r/132322 (https://bugzilla.wikimedia.org/65048) (owner: Tim Landscheidt)
[00:58:19] springle, yt?
[01:00:38] PROBLEM - Puppet freshness on analytics1026 is CRITICAL: Last successful Puppet run was Tue May 6 08:27:45 2014
[01:01:09] I see "DB connection was already closed." exceptions on various wikis
[01:02:19] (PS9) Dzahn: remove release users from admins::restricted [operations/puppet] - https://gerrit.wikimedia.org/r/116019 (owner: Hoo man)
[01:07:03] (PS10) Dzahn: remove release users from admins::restricted [operations/puppet] - https://gerrit.wikimedia.org/r/116019 (owner: Hoo man)
[01:12:34] (CR) Dzahn: [C: 2] "Hoo man, thank you for the original suggestion. as you can see i recycled it now that we have a group for it and solved it that way." [operations/puppet] - https://gerrit.wikimedia.org/r/116019 (owner: Hoo man)
[01:17:42] (CR) Dzahn: "after I61e34e2ee and Ic4695421c now we can move more users until this becomes obsolete" [operations/puppet] - https://gerrit.wikimedia.org/r/126941 (owner: Dzahn)
[01:27:03] (PS4) Dzahn: let bastion hosts have base::firewall [operations/puppet] - https://gerrit.wikimedia.org/r/96424
[01:27:47] (CR) Dzahn: [C: 1] delete mwlib.pp? (pediapress) move to pdf/ocg? [operations/puppet] - https://gerrit.wikimedia.org/r/132136 (owner: Dzahn)
[01:35:13] (CR) Springle: [C: -1] bacula: allow mysqldumps to be kept locally (2 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/132214 (owner: Alexandros Kosiaris)
[01:39:50] (CR) Springle: "Cool! Asked a couple questions on the other changeset." [operations/puppet] - https://gerrit.wikimedia.org/r/131976 (owner: Springle)
[01:42:42] (CR) Springle: Backup role::mariadb::dbstore (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/132215 (owner: Alexandros Kosiaris)
[01:54:28] (CR) Springle: bacula: allow mysqldumps to be kept locally (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/132214 (owner: Alexandros Kosiaris)
[02:12:49] PROBLEM - Disk space on virt0 is CRITICAL: DISK CRITICAL - free space: /a 3798 MB (3% inode=99%):
[02:20:49] PROBLEM - Disk space on virt0 is CRITICAL: DISK CRITICAL - free space: /a 3431 MB (3% inode=99%):
[02:41:09] !log LocalisationUpdate completed (1.24wmf3) at 2014-05-09 02:40:06+00:00
[02:41:17] Logged the message, Master
[03:00:49] RECOVERY - Disk space on virt0 is OK: DISK OK
[03:09:59] !log LocalisationUpdate completed (1.24wmf4) at 2014-05-09 03:08:55+00:00
[03:10:06] Logged the message, Master
[03:24:56] !log installed db106[678]
[03:25:03] Logged the message, Master
[03:25:19] (PS1) Springle: deploy db1066 to s1 [operations/puppet] - https://gerrit.wikimedia.org/r/132327
[03:27:25] (CR) Springle: [C: 2] deploy db1066 to s1 [operations/puppet] - https://gerrit.wikimedia.org/r/132327 (owner: Springle)
[03:37:59] (PS1) Springle: deploy db1067 to s2 [operations/puppet] - https://gerrit.wikimedia.org/r/132329
[03:41:08] (CR) Springle: [C: 2] deploy db1067 to s2 [operations/puppet] - https://gerrit.wikimedia.org/r/132329 (owner: Springle)
[03:46:58] !log springle synchronized wmf-config/db-eqiad.php 'reduce db1036 and db1051 load while cloning'
[03:47:05] Logged the message, Master
[03:47:16] !log xtrabackup clone db1036 to db1067, db1051 to db1066
[03:47:23] Logged the message, Master
[03:50:28] PROBLEM - MySQL Processlist on db1051 is CRITICAL: CRIT 168 unauthenticated, 0 locked, 0 copy to table, 0 statistics
[03:51:28] RECOVERY - MySQL Processlist on db1051 is OK: OK 5 unauthenticated, 0 locked, 0 copy to table, 0 statistics
[03:51:42] (PS1) Springle: pool db1066 in s1, db1067 in s2, warm up [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/132330
[03:52:08] (CR) Springle: [C: -2] pool db1066 in s1, db1067 in s2, warm up [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/132330 (owner: Springle)
[04:01:38] PROBLEM - Puppet freshness on analytics1026 is CRITICAL: Last successful Puppet run was Tue May 6 08:27:45 2014
[04:05:16] PROBLEM - mysqld processes on db1066 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[04:05:57] PROBLEM - mysqld processes on db1067 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[04:09:13] !log LocalisationUpdate ResourceLoader cache refresh completed at Fri May 9 04:08:07 UTC 2014 (duration 8m 6s)
[04:09:20] Logged the message, Master
[05:00:15] !log installed db1067, db1070, db1071
[05:00:22] Logged the message, Master
[05:18:04] twkozlowski: yep! awake in Zurich now ;)
[05:30:29] PROBLEM - MySQL Processlist on db1036 is CRITICAL: CRIT 87 unauthenticated, 0 locked, 0 copy to table, 0 statistics
[05:31:28] RECOVERY - MySQL Processlist on db1036 is OK: OK 0 unauthenticated, 0 locked, 0 copy to table, 0 statistics
[05:35:31] PROBLEM - MySQL Processlist on db1051 is CRITICAL: CRIT 79 unauthenticated, 0 locked, 0 copy to table, 0 statistics
[05:37:31] RECOVERY - MySQL Processlist on db1051 is OK: OK 3 unauthenticated, 0 locked, 0 copy to table, 0 statistics
[06:26:51] RECOVERY - mysqld processes on db1067 is OK: PROCS OK: 1 process with command name mysqld
[06:29:01] PROBLEM - MySQL Replication Heartbeat on db1067 is CRITICAL: CRIT replication delay 2029 seconds
[06:33:01] RECOVERY - MySQL Replication Heartbeat on db1067 is OK: OK replication delay -0 seconds
[06:40:35] (CR) Mwalker: "If we're going to do something with this; it should be deleted. However, the pdf[2|3] servers might still be using this?" [operations/puppet] - https://gerrit.wikimedia.org/r/132136 (owner: Dzahn)
[06:41:01] RECOVERY - mysqld processes on db1066 is OK: PROCS OK: 1 process with command name mysqld
[06:45:01] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 5 below the confidence bounds
[06:48:07] should I worry about those icinga warnings?
[06:49:11] <_joe_> greg-g: which ones?
[06:49:12] Maybe it will happen again in 50 minutes
[06:49:18] <_joe_> the anomaly detection ones, not.
[06:49:29] <_joe_> it's something I need to fix
[06:49:30] <_joe_> :)
[06:49:51] <_joe_> the ones on the new databases, I think springle knows something about that
[06:51:04] kk
[06:51:10] * greg-g mostly ignores then :)
[06:52:54] <_joe_> I get back to my puppet misery
[06:53:00] (CR) Springle: [C: 2] pool db1066 in s1, db1067 in s2, warm up [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/132330 (owner: Springle)
[06:53:08] (Merged) jenkins-bot: pool db1066 in s1, db1067 in s2, warm up [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/132330 (owner: Springle)
[06:53:26] <_joe_> greg-g: are you at the hackathon?
[06:54:11] _joe_: yeppers :)
[06:54:32] <_joe_> cool! most people in my TZ
[06:55:00] Probably not most. Lots!
[06:55:23] <_joe_> marktraceur: well, a lot of people in engineering let's say
[06:55:38] !log springle synchronized wmf-config/db-eqiad.php 'pool db1066 in s2, db1067 in s1, warm up'
[06:55:39] Good compromise
[06:55:42] (PS1) Springle: white space, comments, and alignment perfectionism [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/132337
[06:55:46] Logged the message, Master
[06:55:47] Peace in our time
[06:56:30] <_joe_> springle: oh you're one of us, I see
[06:56:43] * _joe_ OC indenter
[06:57:00] heh. sort of. some of these took me a year to get around to doing
[06:57:24] (CR) Springle: [C: 2] white space, comments, and alignment perfectionism [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/132337 (owner: Springle)
[07:02:01] PROBLEM - Puppet freshness on analytics1026 is CRITICAL: Last successful Puppet run was Tue May 6 08:27:45 2014
[07:05:21] (PS1) Springle: Raise db1066 and db1067 to normal load. Depool db1043 from s1 for reassignment. Poor thing is too small for the S1 big league now :-( [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/132338
[07:06:33] (CR) Springle: [C: -2] "until warmed up" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/132338 (owner: Springle)
[07:07:01] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 5 below the confidence bounds
[07:43:30] (CR) Hoo man: "mh... I think we need a new group for some members of the restricted group which need log access, but don't need deploy rights. Eg. Daniel" [operations/puppet] - https://gerrit.wikimedia.org/r/126941 (owner: Dzahn)
[08:15:01] RECOVERY - Disk space on analytics1013 is OK: DISK OK
[08:16:01] !log restarting Zuul (seems some jobs are not properly registered)
[08:16:08] Logged the message, Master
[08:16:56] !log Jenkins: un pooled integration-slave1002 and rebooting the instance.
[08:17:03] Logged the message, Master
[08:18:35] !log Jenkins: un pooled integration-slave1001 and rebooting the instance.
[08:18:44] Logged the message, Master
[08:21:51] !log apt-get upgraded apache on gallium and lanthanum
[08:21:59] Logged the message, Master
[08:34:59] (PS2) Reedy: Set $wgCategoryCollation to 'uca-lv' on lvwiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/132173 (https://bugzilla.wikimedia.org/65003) (owner: Odder)
[08:35:03] (CR) Reedy: [C: 2] Set $wgCategoryCollation to 'uca-lv' on lvwiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/132173 (https://bugzilla.wikimedia.org/65003) (owner: Odder)
[08:35:13] (Merged) jenkins-bot: Set $wgCategoryCollation to 'uca-lv' on lvwiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/132173 (https://bugzilla.wikimedia.org/65003) (owner: Odder)
[08:36:49] !log reedy synchronized wmf-config/InitialiseSettings.php 'I4657fe64572fb3db22e3b48a87df7112b2248e35'
[08:36:56] Logged the message, Master
[08:39:26] Thanks, Reedy!
[08:42:26] finished
[08:44:23] (CR) Reedy: [C: -1] "reedy@tin:~$ cat test.php" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/131914 (https://bugzilla.wikimedia.org/48618) (owner: Withoutaname)
[08:45:03] it's probably about time we just went ahead and deployed these collations on all wikis in languages we support
[08:45:34] (CR) Reedy: "Ignore me" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/131914 (https://bugzilla.wikimedia.org/48618) (owner: Withoutaname)
[08:47:09] (CR) Reedy: [C: 2] Add mindat.org to $wgCopyUploadsDomains [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/131968 (owner: Steinsplitter)
[08:47:23] (PS7) Reedy: Add mindat.org to $wgCopyUploadsDomains [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/131968 (owner: Steinsplitter)
[08:48:12] (CR) Reedy: [C: 2] Add mindat.org to $wgCopyUploadsDomains [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/131968 (owner: Steinsplitter)
[08:48:20] (Merged) jenkins-bot: Add mindat.org to $wgCopyUploadsDomains [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/131968 (owner: Steinsplitter)
[08:48:25] MatmaRex: I realised after I merged it...
[08:48:27] springle: About?
[08:48:36] MatmaRex: Don't the large ones upset mysql?
[08:49:01] Reedy: enwiki, maybe
[08:49:17] Reedy: we deployed it on frwiki a few months ago with some minor hiccups
[08:49:25] plwiki (even longer ago) went smoothly
[08:49:41] there's some patch in gerrit that's supposed to make the updateCollation.php script better
[08:50:57] mmmm
[08:52:23] Got a link?
[08:52:51] yeah
[08:52:54] oh, it's yours :P
[08:52:56] https://gerrit.wikimedia.org/r/#/c/106162/
[08:53:08] Well, isn't lvwiki a small one?
[08:54:28] yup
[08:54:39] just shy of 300,000
[08:56:49] stupid gerrit
[09:05:54] (CR) Phuedx: [C: -1] "See inline." (1 comment) [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/132308 (owner: Robmoen)
[09:06:26] (CR) Faidon Liambotis: [C: 1] Add new codfw allocations, core router loopbacks & transfer nets [operations/dns] - https://gerrit.wikimedia.org/r/132188 (owner: Mark Bergsma)
[09:06:31] (CR) Faidon Liambotis: [C: 1] Allocate codfw private IP space, create management network [operations/dns] - https://gerrit.wikimedia.org/r/132195 (owner: Mark Bergsma)
[09:15:39] akosiaris: around ?
[09:15:46] kind of
[09:16:23] can I help you with something ?
[09:16:28] Reedy: aude: Shall I deploy for wmf3?
[09:16:30] short question: would it be useful to have a firewall role class with common ports: i.e role::firewall::http
[09:16:32] Wikidata change, that is
[09:16:45] role::firewall::https
[09:16:47] hoo: feel free
[09:16:48] either fine with me
[09:16:55] ok
[09:17:12] change looks good
[09:17:20] matanya: that just contains ferm::rule/service resources ?
[09:17:26] etc, and call those instead of explicit ferm::service on each role you want ferm rules?
[09:17:27] yes
[09:17:35] want me to merge it?
[09:17:46] you prevent duplicate resources this way
[09:17:51] how ?
[09:18:28] no if !defined then define please
[09:18:36] please please :-)
[09:19:00] if one declares ferm::service {http: on one role and then on another role the same name, you have a conflict
[09:19:09] alright
[09:19:19] but if you only include this won't be the case, am i right here?
[09:21:14] akosiaris: the down side would be losing flexibility on srange
[09:22:52] matanya: technically correct. Not a bad idea, let me think about it a bit
[09:23:25] hmm you lose the comment/reason the rule is there
[09:23:50] right now we have files like /etc/ferm/blablah/90_bugzilla_http
[09:24:14] which make it easy to know why a port is open. I'd like to keep that info, it is useful
[09:24:32] but not really easy when using includes
[09:24:53] !log hoo synchronized php-1.24wmf3/extensions/Wikidata/ 'Fix Job injection error handling'
[09:25:00] Logged the message, Master
[09:25:31] deploy!
[09:25:42] Mash F5
[09:25:44] matanya: other than that lgtm
[09:26:35] well the role::firewall:: namespace may not be the best, then again I can not think something better
[09:27:24] i can add it as a class to the module - ferm/common_port.pp
[09:27:25] or so
[09:27:57] then include common_ports_http
[09:28:52] but not so good to get module tinker with such a use case
[09:28:57] exactly
[09:31:03] damn, forgot to update the submodule, meh
[09:31:29] gah, was worred
[09:31:38] still flooding the logs
[09:31:43] worried* and checking
[09:32:00] I pulled and everything, but forgot to update the submodule, sorry :/
[09:32:01] i'll probably go to the maps room or the lobby now
[09:32:08] * aude done that
[09:32:13] ok
[09:32:20] Real sync is happening now
[09:32:23] ok
[09:32:48] !log hoo synchronized php-1.24wmf3/extensions/Wikidata/ 'Fix Job injection error handling'
[09:32:56] yay
[09:32:57] there we go
[09:33:07] verified
[09:33:12] thanks
[09:33:12] !
[09:33:14] :)
[09:33:26] alright, to the lobby
[09:33:30] yep
[09:33:31] * aude off
[09:33:43] _joe_: so $::nagios_group is not really defined anywhere right ?
[09:33:58] so it is your call here, akosiaris , think of it a bit more, and let me know later on please :)
[09:34:36] matanya: ok thanks for the idea!
[09:34:46] 17 hours and 10 minutes
[09:39:01] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected
[09:43:51] Reedy: back. what's up?
[09:46:04] Just wondering about the category collation updates and the issues they may cause on the db servers
[09:46:54] oh the thing that played havoc with the adaptive hash on frwiki a while back?
[09:47:13] while, as in months ago now
[09:47:29] that sounds about right :)
[09:47:41] if i'm recalling the right one: it just needed the adaptive hash disabled on the relevant master for the duration
[09:47:48] other traffic didn't suffer too much
[09:48:03] though it was during the quieter pacific hours
[09:48:22] which wiki needs doing now?
[09:48:47] you're going to say one of the big ones, right? :)
[09:49:03] haha
[09:49:07] There's none that "need" doing
[09:49:28] I think a lot of the big ones have actually been done...
[09:49:44] s/a lot/some/
[09:50:14] heh ok. i'd like to baby sit the script but fine to schedule some
[10:03:01] PROBLEM - Puppet freshness on analytics1026 is CRITICAL: Last successful Puppet run was Tue May 6 08:27:45 2014
[10:06:29] Export from toolserver fine
[10:06:34] import to tool labs
[10:06:34] ERROR 1062 (23000) at line 161: Duplicate entry '????????' for key 'User'
[10:06:36] ty mysql
[10:07:13] You're in Zurich, Reedy?
[10:07:40] I am
[10:07:48] Should I hide?
[10:08:03] springle: Any idea if mysqldump does something weird with encoding?
[10:08:08] No, just wondering why you added that collation bug to the Zurich blocker
[10:08:12] Makes sense now :-)
[10:14:56] (CR) Divec: "I've commented on Bugzilla 64127: We're just checking out the situation with regards to Chinese versions of the names "Translations:" and " [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/127584 (https://bugzilla.wikimedia.org/64127) (owner: Odder)
[10:15:12] --default-character-set=latin1
[10:19:24] Reedy: check mysqldump version. old ones use latin1
[10:19:39] also check table char sets on the source
[10:19:44] mysqldump Ver 10.13 Distrib 5.5.12, for solaris10 (i386)
[10:20:03] well, not old :)
[10:22:07] mysqldump --default-character-set=latin1 -N p_awb_usagestats > usagestats.sql
[10:22:15] mysql --defaults-file="${HOME}"/replica.my.cnf -h tools-db --default-character-set=latin1 s51835_usagestats < usagestats.sql
[10:22:41] solaris10 ?
[10:22:47] toolserver
[10:22:53] (migrating stuff from it)
[10:23:04] yeah, somehow I just guessed
[10:23:29] heh
[10:24:03] tools.awb@tools-login:~$ grep ???????? usagestats.sql -c
[10:24:03] 41
[10:24:03] srsly
[10:24:09] Reedy: are the source tables latin1?
[10:24:27] Not sure
[10:24:30] I was going against http://forums.mysql.com/read.php?103,275798,281015#msg-281015
[10:24:32] let me look
[10:25:56] Ugh
[10:26:27] Yes, for that database
[10:26:32] the other I want to do is utf8
[10:29:14] Reedy: I'd try dumping everything with mysqldump default utf8. the table defs should still get the correct latin1 (if that's still what you want)
[10:29:20] also check that replica.cnf
[10:29:33] see if it's setting any character set stuff
[10:29:57] then look at default char set for server, connection, db, and tables on both source and dest boxes
[10:30:41] then set it all to utf8, utf8mb4, or binary, for future sanity :)
[10:31:02] I've no idea why it's latin1 :(
[10:31:35] This goes back over 6 years
[10:31:55] any myisam tables?
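One plausible way to end up with `Duplicate entry '????????'`: when text is converted to a character set that cannot represent it, MySQL substitutes `?` for each unconvertible character, so distinct values can collapse into the same string. A minimal Python sketch of that collapse follows. The two usernames are made-up examples, and Python's `errors="replace"` only stands in for the server-side conversion, so treat this as an illustration rather than a diagnosis of the toolserver dump.

```python
# Illustration only: hypothetical usernames, not taken from the real table.
names = ["Владимир", "Светлана"]  # two distinct 8-character Cyrillic names


def to_latin1_lossy(text: str) -> str:
    """Convert to latin1, replacing unrepresentable characters with '?'."""
    return text.encode("latin-1", errors="replace").decode("latin-1")


lossy = [to_latin1_lossy(n) for n in names]

# Distinct inputs that become identical would violate a unique key on import.
collided = len(set(lossy)) < len(set(names))
print(lossy, collided)
```

Dumping and loading with a character set that can represent every stored character (utf8 in the conversation above) avoids the lossy step entirely.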
[10:32:26] I think they were innodb
[10:32:50] --default-character-set=utf8 seems to be working
[10:32:56] innodb, so the data itself shouldn't have dupe keys
[10:33:06] Yeah
[10:39:29] (PS2) Springle: Raise db1066 and db1067 to normal load. Depool db1043 from s1 for reassignment. Poor thing is too small for the S1 big league now :-( [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/132338
[10:39:52] (CR) Springle: [C: 2] Raise db1066 and db1067 to normal load. Depool db1043 from s1 for reassignment. Poor thing is too small for the S1 big league now :-( [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/132338 (owner: Springle)
[10:39:59] (Merged) jenkins-bot: Raise db1066 and db1067 to normal load. Depool db1043 from s1 for reassignment. Poor thing is too small for the S1 big league now :-( [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/132338 (owner: Springle)
[10:43:04] !log springle synchronized wmf-config/db-eqiad.php 'raise db1067 and db1066 to normal load. depool db1043'
[10:43:11] Logged the message, Master
[10:45:05] (PS2) Giuseppe Lavagetto: Fix the use of $nagios_group. [operations/puppet] - https://gerrit.wikimedia.org/r/132187
[10:46:28] [11:43:45] Hello, for some hours in Wikipedia projects special:export produces a corrupt xml-file with html (shows an internal error) at the end of the file, see http://pastebin.com/aqUGNc7Y Is this already known? Thx and best wishes
[10:48:14] <_joe_> Reedy: ?
[10:48:29] I'm wondering if it's possible it's related to springles db moves
[10:52:44] <_joe_> Reedy: enwiki works correctly
[10:53:20] I suspect it's likely if you are exporting one while the db change happens
[10:54:33] <_joe_> in that case, I should see some spike in errors for that page in our varnish logs, let me check
[10:54:44] no db boxes have actually been changed recently. have added new ones, and depooled one, but not turned anything off
[10:54:45] <_joe_> Reedy: around what time?
[10:55:26] the depooled box db1043 (enwiki) is still accessible to mw nodes
[10:55:57] [8e37d443] 2014-05-09 07:10:15: Fatal exception of type DBUnexpectedError
hi all, I'm about to merge the submodule changes I emailed about last week
[11:01:04] i'm going to do them one at a time
[11:01:14] and run puppet on a few relevant machines after I do so
[11:01:23] to make sure they work exactly as they did before
[11:01:32] this will turn nginx, mariadb and varnish into submodules
[11:01:37] there should be no effective change
[11:01:42] <_joe_> mmmh
[11:01:48] haha
[11:01:53] <_joe_> I dislike having the repo so scattered around very much
[11:02:03] oh, there are such good reasons for doing so!
[11:02:10] <_joe_> I hate submodules in general, more so within gerrit
[11:02:11] i'm doing this for a reason, not willy-nilly :)
[11:02:25] <_joe_> ottomata: I know, I'm just moaning :P
[11:02:37] i'm at the hackathon now, and this is in preparation for the production-like mediawiki vagrant project
[11:02:46] <_joe_> cool
[11:02:48] which will allow us to use the same puppet modules in multiple environments
[11:03:17] aside from having to commit in multiple places to get a submodule change out, I've liked using them so far
[11:03:27] ok, nginx going first...
[11:03:32] (PS2) Ottomata: Removing nginx module in order to add it as a git submodule [operations/puppet] - https://gerrit.wikimedia.org/r/131057
[11:03:41] (CR) Ottomata: [C: 2 V: 2] Removing nginx module in order to add it as a git submodule [operations/puppet] - https://gerrit.wikimedia.org/r/131057 (owner: Ottomata)
[11:03:48] (PS2) Ottomata: Adding nginx module as git submodule [operations/puppet] - https://gerrit.wikimedia.org/r/131058
[11:03:53] (CR) Ottomata: [C: 2 V: 2] Adding nginx module as git submodule [operations/puppet] - https://gerrit.wikimedia.org/r/131058 (owner: Ottomata)
[11:04:53] awesome, puppet-merge shows me a basically unchanged diff
[11:05:03] even though the modules/nginx dir was removed and then added as a submodule
[11:06:35] cool, looks good
[11:06:40] mariadb next...
[11:09:46] springle: yt?
[11:09:52] ottomata: yep
[11:10:11] so! i noticed that you recently modified the mariadb puppet module :)
[11:10:23] i knew that was going to happen! ha, I shoulda merged these changes earlier :p
[11:10:42] it's ok, but! i just wanted to make sure you knew that I was about to make this a git submodule
[11:11:10] ok :)
[11:11:24] have you worked with git submodules before?
[11:11:29] nope
[11:11:31] k
[11:11:38] the workflow for making changes is a little different
[11:11:51] https://wikitech.wikimedia.org/wiki/Puppet_coding#git_submodules
[11:12:04] making changes will require 2 commits
[11:12:08] 1 commit to the submodule repository
[11:12:16] and a commit in ops/puppet to update the submodule to the new sha
[11:13:07] if you want to make a change to the submodule in the same clone that you usually work in
[11:13:19] you'll have to make sure to add a remote for the submodule that you can push for review to
[11:13:23] so probably the ssh:// gerrit url
[11:13:32] by default the submodules are added via the anonymous https:// url
[11:13:33] hmm ok
[11:13:47] do you use git-review or just git to push to gerrit?
[11:14:05] <_joe_> ottomata: I usually do.
[11:14:13] git-review?
[11:14:16] git review
[11:15:05] <_joe_> ottomata: so for each submodule, I should: git remote set-url, then each time I change something there I should first git-review inside the submodule
[11:15:27] <_joe_> on its own repo
[11:15:36] ja ok, so if you set the 'gerrit' remote to your ssh url
[11:15:38] on each submodule
[11:15:44] then yeah, I think git-review will work fine
[11:15:51] BUT! it gets a little weird when you switch branches
[11:15:59] if you make changes in the same clone that you usually use
[11:16:17] I actually have a second clone of ops/puppet that I use for making submodule changes
[11:16:23] <_joe_> ottomata: moving fundamental modules out of the main puppet tree will make us miserable :(
[11:16:38] sighhHHhHhhhhhh naw, you just gotta get used to it?
[11:16:49] so far I've gotten one +1 and no objections on the email thread
[11:17:00] and I told the vagrant folks here I'd do this so we could start working in vagrant with them
[11:17:20] this doesn't thrill me, but i'm prepared to chalk that up to lack of experience
[11:18:29] <_joe_> ottomata: If I need to coordinately change things in the submodule and in the main puppet repo, it's two separate commits and I get it; I was thinking of the workflow in this case
[11:18:36] <_joe_> just tell me if I'm right
[11:18:47] that is correct
[11:18:56] <_joe_> first I commit the changes to the submodule
[11:19:13] <_joe_> then I do the changes to the main repo, and change the reference to the submodule there
[11:19:16] <_joe_> right?
[11:19:23] yup, exactly
[11:20:19] most of the time that will work 100% just fine. it could get weird when you have multiple topic branches that are not rebased, or anytime a branch's submodule commit SHA does not match what the submodule is checked out at in the production branch
[11:20:21] <_joe_> ottomata: the only annoyance is that this breaks some of the tools I use, but I can modify them
[11:20:30] it will be totally fine if you are not working with that submodule
[11:20:41] <_joe_> I do usually have 5-6 topic branches
[11:20:42] git status will show a modified file when you switch branches
[11:20:58] <_joe_> but that should be ok, exactly
[11:21:06] yeah, most of the time it's cool
[11:21:06] <_joe_> as long as I ignore that
[11:21:18] <_joe_> well, that applies to git in general
[11:21:19] <_joe_> :)
[11:21:20] hahah
[11:21:21] yup
[11:29:05] ottomata: hey, reading up on submodules on puppet_coding, changes to submodules means there will be two code reviews involved correct? one for the submodule proper and one to update the commit in the main repo?
[11:29:12] that is correct
[11:29:49] ottomata: you make things complicated :)
[11:29:58] haha
[11:30:09] only for a few! I make it less complicated for others!
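The submodule behaviour discussed above (two commits per change, and `git status` flagging the submodule after a branch switch) comes down to the superproject recording one submodule commit SHA per branch. The sketch below is a toy Python model of that bookkeeping, not real git: the branch names and SHAs are invented for illustration, and the helper functions are hypothetical.

```python
# Toy model only: invented branch names and SHAs, no real git involved.
recorded = {                     # SHA each superproject branch records for the submodule
    "production": "aaa111",
    "topic/tune-mariadb": "bbb222",
}
checked_out = "aaa111"           # commit the submodule working tree currently sits on


def submodule_status(branch: str) -> str:
    """Mimic `git status`: 'modified' when the checkout differs from the recorded SHA."""
    return "clean" if recorded[branch] == checked_out else "modified"


def land_change(new_sha: str) -> None:
    """Second of the two commits: point production at the merged submodule commit."""
    global checked_out
    recorded["production"] = new_sha
    checked_out = new_sha


print(submodule_status("production"))          # clean
print(submodule_status("topic/tune-mariadb"))  # modified
land_change("bbb222")
print(submodule_status("topic/tune-mariadb"))  # clean
```

In this model the "modified" state is only a mismatch between two pointers, which matches the advice in the conversation that it can be ignored when you are not working on that submodule.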
[11:34:34] (03PS2) 10Ottomata: Removing modules/varnish in prep for adding it as a git submodule [operations/puppet] - 10https://gerrit.wikimedia.org/r/131068 [11:34:45] ok, gonna do varnish before mariadb... [11:35:27] (03CR) 10Ottomata: [C: 032 V: 032] Removing modules/varnish in prep for adding it as a git submodule [operations/puppet] - 10https://gerrit.wikimedia.org/r/131068 (owner: 10Ottomata) [11:37:01] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 6 below the confidence bounds [11:37:41] (03PS2) 10Ottomata: Adding modules/varnish as a git submodule [operations/puppet] - 10https://gerrit.wikimedia.org/r/131069 [11:37:54] (03CR) 10Ottomata: [C: 032 V: 032] Adding modules/varnish as a git submodule [operations/puppet] - 10https://gerrit.wikimedia.org/r/131069 (owner: 10Ottomata) [11:39:37] ok great, varnish looking fine [11:40:02] Hello, I would like to ask for setting $wgMaxRedirects = 3 on enwiki. Consensus is 6:0 in 48h: https://en.wikipedia.org/wiki/Wikipedia:Village_pump_%28proposals%29#Allow_some_double_redirects [11:45:21] (03PS1) 10JanZerebecki: Improve nginx TLS/SSL settings. [operations/puppet] - 10https://gerrit.wikimedia.org/r/132393 (https://bugzilla.wikimedia.org/53259) [11:46:03] (03PS2) 10Ottomata: Removing mariadb module in prep for adding it as a git submodule [operations/puppet] - 10https://gerrit.wikimedia.org/r/131060 [11:46:28] (03PS3) 10Ottomata: Removing mariadb module in prep for adding it as a git submodule [operations/puppet] - 10https://gerrit.wikimedia.org/r/131060 [11:55:53] (03PS1) 10Giuseppe Lavagetto: Make nagios hostgroups and servicegroups declarations work. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/132395 [11:57:58] (03CR) 10Ottomata: [C: 032 V: 032] Removing mariadb module in prep for adding it as a git submodule [operations/puppet] - 10https://gerrit.wikimedia.org/r/131060 (owner: 10Ottomata) [12:03:30] (03PS1) 10Hashar: beta: create a RTL english wiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/132396 (https://bugzilla.wikimedia.org/50335) [12:03:34] (03PS2) 10Ottomata: Adding modules/mariadb as a git submodule [operations/puppet] - 10https://gerrit.wikimedia.org/r/131061 [12:04:01] (03CR) 10Ottomata: [C: 032 V: 032] Adding modules/mariadb as a git submodule [operations/puppet] - 10https://gerrit.wikimedia.org/r/131061 (owner: 10Ottomata) [12:04:45] (03CR) 10Reedy: [C: 032] beta: create a RTL english wiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/132396 (https://bugzilla.wikimedia.org/50335) (owner: 10Hashar) [12:04:53] (03Merged) 10jenkins-bot: beta: create a RTL english wiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/132396 (https://bugzilla.wikimedia.org/50335) (owner: 10Hashar) [12:05:18] ok great, mariadb good too! [12:05:19] awesome [12:15:36] (03PS3) 10Giuseppe Lavagetto: Fix the use of $nagios_group. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/132187 [12:16:48] (03PS16) 10Ori.livneh: Add 'rcstream' module for broadcasting recent changes over WebSockets [operations/puppet] - 10https://gerrit.wikimedia.org/r/131040 (https://bugzilla.wikimedia.org/14045) [12:18:29] (03PS17) 10Ori.livneh: Add 'rcstream' module for broadcasting recent changes over WebSockets [operations/puppet] - 10https://gerrit.wikimedia.org/r/131040 (https://bugzilla.wikimedia.org/14045) [12:19:21] (03PS18) 10Ori.livneh: Add 'rcstream' module for broadcasting recent changes over WebSockets [operations/puppet] - 10https://gerrit.wikimedia.org/r/131040 (https://bugzilla.wikimedia.org/14045) [12:19:35] (03PS19) 10Ori.livneh: Add 'rcstream' module for broadcasting recent changes over WebSockets [operations/puppet] - 10https://gerrit.wikimedia.org/r/131040 (https://bugzilla.wikimedia.org/14045) [12:21:46] (03CR) 10Faidon Liambotis: [C: 032] "There's still a few details to figure out for a production deployment, but this looks great for an initial push. Cheers!" [operations/puppet] - 10https://gerrit.wikimedia.org/r/131040 (https://bugzilla.wikimedia.org/14045) (owner: 10Ori.livneh) [12:25:01] Why does en-rtl.wikipedia.org redirect to https://www.wikimedia.ch/ ? [12:25:18] Did redirect conf get broken again with half of our domains fallig through? [12:25:25] paravoid: [12:25:27] Reedy: [12:25:57] lol [12:32:06] stream.wmflabs.org -> https://www.wikimedia.ch/ [12:32:09] that's bad [12:32:26] all our fallbacks are going to .ch [12:32:27] wtf [12:33:34] <_joe_> Krinkle: this may be related? 
https://gerrit.wikimedia.org/r/132396 [12:33:51] Nah, that's just one of many domains that aren't configured [12:33:57] that's how I discovered it, but it's unrelated [12:34:09] <_joe_> ok [12:34:33] <_joe_> I really don't know where we do those redirects :) [12:34:40] https://github.com/wikimedia/operations-apache-config [12:34:42] I do [12:35:09] This happens once a year when someone or something screws up RedirectConf, because it has a switch-case fallthrough feature that can accidentally be triggered [12:36:14] Krinkle: hey [12:37:35] Reedy: paravoid: Nevermind, it's on our end. The venue at the hackathon has a weird DNS that falls back to that domain [12:37:55] ori-l: [12:38:01] lol [12:38:06] It's fscked [12:38:07] <_joe_> Krinkle: BRRR [12:39:32] <_joe_> Krinkle: ugh that file is scary in fact [12:39:39] It gets better [12:39:49] ever since people broke it too many times, it is now auto-generated [12:40:00] which helps, but also makes it more difficult [12:40:09] Auto-generated httpd conf [12:40:10] :) [12:40:11] Yay [12:49:03] (03PS1) 10Ori.livneh: rcstream: update README [operations/puppet] - 10https://gerrit.wikimedia.org/r/132407 [12:49:13] (03PS2) 10Ori.livneh: rcstream: update README [operations/puppet] - 10https://gerrit.wikimedia.org/r/132407 [12:49:18] (03CR) 10jenkins-bot: [V: 04-1] rcstream: update README [operations/puppet] - 10https://gerrit.wikimedia.org/r/132407 (owner: 10Ori.livneh) [12:49:37] -1 a README change? [12:49:38] boo [12:50:25] <_joe_> hey ori-l :) [12:50:33] hey _joe_! 
[12:50:47] (03CR) 10Ori.livneh: [C: 032 V: 032] rcstream: update README [operations/puppet] - 10https://gerrit.wikimedia.org/r/132407 (owner: 10Ori.livneh) [12:51:10] (03CR) 10Mark Bergsma: [C: 032] Add new codfw allocations, core router loopbacks & transfer nets [operations/dns] - 10https://gerrit.wikimedia.org/r/132188 (owner: 10Mark Bergsma) [12:51:38] (03CR) 10Mark Bergsma: [C: 032] Allocate codfw private IP space, create management network [operations/dns] - 10https://gerrit.wikimedia.org/r/132195 (owner: 10Mark Bergsma) [12:53:33] (03PS1) 10Jforrester: Enable VisualEditor as a Beta Feature on Commons [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/132409 (https://bugzilla.wikimedia.org/65067) [12:55:06] (03PS1) 10Mark Bergsma: Add cr1/cr2-codfw loopbacks [operations/dns] - 10https://gerrit.wikimedia.org/r/132411 [12:56:42] (03CR) 10Mark Bergsma: [C: 032] Add cr1/cr2-codfw loopbacks [operations/dns] - 10https://gerrit.wikimedia.org/r/132411 (owner: 10Mark Bergsma) [12:59:50] (03CR) 10Filippo Giunchedi: [C: 031] Fix the use of $nagios_group. [operations/puppet] - 10https://gerrit.wikimedia.org/r/132187 (owner: 10Giuseppe Lavagetto) [13:00:44] (03CR) 10Krinkle: "fyi: This change is relying on data['server_name'] to exist which has been proposed in I503a0134a8613dad but isn't merged yet." [operations/puppet] - 10https://gerrit.wikimedia.org/r/131040 (https://bugzilla.wikimedia.org/14045) (owner: 10Ori.livneh) [13:04:01] PROBLEM - Puppet freshness on analytics1026 is CRITICAL: Last successful Puppet run was Tue May 6 08:27:45 2014 [13:07:51] (03CR) 10Alexandros Kosiaris: [C: 04-2] "This is unfortunately not the way to solve this. 
The reason is that the service/host groups would be exported multiple times, collected/re" [operations/puppet] - 10https://gerrit.wikimedia.org/r/132395 (owner: 10Giuseppe Lavagetto) [13:07:59] sorry _joe_ [13:09:15] <_joe_> akosiaris: np [13:09:36] _joe_: the default for $group in monitor_host (that is $::nagios_group) was never set anywhere [13:09:43] context: https://gerrit.wikimedia.org/r/#/c/132187/3/manifests/nagios.pp [13:09:58] <_joe_> akosiaris: exactly [13:09:59] so I don't even know what is like that [13:10:32] <_joe_> akosiaris: I think that was an error [13:11:08] <_joe_> still, since this is a servicegroup, I'd reserve it for specific metrics on one host [13:11:37] <_joe_> or, hostgroups == servicegroups [13:12:01] <_joe_> so if we want metrics in servicegroups we need to declare that explicitly [13:12:34] <_joe_> so, the behaviour we had was correct, in a sense [13:12:38] <_joe_> and I preserved it [13:13:34] yeah I see what you mean [13:17:01] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected [13:24:24] !log reedy updated /a/common to {{Gerrit|I75a80a998}}: beta: create a RTL english wiki [13:24:32] Logged the message, Master [13:24:41] (03PS1) 10Reedy: Add en-rtl to wgExtraLanguageNames for beta [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/132415 [13:25:05] (03PS2) 10Reedy: Add en-rtl to wgExtraLanguageNames for beta [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/132415 (https://bugzilla.wikimedia.org/50335) [13:25:11] (03CR) 10Reedy: [C: 032] Add en-rtl to wgExtraLanguageNames for beta [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/132415 (https://bugzilla.wikimedia.org/50335) (owner: 10Reedy) [13:26:35] (03Merged) 10jenkins-bot: Add en-rtl to wgExtraLanguageNames for beta [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/132415 (https://bugzilla.wikimedia.org/50335) (owner: 10Reedy) [13:34:56] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Some 
minor comments like having a value for $cluster being undef, or else you might end with a hostgroup being a single _eqiad for example" (033 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/132187 (owner: 10Giuseppe Lavagetto) [13:39:56] (03CR) 10Giuseppe Lavagetto: "I will amend the conditional, for $cluster see my comments." (033 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/132187 (owner: 10Giuseppe Lavagetto) [13:40:52] !log reedy updated /a/common to {{Gerrit|I721c36406}}: Add en-rtl to wgExtraLanguageNames for beta [13:40:59] Logged the message, Master [13:41:35] _joe_: argh you are right [13:41:42] we got that line in site.pp [13:41:46] $cluster = 'misc' [13:41:51] ok then [13:42:24] <_joe_> akosiaris: I know this matter [13:42:38] <_joe_> akosiaris: also, we were both wrong on the simplification of that conditiona [13:42:42] <_joe_> *l [13:43:00] <_joe_> as soon as I watched at the code in an editor and not on gerrit, it was clear to me :) [13:43:52] (03CR) 10Giuseppe Lavagetto: Fix the use of $nagios_group. (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/132187 (owner: 10Giuseppe Lavagetto) [13:44:39] <_joe_> akosiaris: the sole fact that we are both confused by this whole dynamic lookup mess tells something about the puppet DSL, doesn't it? [13:46:05] :-( [13:54:45] (03CR) 10Alexandros Kosiaris: Fix the use of $nagios_group. (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/132187 (owner: 10Giuseppe Lavagetto) [13:58:10] (03CR) 10coren: [C: 032] "Seems legit." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/130071 (owner: 10Krinkle) [13:58:20] (03PS2) 10coren: toollabs/sql: Remove unused 'list', remove duplicate 'commons' [operations/puppet] - 10https://gerrit.wikimedia.org/r/130071 (owner: 10Krinkle) [14:01:07] (03PS4) 10coren: toollabs/sql: Add handling for connecting to "meta_p" [operations/puppet] - 10https://gerrit.wikimedia.org/r/130073 (owner: 10Krinkle) [14:01:53] (03CR) 10coren: [C: 032] "The choice of S7 is arbitrary, but pretty much any other would have done anyways." [operations/puppet] - 10https://gerrit.wikimedia.org/r/130073 (owner: 10Krinkle) [14:03:25] (03PS9) 10coren: toollabs/sql: Fix argument forwarding (-v breaks mysql) and clean up [operations/puppet] - 10https://gerrit.wikimedia.org/r/113755 (owner: 10Hoo man) [14:08:32] (03CR) 10coren: [C: 04-1] "Inline comments." (032 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/113755 (owner: 10Hoo man) [14:08:50] greg-g, Reedy: going to deploy the Flow fixes. I put them in Deployments [14:09:40] * greg-g already archived this week ;) [14:09:43] * greg-g goes to fix, maybe [14:10:01] ah, I see it, nice [14:11:08] (03CR) 10coren: "In other words, having:" [operations/puppet] - 10https://gerrit.wikimedia.org/r/113755 (owner: 10Hoo man) [14:19:26] (03CR) 10coren: "I see a limited set of redirect for a handful of very specific tools. Is there a pattern to which were picked?" [operations/apache-config] - 10https://gerrit.wikimedia.org/r/108465 (https://bugzilla.wikimedia.org/60238) (owner: 10Tim Landscheidt) [14:19:37] (03PS1) 10Giuseppe Lavagetto: Simplify and speedup the catalog comparator. [operations/software] - 10https://gerrit.wikimedia.org/r/132425 [14:19:40] (03CR) 10jenkins-bot: [V: 04-1] Simplify and speedup the catalog comparator. 
[operations/software] - 10https://gerrit.wikimedia.org/r/132425 (owner: 10Giuseppe Lavagetto) [14:19:52] (03PS1) 10Mark Bergsma: Update rancid device list [operations/puppet] - 10https://gerrit.wikimedia.org/r/132426 [14:20:43] (03CR) 10Mark Bergsma: [C: 032] Update rancid device list [operations/puppet] - 10https://gerrit.wikimedia.org/r/132426 (owner: 10Mark Bergsma) [14:24:09] (03PS2) 10Giuseppe Lavagetto: Simplify and speedup the catalog comparator. [operations/software] - 10https://gerrit.wikimedia.org/r/132425 [14:24:12] greg-g yeah, I added Friday to next week. 𝄞♪♬ 8 days a week [14:25:31] (03CR) 10coren: [C: 04-1] "There is an issue with most of those domain names where SSL will break because our combined wildcards does not cover those." (031 comment) [operations/apache-config] - 10https://gerrit.wikimedia.org/r/108465 (https://bugzilla.wikimedia.org/60238) (owner: 10Tim Landscheidt) [14:31:25] (03CR) 10Giuseppe Lavagetto: [C: 032] Simplify and speedup the catalog comparator. [operations/software] - 10https://gerrit.wikimedia.org/r/132425 (owner: 10Giuseppe Lavagetto) [14:34:02] (03PS1) 10Ori.livneh: Move rcstream server implementation to external repo [operations/puppet] - 10https://gerrit.wikimedia.org/r/132429 [14:36:22] ^ Krinkle [14:36:29] er, wrong patch [14:42:44] <_joe_> ori-l: how do you plan to deploy rcstream? with trebuchet? 
[14:44:09] _joe_: yes, the patch above makes it a trebuchet deployment target [14:44:26] already created mediawiki/services/rcstream in gerrit, per discussion w/faidon [14:44:31] paravoid: https://bugzilla.wikimedia.org/show_bug.cgi?id=65074 [14:44:33] faidon is pushing me to rewrite it in nodejs [14:44:40] just kidding about the last part [14:44:45] <_joe_> lol [14:44:46] s/nodejs/ruby/ [14:44:55] <_joe_> I vote lua [14:49:10] lua is for VCL apps [14:49:12] 'vclapps' [14:52:52] !log spage synchronized php-1.24wmf4/extensions/Flow 'Fix Flow add new topics and reply in 1.24wmf4' [14:52:59] Logged the message, Master [14:57:01] all done [14:57:14] (03PS2) 10Jgreen: Add oit admin group with jkrauska, Add access to sanger RT #7428 [operations/puppet] - 10https://gerrit.wikimedia.org/r/131863 (owner: 10Jkrauska) [14:58:08] (03CR) 10Jgreen: [C: 032 V: 031] Add oit admin group with jkrauska, Add access to sanger RT #7428 [operations/puppet] - 10https://gerrit.wikimedia.org/r/131863 (owner: 10Jkrauska) [15:03:08] !log rebuilding hewiki's cirrus index so it can pick up hebmorph too [15:03:15] Logged the message, Master [15:07:44] (03CR) 10Krinkle: Set up redirects for toolserver.org (031 comment) [operations/apache-config] - 10https://gerrit.wikimedia.org/r/108465 (https://bugzilla.wikimedia.org/60238) (owner: 10Tim Landscheidt) [15:17:01] PROBLEM - Varnishkafka Delivery Errors on cp1069 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 1289.400024 [15:17:01] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 1153.033325 [15:17:01] PROBLEM - Varnishkafka Delivery Errors on cp3022 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 1270.333374 [15:17:21] PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 1164.966675 [15:17:31] PROBLEM - Varnishkafka Delivery Errors on cp1070 is CRITICAL: 
kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 1146.033325 [15:17:52] PROBLEM - Varnishkafka Delivery Errors on cp1056 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 1190.5 [15:17:52] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 1385.43335 [15:18:01] PROBLEM - Varnishkafka Delivery Errors on cp1057 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 1227.633301 [15:20:24] <_joe_> this usually means we have a kafka problem? [15:20:51] yeah [15:21:21] I haven't followed the varnishkafka stuff as closely as I should have, I have no idea what's up with it when that starts happening [15:21:32] I think last time someone restarted some related processes somewhere [15:22:01] RECOVERY - Varnishkafka Delivery Errors on cp1069 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [15:22:01] RECOVERY - Varnishkafka Delivery Errors on cp3022 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [15:22:01] RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [15:22:21] RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [15:22:37] either that or it was due to a network flap? 
that would explain cp3xxx anyways, but maybe not 1xxx [15:22:51] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [15:23:01] RECOVERY - Varnishkafka Delivery Errors on cp1057 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [15:23:03] May 9 15:21:48 cp1057 varnishkafka[22693]: PRODUCE: Failed to produce Kafka message (seq 12890578595): No buffer space available (500000 messages in outq) [15:23:07] spam spam spam in syslog [15:23:31] RECOVERY - Varnishkafka Delivery Errors on cp1070 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [15:23:43] ah [15:23:44] May 9 15:20:49 cp1057 rsyslogd-2177: imuxsock begins to drop messages from pid 22693 due to rate-limiting [15:23:51] RECOVERY - Varnishkafka Delivery Errors on cp1056 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [15:24:24] so the syslog spam is probably even worse than it looks, but rsyslog is trying to save us :) [15:27:29] derrs! [15:31:17] (03PS1) 10Manybubbles: Do not optimize commons for new highlighter [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/132439 [15:31:55] (03CR) 10Manybubbles: "Unfortunately I think I need to act on this today." 
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/132439 (owner: 10Manybubbles) [15:32:21] PROBLEM - Varnishkafka Delivery Errors on cp1054 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 2446.866699 [15:32:23] bblack, _joe_ [15:32:50] see last comment in this RT [15:32:51] https://rt.wikimedia.org/Ticket/Display.html?id=6877 [15:32:52] also [15:33:11] PROBLEM - Varnishkafka Delivery Errors on amssq49 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 1512.06665 [15:33:16] if this happens, usually [15:33:17] https://wikitech.wikimedia.org/wiki/Analytics/Kraken/Kafka/Administration#Replica_Elections [15:33:19] will fix it [15:33:21] PROBLEM - Varnishkafka Delivery Errors on cp4002 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 2381.399902 [15:33:21] PROBLEM - Varnishkafka Delivery Errors on cp1065 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 1937.466675 [15:33:27] this severity of the problem is new though [15:33:31] since we recently added text data into kafka [15:33:42] so the amount of data it is hosting is more than before [15:34:01] PROBLEM - Varnishkafka Delivery Errors on cp4008 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 1694.533325 [15:34:11] PROBLEM - Varnishkafka Delivery Errors on cp1067 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 2193.333252 [15:34:21] RECOVERY - Varnishkafka Delivery Errors on cp1054 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [15:34:31] PROBLEM - Varnishkafka Delivery Errors on cp4004 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 2258.300049 [15:34:32] PROBLEM - Varnishkafka Delivery Errors on cp1066 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 2203.399902 [15:34:41] PROBLEM - Varnishkafka Delivery Errors on amssq62 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 2018.0 [15:34:41] PROBLEM - Varnishkafka Delivery Errors on cp1055 is CRITICAL: 
kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 2166.966553 [15:34:51] PROBLEM - Varnishkafka Delivery Errors on amssq50 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 1895.333374 [15:35:01] PROBLEM - Varnishkafka Delivery Errors on cp1069 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 1872.400024 [15:35:01] PROBLEM - Varnishkafka Delivery Errors on amssq56 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 1978.866699 [15:35:01] PROBLEM - Varnishkafka Delivery Errors on amssq53 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 2025.06665 [15:35:06] i just did  a replica election [15:35:11] PROBLEM - Varnishkafka Delivery Errors on cp4010 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 2063.866699 [15:35:21] RECOVERY - Varnishkafka Delivery Errors on cp4002 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [15:35:35] i believe they should recover shortly... [15:35:49] i am not sure why this happens [15:35:51] PROBLEM - Varnishkafka Delivery Errors on cp4001 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 2066.866699 [15:35:51] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 1862.866699 [15:36:01] PROBLEM - Varnishkafka Delivery Errors on cp1057 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 2066.399902 [15:36:01] PROBLEM - Varnishkafka Delivery Errors on cp3022 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 2171.600098 [15:36:01] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 1820.866699 [15:36:04] a broker loses a connection to zookeeper for more than 10 seconds [15:36:11] PROBLEM - Varnishkafka Delivery Errors on cp4016 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 1033.733276 [15:36:14] the connection times out [15:36:21] PROBLEM - Varnishkafka Delivery Errors on 
cp4018 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 2246.06665 [15:36:21] RECOVERY - Varnishkafka Delivery Errors on cp1065 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [15:36:31] PROBLEM - Varnishkafka Delivery Errors on cp1070 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 1884.833374 [15:36:31] RECOVERY - Varnishkafka Delivery Errors on cp4004 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [15:36:32] RECOVERY - Varnishkafka Delivery Errors on cp1066 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [15:36:41] PROBLEM - Varnishkafka Delivery Errors on amssq52 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 1939.06665 [15:36:41] PROBLEM - Varnishkafka Delivery Errors on cp4017 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 1137.699951 [15:36:49] a broker is removed from the leader for topics [15:36:51] PROBLEM - Varnishkafka Delivery Errors on cp1056 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 1918.366699 [15:36:51] RECOVERY - Varnishkafka Delivery Errors on cp4001 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [15:36:59] then the other broker is 100% relied on to serve traffic [15:37:01] RECOVERY - Varnishkafka Delivery Errors on cp1069 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [15:37:01] PROBLEM - Varnishkafka Delivery Errors on amssq55 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 574.466675 [15:37:04] previously this had been fine [15:37:11] RECOVERY - Varnishkafka Delivery Errors on cp1067 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [15:37:13] now I guess it is too much traffic for one broker, at least as is [15:37:21] PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 1570.666626 [15:37:21] PROBLEM - Varnishkafka Delivery Errors on amssq58 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 
1522.166626 [15:37:31] PROBLEM - Varnishkafka Delivery Errors on cp1060 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 234.199997 [15:37:41] RECOVERY - Varnishkafka Delivery Errors on amssq62 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [15:37:41] RECOVERY - Varnishkafka Delivery Errors on cp1055 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [15:38:01] PROBLEM - Varnishkafka Delivery Errors on cp3011 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 754.333313 [15:38:11] RECOVERY - Varnishkafka Delivery Errors on amssq49 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [15:38:51] RECOVERY - Varnishkafka Delivery Errors on amssq50 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [15:39:01] RECOVERY - Varnishkafka Delivery Errors on cp1057 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [15:39:01] RECOVERY - Varnishkafka Delivery Errors on cp4008 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [15:39:01] RECOVERY - Varnishkafka Delivery Errors on amssq53 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [15:39:01] RECOVERY - Varnishkafka Delivery Errors on amssq56 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [15:39:01] RECOVERY - Varnishkafka Delivery Errors on cp3022 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [15:39:01] RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [15:39:11] RECOVERY - Varnishkafka Delivery Errors on cp4010 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [15:39:21] RECOVERY - Varnishkafka Delivery Errors on amssq58 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [15:39:32] RECOVERY - Varnishkafka Delivery Errors on cp1070 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [15:39:41] RECOVERY - Varnishkafka Delivery Errors on cp4017 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [15:39:51] RECOVERY - 
Varnishkafka Delivery Errors on cp1056 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [15:39:51] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [15:40:11] RECOVERY - Varnishkafka Delivery Errors on cp4016 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [15:40:21] RECOVERY - Varnishkafka Delivery Errors on cp4018 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [15:40:21] RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [15:40:31] RECOVERY - Varnishkafka Delivery Errors on cp1060 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [15:40:41] RECOVERY - Varnishkafka Delivery Errors on amssq52 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [15:41:01] RECOVERY - Varnishkafka Delivery Errors on amssq55 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [15:41:01] RECOVERY - Varnishkafka Delivery Errors on cp3011 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [15:41:57] (03CR) 10Chad: [C: 032] Do not optimize commons for new highlighter [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/132439 (owner: 10Manybubbles) [15:42:06] (03Merged) 10jenkins-bot: Do not optimize commons for new highlighter [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/132439 (owner: 10Manybubbles) [15:42:58] grrrit-wm: meesa gonna sync that and unfuck file searches [15:43:09] <^d> I sync'ded [15:43:17] ^d: oh, you are my hero [15:43:26] <^d> Anytime! 
[15:43:51] !log demon synchronized wmf-config/CirrusSearch-common.php 'Do not optimize commons for new highlighter on commons' [15:43:55] Logged the message, Master [15:44:01] PROBLEM - Varnishkafka Delivery Errors on cp3011 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 350.333344 [15:45:00] !log reindexing commons to unbreak file searches on wikis not using the experimental highlighter [15:45:06] Logged the message, Master [15:47:01] RECOVERY - Varnishkafka Delivery Errors on cp3011 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [15:52:48] (03PS4) 10Alexandros Kosiaris: Backup role::mariadb::dbstore [operations/puppet] - 10https://gerrit.wikimedia.org/r/132215 [15:52:50] (03PS3) 10Alexandros Kosiaris: bacula: allow mysqldumps to be kept locally [operations/puppet] - 10https://gerrit.wikimedia.org/r/132214 [15:54:06] (03CR) 10Alexandros Kosiaris: "So in the interest of making it configurable I have added a new parameter at the role class. We normally avoid parameterized role classes " [operations/puppet] - 10https://gerrit.wikimedia.org/r/132215 (owner: 10Alexandros Kosiaris) [15:54:22] (03CR) 10Alexandros Kosiaris: Backup role::mariadb::dbstore (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/132215 (owner: 10Alexandros Kosiaris) [15:55:09] (03PS1) 10Andrew Bogott: Rearrange handling of the 'vagrant' user for labs vagrant. [operations/puppet] - 10https://gerrit.wikimedia.org/r/132446 [15:56:04] (03CR) 10ArielGlenn: "not sure what we want /etc/cluster for. can we not just /usr/local/bin/dologmsg on the host on which this runs (tin or whatever)? 
Why " [operations/puppet] - 10https://gerrit.wikimedia.org/r/132011 (owner: 10Dzahn) [15:56:47] akosiaris: [15:56:48] (03CR) 10Alexandros Kosiaris: bacula: allow mysqldumps to be kept locally (032 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/132214 (owner: 10Alexandros Kosiaris) [15:56:59] i think role classes are meant to be configured via global vars :/ [15:57:03] that's how they also work in labs [15:57:11] so, I usually just do [15:57:12] ottomata: yes [15:57:22] not meant to be parameterized [15:57:34] $myvar = $::globalvar ? { [15:57:34] undef => 'default_value', [15:57:34] default => $::globalvar, [15:57:35] } [15:57:37] basically role classes are the configuration [15:58:06] (03CR) 10ArielGlenn: [C: 031] "/srv relocation for pybal conf is good, and tin is a fine host for it." [operations/puppet] - 10https://gerrit.wikimedia.org/r/130614 (owner: 10Dzahn) [15:58:24] on the move, bbl [15:58:28] akosiaris: how much do you know about our varnish setup, e.g. frontends and backends and directors [15:58:32] k, cool, i'll getcha later [15:58:42] I am probably coming over to you anyway [16:00:24] (03CR) 10ArielGlenn: [C: 031] "we might need that trailing slash in the rsync source (prevents subdir from being created)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/130610 (owner: 10Dzahn) [16:01:01] PROBLEM - Puppet freshness on db1007 is CRITICAL: Last successful Puppet run was Fri May 9 13:00:12 2014 [16:01:13] (03CR) 10Jgreen: [V: 032] Add oit admin group with jkrauska, Add access to sanger RT #7428 [operations/puppet] - 10https://gerrit.wikimedia.org/r/131863 (owner: 10Jkrauska) [16:04:39] PROBLEM - Puppet freshness on analytics1026 is CRITICAL: Last successful Puppet run was Tue May 6 08:27:45 2014 [16:20:49] (03CR) 10Sumanah: "Checking whether we currently think this topic ought to be reopened, per http://www.gossamer-threads.com/lists/wiki/wikitech/335559?do=pos" [operations/puppet] - 10https://gerrit.wikimedia.org/r/49678 (owner: 
10Ottomata) [16:30:39] RECOVERY - Puppet freshness on db1007 is OK: puppet ran at Fri May 9 16:30:31 UTC 2014 [16:39:46] (03PS1) 10ArielGlenn: snapshot module: run everything out of /srv finally [operations/puppet] - 10https://gerrit.wikimedia.org/r/132456 [16:40:30] (03CR) 10ArielGlenn: [C: 032] snapshot module: run everything out of /srv finally [operations/puppet] - 10https://gerrit.wikimedia.org/r/132456 (owner: 10ArielGlenn) [16:40:42] slow jenkins is slow [16:42:48] (03PS2) 10Robmoen: Enable anonymous editor acquisition experiment across labs [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/132308 [16:49:58] (03PS1) 10ArielGlenn: snapshots, dumps and datasets all use datasets user instead of backup [operations/puppet] - 10https://gerrit.wikimedia.org/r/132458 [16:58:30] (03CR) 10ArielGlenn: [C: 032] snapshots, dumps and datasets all use datasets user instead of backup [operations/puppet] - 10https://gerrit.wikimedia.org/r/132458 (owner: 10ArielGlenn) [17:00:32] (03CR) 10Steinsplitter: [C: 031] Enable VisualEditor as a Beta Feature on Commons [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/132409 (https://bugzilla.wikimedia.org/65067) (owner: 10Jforrester) [17:07:29] PROBLEM - Disk space on analytics1019 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/j 115899 MB (6% inode=99%): /var/lib/hadoop/data/k 75100 MB (3% inode=99%): /var/lib/hadoop/data/d 83491 MB (4% inode=99%): /var/lib/hadoop/data/l 101058 MB (5% inode=99%): /var/lib/hadoop/data/e 90547 MB (4% inode=99%): /var/lib/hadoop/data/g 97568 MB (5% inode=99%): /var/lib/hadoop/data/h 95827 MB (5% inode=99%): /var/lib/hadoop/data/ [17:20:56] got mark? [17:21:16] anyone with juniper-fu online? [17:27:01] (03CR) 10Dzahn: "i don't really know why we cared back then, it's a question for the original author. 
i just expected it to be likely used in other places " [operations/puppet] - 10https://gerrit.wikimedia.org/r/132011 (owner: 10Dzahn) [17:28:09] manybubbles: https://gerrit.wikimedia.org/r/#/c/132097/ [17:28:24] manybubbles: Care to explain in five words what it does for me as a regular user? [17:29:27] twkozlowski: As a normal user? It should improve the article snippets. [17:29:51] twkozlowski: for me, its much faster. its much easier to customize and iterate on because its our code [17:30:35] manybubbles: Just wondering whether it's newsworthy enough to include in Tech News that's coming out on Monday [17:31:01] manybubbles: You went outside the word limit, too! :-)) [17:31:06] maybe, maybe. more in a "if you see something unexpected please let us know" kind of way then in a "look at this coolness" [17:31:24] twkozlowski: snippets better [17:31:48] mhm [17:43:34] cajoel: use port 37 of either switch [17:49:26] (03PS3) 10Reedy: Upgrade to jquery 2.1.1 and jquery-ui 1.10.4 [operations/software] - 10https://gerrit.wikimedia.org/r/125883 [17:49:59] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 1 below the confidence bounds [17:52:24] matanya and ^d: https://github.com/synhershko/elasticsearch-analysis-hebrew/issues/2 [17:52:42] I'm not sure the hebmorph is ready for us. I aborted switching hewiki to it [17:52:52] but the others still crash for some searches because of this [17:53:17] we didn't see it in beta because its some searches. [17:53:27] or maybe some words [17:53:30] I dunno [17:54:50] <^d> Why can't everyone just use English? [17:54:52] <^d> [17:55:51] Why would you care? The language wasn't invented in your country. [17:55:53] [17:55:57] :-PP [17:56:38] They've spent enough years bastardizing it [17:57:12] asks reedy to install us-en.wikipedia.org [17:57:25] wait, no. us.en :) [17:57:58] <^d> No no no. Get rid of the language code entirely. 
Just put it all at wikipedia.org [17:58:05] <^d> Or w.wiki, depending on your preference. [17:58:10] mutante: en.us if anything :) [17:58:13] as in en-gb [17:58:23] manybubbles, if you need to repro just export a few pages with their dependencies, don't copypaste manually [17:58:26] ori suggested tldr.wikipedia.org [17:58:49] <^d> As an alias for simple? [17:59:00] (that was about https://github.com/synhershko/elasticsearch-analysis-hebrew/issues/2 ) [18:12:30] twkozlowski: en.us.wiki ! :p [18:19:00] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds [18:28:29] MaxSem: that'd require having the guy set up cirrus and mediawiki, though. [18:29:12] i mean, to get a minimum repro locally [18:41:48] !log Created EducationProgram tables on ukwiki [18:41:55] Logged the message, Master [18:44:46] !log reedy updated /a/common to {{Gerrit|I25d891030}}: Do not optimize commons for new highlighter [18:44:53] Logged the message, Master [18:45:15] (03PS1) 10Reedy: Remove 1.23wmf13 through 1.23wmf20 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/132483 [18:45:38] (03CR) 10Reedy: [C: 032] Remove 1.23wmf13 through 1.23wmf20 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/132483 (owner: 10Reedy) [18:45:46] (03Merged) 10jenkins-bot: Remove 1.23wmf13 through 1.23wmf20 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/132483 (owner: 10Reedy) [18:45:59] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds [18:48:09] !log reedy synchronized docroot and w [18:48:16] Logged the message, Master [18:52:05] !log reedy updated /a/common to {{Gerrit|I0d1ea1639}}: Remove 1.23wmf13 through 1.23wmf20 [18:52:10] Logged the message, Master [18:52:48] (03PS1) 10Reedy: Enable EduacationProgram on ukwiki [operations/mediawiki-config] - 
10https://gerrit.wikimedia.org/r/132484 (https://bugzilla.wikimedia.org/64143) [18:54:20] (03CR) 10Reedy: [C: 032] Enable EduacationProgram on ukwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/132484 (https://bugzilla.wikimedia.org/64143) (owner: 10Reedy) [18:54:27] (03Merged) 10jenkins-bot: Enable EduacationProgram on ukwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/132484 (https://bugzilla.wikimedia.org/64143) (owner: 10Reedy) [18:56:00] !log reedy synchronized wmf-config/InitialiseSettings.php 'Enable EducationProgram on ukwiki' [18:56:07] Logged the message, Master [19:05:39] PROBLEM - Puppet freshness on analytics1026 is CRITICAL: Last successful Puppet run was Tue May 6 08:27:45 2014 [19:13:40] (03PS1) 10Jforrester: Re-enable Parsoid on private wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/132487 [19:21:59] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data exceeded the critical threshold [500.0] [19:22:25] (03CR) 10Reedy: [C: 032] Re-enable Parsoid on private wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/132487 (owner: 10Jforrester) [19:22:33] (03Merged) 10jenkins-bot: Re-enable Parsoid on private wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/132487 (owner: 10Jforrester) [19:24:36] (03CR) 10Phuedx: [C: 031] "This looks good to me." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/132308 (owner: 10Robmoen) [19:24:54] !log reedy synchronized wmf-config/ 'I5265c408443212536a5ed96d910caba50c22e767' [19:25:02] Logged the message, Master [19:26:40] getting 5xx spikes again... 
[19:26:41] https://graphite.wikimedia.org/render/?title=HTTP%205xx%20Responses%20-4hours&from=-4hours&width=1024&height=500&until=now&areaMode=none&hideLegend=false&lineWidth=2&lineMode=connected&target=color%28cactiStyle%28alias%28reqstats.5xx,%225xx%20resp/min%22%29%29,%22blue%22%29 [19:27:20] PHP Fatal error: Call to a member function getDBkey() on a non-object in /usr/local/apache/common-local/php-1.24wmf4/extensions/Flow/includes/Parsoid/BadImageRemover.php on line 53 [19:29:26] Possibly a lot of exceptions [19:31:52] Exception from line 1093 of /usr/local/apache/common-local/php-1.24wmf3/includes/db/Database.php: DB connection was already closed. [19:32:28] timing-wise it seems to line up with https://gerrit.wikimedia.org/r/#/c/132483/ ? [19:32:52] although I don't understand what that would have to do wih anything [19:32:58] Mmmm [19:33:03] The code shouldn't be used by anything [19:34:26] seems to have dropped off again [19:35:06] the dropoff also seems to line up with that re-enable Parsoid merge [19:37:26] <_joe_> Reedy: so we were connecting too much to the db, or we have some bug. [19:37:30] Going back up again [19:38:59] <_joe_> seems to coincide with the releases maybe? [19:39:17] 32 PHP Fatal error: Call to a member function isAnon() on a non-object in /usr/local/apache/common-local/php-1.24wmf3/extensions/MobileFrontend/includes/specials/ [19:39:17] SpecialUploads.php on line 19 [19:40:15] <_joe_> Reedy: btw, where do we collect php errors? fluorine right? [19:41:05] yup [19:41:10] /a/mw-log/fatal.log [19:42:04] <_joe_> on mw1008 (chosen at random), we have more than 1K connections to mysql that are in TIME_WAIT [19:42:35] <_joe_> I'm choosing the wrong moment to study our php stack, but... do we use mysql persistent connections? [19:43:25] <_joe_> oh ok, it's allright. 
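The TIME_WAIT count _joe_ mentions for mw1008 could be reproduced roughly as follows. This is a minimal sketch, assuming `netstat -tan`-style output and the default MySQL port 3306; the log states neither the command used nor the port. The function name `count_time_wait` and the sample addresses are hypothetical.

```python
def count_time_wait(netstat_output: str, remote_port: int = 3306) -> int:
    """Count TCP sockets in TIME_WAIT whose remote end is on remote_port."""
    count = 0
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expected row layout (assumption):
        # Proto Recv-Q Send-Q Local-Address Foreign-Address State
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue  # skips the header row and non-TCP lines
        foreign, state = fields[4], fields[5]
        port = foreign.rsplit(":", 1)[-1]
        if state == "TIME_WAIT" and port == str(remote_port):
            count += 1
    return count


# Hypothetical sample, not taken from any real host.
SAMPLE = """\
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 10.0.0.8:51234          10.0.0.50:3306          TIME_WAIT
tcp        0      0 10.0.0.8:51235          10.0.0.50:3306          ESTABLISHED
tcp        0      0 10.0.0.8:51236          10.0.0.51:3306          TIME_WAIT
tcp        0      0 10.0.0.8:80             10.0.0.99:40000         TIME_WAIT
"""

if __name__ == "__main__":
    print(count_time_wait(SAMPLE))
```

A large TIME_WAIT count on its own is what you would expect from many short-lived, non-persistent connections, which fits _joe_'s follow-up question about persistent connections.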
[19:48:49] (03PS1) 10Reedy: Memory limit to 256M [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/132494 [19:49:00] (03PS1) 10Rush: ircd-ratbox and udpmxircecho puppetized [operations/puppet] - 10https://gerrit.wikimedia.org/r/132495 [19:49:04] ooh [19:50:05] ops channel 10:43:34 [19:50:29] <_joe_> Reedy: are you sure it's harmless to raise the memory limit? [19:50:36] Usually [19:50:40] And I wasn't going to merge it now :) [19:50:45] I've slowly been increasing it for a while [19:50:56] <_joe_> let me check something [19:51:24] The appserver pool suggests it's only using about 40% of the total ram [19:51:43] uh, ignore that [19:52:39] <_joe_> the number to check is how much of its total ram it could be using [19:53:15] total 11.7. use 3. cache 6 [19:53:24] <_joe_> since we have a (sane IMHO) zero-swap policy. [19:53:53] so that's (256 - 220) * apache max threads or whatever [19:55:17] <_joe_> why 256 - 220? [19:55:26] new - old [19:55:30] <_joe_> it's 256 * max thread + apc shm :) [19:56:43] <_joe_> anyway, I'm a newbie, so ask people with more experience of the stack [19:57:11] PHP Fatal error: Call to a member function item() on a non-object in /usr/local/apache/common-local/php-1.24wmf3/includes/parser/Preprocessor_DOM.php on line 1692 [19:57:25] <_joe_> mmmh [20:00:48] the db servers don't seem horribly perturbed in ganglia [20:01:21] <_joe_> bblack: no they don't [20:01:52] <_joe_> bblack: and php threads do not stop suddenly waiting for read from the db [20:04:22] Not sure DB connection was already closed is worth worrying about [20:05:06] <_joe_> Reedy: no, that is usually the low-hanging fruit [20:05:28] <_joe_> and we would have many many more errors at this point [20:05:46] <_joe_> but since the release we have almost-regular spikes of errors [20:05:54] I guess you've tailed fatal.log and exception.log?
;) [20:06:03] <_joe_> let me see if there's a pattern on the frontend [20:06:09] <_joe_> Reedy: yes I am :) [20:06:49] (03CR) 10Robmoen: "Yes" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/132308 (owner: 10Robmoen) [20:06:58] dberror logs are quiet anyway [20:07:01] <_joe_> so it's a lot of memory limit reached [20:07:26] <_joe_> in a preg_replace_callback [20:07:35] We have some big articles [20:07:38] These errors tend to come and go [20:07:46] But why I made the commit to bump the memory limit [20:07:56] <_joe_> the evil function that's so cool when you code and so evil when it runs in production [20:07:59] We can get OOMs all over the place [20:08:06] but then we call it 150000000 times [20:08:15] yeah even that specific one has been in the logs way longer than this spiky stuff [20:09:21] hmm, where's that fatal vs exception graph [20:09:31] <_joe_> bblack: I'm trying to find patterns, since it's the first time I see those logs... [20:10:04] yeah [20:10:19] * Reedy finds his ldap password [20:10:21] as far as icinga goes, only odd alerts are on analytics [20:10:33] Reedy: in the fatals log? :) [20:12:12] <_joe_> Reedy: 2014-05-09 20:03:18 mw1131 commonswiki: [51705b97] /w/api.php Exception from line 1162 of /usr/local/apache/common-local/php-1.24wmf3/includes/db/Database.php: A database error has occurred. Did you forget to run maintenance/update.php after upgrading? See: https://www.mediawiki.org/wiki/Manual:Upgrading#Run_the_update_script [20:12:21] <_joe_> this is a false positive as well?
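The memory-limit arithmetic discussed between 19:53 and 19:55 can be written out as a small sketch. The two formulas come from the log: the extra worst-case use is (new - old) * max threads, and the total worst case is limit * max threads + APC shared memory. Everything else here is an assumption for illustration: the worker count (40), the APC shm size (128 MB), and the reading of "total 11.7" as gigabytes are not stated in the log, and the function names are hypothetical.

```python
def added_worst_case_mb(old_limit_mb: int, new_limit_mb: int, max_workers: int) -> int:
    """Extra worst-case memory from raising the per-request limit: (new - old) * workers."""
    return (new_limit_mb - old_limit_mb) * max_workers


def total_worst_case_mb(limit_mb: int, max_workers: int, apc_shm_mb: int) -> int:
    """Total worst-case memory: limit * workers + APC shared memory."""
    return limit_mb * max_workers + apc_shm_mb


def fits_without_swap(worst_case_mb: float, total_ram_gb: float) -> bool:
    """With a zero-swap policy, the worst case has to fit in physical RAM."""
    return worst_case_mb <= total_ram_gb * 1024


if __name__ == "__main__":
    # Hypothetical values: 40 workers and 128 MB of APC shm are NOT from the log.
    workers, apc_shm = 40, 128
    extra = added_worst_case_mb(220, 256, workers)
    total = total_worst_case_mb(256, workers, apc_shm)
    print(extra, total, fits_without_swap(total, 11.7))
```

The difference between the two formulas is the point of _joe_'s correction: the delta shows how much the change adds, but with no swap the figure that decides whether a host can run out of memory is the total.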
[20:12:33] <_joe_> I mean the error message :) [20:13:03] It's usually the cae [20:13:07] *case [20:13:12] let me look at the error [20:13:34] deadlock [20:13:50] so yeah, that error message is really unrelated [20:14:19] <_joe_> thanks :) [20:15:12] oh what's the damn logstash frontend [20:15:54] 2nd time lucky [20:18:47] <_joe_> bblack: I was looking at error on oxygen, and it does not correspond with the graph [20:20:44] <_joe_> I have a total of 1000 5xx in one hour between 19 and 20 UTC [20:20:57] Do we still have 5xx logs that include the request url? [20:21:22] <_joe_> yes [20:21:49] <_joe_> but as I just said, they don't add up with the numbers on graphite [20:25:16] <_joe_> the highest number per minute I see in the last three hours is 197 at 18:31 [20:25:39] <_joe_> mmmh [20:29:59] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% data above the threshold [250.0] [20:42:05] (03PS1) 10Dzahn: initial commit for a phabricator module (WIP) [operations/puppet] - 10https://gerrit.wikimedia.org/r/132505 [20:43:54] (03PS2) 10Dzahn: initial commit for a phabricator module (WIP) [operations/puppet] - 10https://gerrit.wikimedia.org/r/132505 [20:45:06] (03CR) 10jenkins-bot: [V: 04-1] initial commit for a phabricator module (WIP) [operations/puppet] - 10https://gerrit.wikimedia.org/r/132505 (owner: 10Dzahn) [20:48:04] (03PS3) 10Dzahn: initial commit for a phabricator module (WIP) [operations/puppet] - 10https://gerrit.wikimedia.org/r/132505 [20:52:33] (03PS4) 10Dzahn: initial commit for a phabricator module (WIP) [operations/puppet] - 10https://gerrit.wikimedia.org/r/132505 [21:02:16] (03CR) 10Dzahn: [C: 04-1] "i think you meant "ensure => 'link'". 
i'll merge it if you can fix that and the path conflict" (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/119438 (https://bugzilla.wikimedia.org/62296) (owner: 10Tim Landscheidt) [21:03:27] (03PS3) 10Dzahn: toollabs: Add expect to exec nodes [operations/puppet] - 10https://gerrit.wikimedia.org/r/125201 (owner: 10Yuvipanda) [21:05:35] (03CR) 10Dzahn: [C: 032] toollabs: Add expect to exec nodes [operations/puppet] - 10https://gerrit.wikimedia.org/r/125201 (owner: 10Yuvipanda) [21:07:37] <^d> mutante: Actually I didn't use that script. [21:07:39] <^d> I did it by hand :) [21:08:55] ^d: heh :p ok [21:09:10] that patch is to start the discussion _how_ we're going to install it [21:09:31] see also that part about pcntl [21:10:26] <^d> pcntl we should already have with our standard php build. [21:11:24] ah! cool [21:11:28] see, it worked :) [21:11:35] that's something already [21:11:45] <^d> get_loaded_extensions() returns [22]=> string(5) "pcntl" on prod [21:11:50] :) [21:15:25] Waaakeee uuuupp Meeetaaa [21:15:54] * twkozlowski can't save stuff with the Translate extension on Meta [21:16:05] and translate memory doesn't work, either [21:17:59] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected [21:19:49] (03CR) 10Dzahn: initial commit for a phabricator module (WIP) (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/132505 (owner: 10Dzahn) [21:29:11] RoanKattouw_away: about? [21:32:30] anyone here who knows about the edit updates that get blasted to ircd via udp? [21:33:08] <^d> It's ancient magic and we're trying to get rid of it. Sup? 
[21:34:20] (03CR) 10Rush: "fyi I spoke with robla yesterday, seems mukunda (a new person coming in as a release engineer) is lined up to 'puppetize phabricator' as h" [operations/puppet] - 10https://gerrit.wikimedia.org/r/132505 (owner: 10Dzahn) [21:55:23] (03PS1) 10Rush: gitreview [operations/debs/python-statsd] - 10https://gerrit.wikimedia.org/r/132521 [21:55:47] (03CR) 10Rush: [C: 032 V: 032] gitreview [operations/debs/python-statsd] - 10https://gerrit.wikimedia.org/r/132521 (owner: 10Rush) [22:00:24] (03PS2) 10Rush: initial debianization [operations/debs/python-statsd] - 10https://gerrit.wikimedia.org/r/131449 (owner: 10Gage) [22:06:39] PROBLEM - Puppet freshness on analytics1026 is CRITICAL: Last successful Puppet run was Tue May 6 08:27:45 2014 [22:39:27] (03CR) 10Swalling: [C: 031] "We want to test this experiment on Beta Labs before enabling on enwiki, thanks!" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/132308 (owner: 10Robmoen) [23:32:03] chasemp: Did you have a question? [23:32:09] ^d: It's reliable! [23:32:33] Also, we're speaking over IRC right now. So... glass houses + stones, y'know. [23:33:22] I think I got what I need via email but I will circle back if not thanks [23:33:46] <^d> I lost too much scrollback when I detached earlier.