[00:00:50] !log unattended rolling restart of Elasticsearch cluster is going just fine - adding the 30 minute sleep between servers and turning down the replication rate makes it pretty boring. [00:00:55] Logged the message, Master [00:05:20] !log if anyone is reading the SAL for fun or sees an error in Elasticsearch cluster in the next 24 hours - we're performing an elasticsearch upgrade. We've set it up this time so it's super slow and boring. So boring I'm going to sleep through it. If you see more than transient complaining from icinga about elasticsearch you can call me/have someone with access to the contact list call me. I expect icinga to complain about a [00:05:20] single node going down but I expect the cluster to stay "yellow" during the process- no alerts. [00:05:27] Logged the message, Master [00:06:08] * bd808 was crushed under a wall of SAL announcements [00:07:10] manybubbles: I guess this means I need to plan a logstash cluster update soon [00:07:42] !log bd808 needs to plan a logstash upgrade soon - let it be logged [00:07:48] Logged the message, Master [00:08:32] !log (manybubbles contd.) …a single node going down but I expect the cluster to stay "yellow" during the process- no alerts. 
[00:08:38] Logged the message, Master [00:36:16] PROBLEM - ElasticSearch health check on elastic1006 is CRITICAL: CRITICAL - Could not connect to server 10.64.0.113 [00:49:56] PROBLEM - Puppet freshness on osmium is CRITICAL: Last successful Puppet run was Wed 20 Aug 2014 22:49:01 UTC [01:07:37] PROBLEM - Disk space on elastic1005 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 20086 MB (3% inode=99%): [01:59:56] PROBLEM - Puppet freshness on db1010 is CRITICAL: Last successful Puppet run was Wed 20 Aug 2014 23:59:40 UTC [02:00:16] RECOVERY - Puppet freshness on db1010 is OK: puppet ran at Thu Aug 21 02:00:10 UTC 2014 [02:10:17] PROBLEM - mailman_qrunner on sodium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 38 (list), regex args /mailman/bin/qrunner [02:10:47] PROBLEM - mailman_ctl on sodium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 38 (list), regex args /mailman/bin/mailmanctl [02:11:46] RECOVERY - mailman_ctl on sodium is OK: PROCS OK: 1 process with UID = 38 (list), regex args /mailman/bin/mailmanctl [02:12:16] RECOVERY - mailman_qrunner on sodium is OK: PROCS OK: 8 processes with UID = 38 (list), regex args /mailman/bin/qrunner [02:20:30] !log LocalisationUpdate completed (1.24wmf16) at 2014-08-21 02:19:26+00:00 [02:20:39] Logged the message, Master [02:22:40] (03PS1) 10Yurik: Zero: to unified 401-01, new unified 631-02 [operations/puppet] - 10https://gerrit.wikimedia.org/r/155492 [02:23:07] bblack, when you have a sec - https://gerrit.wikimedia.org/r/155492 [02:28:07] enwiki db is locked [02:29:50] fixed now [02:31:17] PROBLEM - mailman_qrunner on sodium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 38 (list), regex args /mailman/bin/qrunner [02:31:47] PROBLEM - mailman_ctl on sodium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 38 (list), regex args /mailman/bin/mailmanctl [02:32:31] (03CR) 10BBlack: [C: 032] Zero: to unified 401-01, new unified 631-02 [operations/puppet] - 10https://gerrit.wikimedia.org/r/155492 
(owner: 10Yurik) [02:33:46] RECOVERY - mailman_ctl on sodium is OK: PROCS OK: 1 process with UID = 38 (list), regex args /mailman/bin/mailmanctl [02:34:17] RECOVERY - mailman_qrunner on sodium is OK: PROCS OK: 8 processes with UID = 38 (list), regex args /mailman/bin/qrunner [02:35:25] thx bblack ! [02:35:59] !log LocalisationUpdate completed (1.24wmf17) at 2014-08-21 02:34:56+00:00 [02:36:04] Logged the message, Master [02:37:01] part of the problem above (sodium) seems to be that the amanda backup job kills i/o perf [02:37:04] maybe we could ionice that? [02:38:17] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:40:07] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.003 second response time [02:40:37] PROBLEM - Number of mediawiki jobs running on tungsten is CRITICAL: CRITICAL: Anomaly detected: 35 data above and 0 below the confidence bounds [02:40:37] PROBLEM - Number of mediawiki jobs queued on tungsten is CRITICAL: CRITICAL: Anomaly detected: 35 data above and 0 below the confidence bounds [02:50:56] PROBLEM - Puppet freshness on osmium is CRITICAL: Last successful Puppet run was Wed 20 Aug 2014 22:49:01 UTC [02:55:16] RECOVERY - ElasticSearch health check on elastic1006 is OK: OK - elasticsearch (production-search-eqiad) is running. 
status: green: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 2017: active_shards: 6050: relocating_shards: 3: initializing_shards: 0: unassigned_shards: 0 [03:21:53] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Aug 21 03:20:47 UTC 2014 (duration 20m 46s) [03:21:59] Logged the message, Master [04:00:56] PROBLEM - Puppet freshness on db1010 is CRITICAL: Last successful Puppet run was Thu 21 Aug 2014 02:00:10 UTC [04:51:56] PROBLEM - Puppet freshness on osmium is CRITICAL: Last successful Puppet run was Wed 20 Aug 2014 22:49:01 UTC [05:19:37] RECOVERY - Puppet freshness on db1010 is OK: puppet ran at Thu Aug 21 05:19:32 UTC 2014 [06:28:07] PROBLEM - puppet last run on mw1009 is CRITICAL: CRITICAL: Puppet has 2 failures [06:28:37] PROBLEM - puppet last run on iron is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:07] PROBLEM - puppet last run on ms-fe1004 is CRITICAL: CRITICAL: Puppet has 2 failures [06:29:26] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 3 failures [06:29:36] PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:36] PROBLEM - puppet last run on db1003 is CRITICAL: CRITICAL: Puppet has 3 failures [06:45:06] RECOVERY - puppet last run on ms-fe1004 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [06:45:07] RECOVERY - puppet last run on mw1009 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [06:45:26] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [06:45:37] RECOVERY - puppet last run on iron is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [06:46:36] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [06:47:07] PROBLEM - puppet last run on db1018 is CRITICAL: CRITICAL: Puppet has 1 failures [06:48:36] 
RECOVERY - puppet last run on db1003 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [06:52:56] PROBLEM - Puppet freshness on osmium is CRITICAL: Last successful Puppet run was Wed 20 Aug 2014 22:49:01 UTC [06:52:56] PROBLEM - Disk space on ms1004 is CRITICAL: DISK CRITICAL - free space: / 685 MB (3% inode=94%): /var/lib/ureadahead/debugfs 685 MB (3% inode=94%): [07:05:07] RECOVERY - puppet last run on db1018 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [08:15:36] good morning [08:41:26] PROBLEM - puppet last run on amslvs4 is CRITICAL: CRITICAL: Epic puppet fail [08:43:46] PROBLEM - puppet last run on cp3020 is CRITICAL: CRITICAL: Epic puppet fail [08:53:56] PROBLEM - Puppet freshness on osmium is CRITICAL: Last successful Puppet run was Wed 20 Aug 2014 22:49:01 UTC [09:00:26] RECOVERY - puppet last run on amslvs4 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [09:02:46] RECOVERY - puppet last run on cp3020 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [09:54:54] (03CR) 10Filippo Giunchedi: "yep that's correct!" [operations/puppet] - 10https://gerrit.wikimedia.org/r/155244 (owner: 10Filippo Giunchedi) [10:00:51] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] swift-drive-audit: import icehouse version [operations/puppet] - 10https://gerrit.wikimedia.org/r/155244 (owner: 10Filippo Giunchedi) [10:01:05] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] drive-audit: clear up exit status [operations/puppet] - 10https://gerrit.wikimedia.org/r/155245 (owner: 10Filippo Giunchedi) [10:02:46] !log Jenkins installed plugin [https://wiki.jenkins-ci.org/display/JENKINS/Throttle+Concurrent+Builds+Plugin Throttle Concurrent Builds]. 
[10:02:51] Logged the message, Master [10:13:34] lunch break [10:13:47] if Zuul / Jenkins is broken, just restart Jenkins on gallium.wikimedia.org [10:13:50] but it should be fine [10:19:56] (03CR) 10Filippo Giunchedi: "btw this is now https://bugs.launchpad.net/swift/+bug/1359664" [operations/puppet] - 10https://gerrit.wikimedia.org/r/155245 (owner: 10Filippo Giunchedi) [10:20:47] ack hexmode [10:20:50] nope, hashar [10:29:09] just curious.. are mediawiki devs and/or wmf sysadmins able to view an individual user's notification logs (Special:Notifications)? [10:32:10] (03PS1) 10Filippo Giunchedi: filippo: use bashrc, fix prompt [operations/puppet] - 10https://gerrit.wikimedia.org/r/155518 [10:32:23] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] filippo: use bashrc, fix prompt [operations/puppet] - 10https://gerrit.wikimedia.org/r/155518 (owner: 10Filippo Giunchedi) [10:45:04] (03PS1) 10Filippo Giunchedi: swift: raise container availability thresholds [operations/puppet] - 10https://gerrit.wikimedia.org/r/155522 [10:46:42] (03CR) 10Nikerabbit: Crreot another file reference from I11b85c87a (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/154999 (owner: 10Ori.livneh) [10:51:25] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] swift: raise container availability thresholds [operations/puppet] - 10https://gerrit.wikimedia.org/r/155522 (owner: 10Filippo Giunchedi) [10:54:56] PROBLEM - Puppet freshness on osmium is CRITICAL: Last successful Puppet run was Wed 20 Aug 2014 22:49:01 UTC [11:06:07] (03CR) 10Matanya: [C: 031] Add ferm::service rule for zookeeper admin port [operations/puppet] - 10https://gerrit.wikimedia.org/r/153801 (owner: 10Ottomata) [11:10:29] (03CR) 10Matanya: [C: 04-1] "Please remove templates/misc/email-blog-pageviews.erb as well." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/153117 (owner: 10Dzahn) [11:10:58] (03CR) 10Matanya: [C: 031] puppetmaster Apache template - retab [operations/puppet] - 10https://gerrit.wikimedia.org/r/153987 (owner: 10Dzahn) [11:11:11] (03PS2) 10Mark Bergsma: Allocate codfw public/private subnets for rows A-D [operations/dns] - 10https://gerrit.wikimedia.org/r/155266 [11:11:56] (03CR) 10Mark Bergsma: [C: 032] Allocate codfw public/private subnets for rows A-D [operations/dns] - 10https://gerrit.wikimedia.org/r/155266 (owner: 10Mark Bergsma) [11:15:37] (03PS1) 10Mark Bergsma: Add management IPs for asw-[a-d]-codfw [operations/dns] - 10https://gerrit.wikimedia.org/r/155524 [11:16:11] (03CR) 10Mark Bergsma: [C: 032] Add management IPs for asw-[a-d]-codfw [operations/dns] - 10https://gerrit.wikimedia.org/r/155524 (owner: 10Mark Bergsma) [11:16:26] (03CR) 10Matanya: [C: 031] delete blog SSL certificates [operations/puppet] - 10https://gerrit.wikimedia.org/r/153228 (owner: 10Dzahn) [11:16:52] godog: hi, do you plan to take another look at https://bugzilla.wikimedia.org/show_bug.cgi?id=69760 at some point? [11:17:20] (03CR) 10Matanya: [C: 031] salt - minion.erb - fix compiler warnings [operations/puppet] - 10https://gerrit.wikimedia.org/r/154347 (owner: 10Dzahn) [11:20:36] godog/andre__: Can GWT cause such problems? [11:20:43] (03CR) 10Matanya: [C: 031] sudoers.erb - deprecated variable access [operations/puppet] - 10https://gerrit.wikimedia.org/r/154372 (owner: 10Dzahn) [11:21:31] (03CR) 10Matanya: [C: 031] "just a comment, but oh well." [operations/puppet] - 10https://gerrit.wikimedia.org/r/154373 (owner: 10Dzahn) [11:30:15] PROBLEM - nutcracker port on mw1178 is CRITICAL: CRITICAL - Socket timeout after 2 seconds [11:37:44] PROBLEM - check if dhclient is running on mw1178 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:37:45] PROBLEM - nutcracker process on mw1178 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[11:38:05] PROBLEM - SSH on mw1178 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:38:35] PROBLEM - puppet last run on mw1178 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:38:35] PROBLEM - RAID on mw1178 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:39:44] PROBLEM - Apache HTTP on mw1178 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:42:05] PROBLEM - NTP on mw1178 is CRITICAL: NTP CRITICAL: No response from NTP server [11:48:24] PROBLEM - Host mw1178 is DOWN: PING CRITICAL - Packet loss = 100% [11:58:34] RECOVERY - puppet last run on ms-be1005 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [12:53:18] (03PS1) 10Mark Bergsma: Typo [operations/dns] - 10https://gerrit.wikimedia.org/r/155531 [12:53:43] (03CR) 10Mark Bergsma: [C: 032] Typo [operations/dns] - 10https://gerrit.wikimedia.org/r/155531 (owner: 10Mark Bergsma) [12:55:24] PROBLEM - Puppet freshness on osmium is CRITICAL: Last successful Puppet run was Wed 20 Aug 2014 22:49:01 UTC [13:07:10] andre__: AFAICT it is working as expected after resolving the load issue, there's indeed still followup on how to detect if it happens again, I've pushed some nagios changes today to the swift thresholds [13:08:16] godog: ah, thanks! 
I'm tempted to copy that line to the bug report for the sake of communicating "yes, this is still on the screen", if you don't plan to comment there yourself :) [13:09:21] andre__: yep let me update the bug [13:09:27] thanks [13:37:45] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.0633333333333 [13:51:44] PROBLEM - puppet last run on mw1151 is CRITICAL: CRITICAL: Puppet has 1 failures [14:01:30] looks like the Swift bug is back :) [14:03:04] PROBLEM - puppet last run on mw1072 is CRITICAL: CRITICAL: Epic puppet fail [14:04:04] PROBLEM - Disk space on ms1004 is CRITICAL: DISK CRITICAL - free space: / 712 MB (3% inode=94%): /var/lib/ureadahead/debugfs 712 MB (3% inode=94%): [14:08:03] PierreSelim: do you have deleted via ajax quick delete - (at) api error msg? [14:08:34] yes [14:08:44] RECOVERY - puppet last run on mw1151 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [14:08:46] I can try directly but I guess it will be the same [14:10:03] yes the same. [14:12:53] godog: it is not a new problem, one year ago there was teh same but (but not affected a lot of files. 
i can't find the bug, was closed as resolved :/ [14:13:28] (03PS1) 10Filippo Giunchedi: elasticsearch: increase ganglia timeout [operations/puppet] - 10https://gerrit.wikimedia.org/r/155541 [14:14:06] ^d: ^ [14:14:39] (03CR) 10Chad: [C: 031] "lgtm, especially when we want to know what graphs actually say :)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/155541 (owner: 10Filippo Giunchedi) [14:15:05] Steinsplitter: ack, sorry I'm in the middle of something now, anyways then it is a different bug than what was causing the problems yesterday [14:15:16] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] elasticsearch: increase ganglia timeout [operations/puppet] - 10https://gerrit.wikimedia.org/r/155541 (owner: 10Filippo Giunchedi) [14:15:56] for sure it's not new [14:17:44] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0 [14:22:04] RECOVERY - puppet last run on mw1072 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [14:24:03] (03PS1) 10Ottomata: kafkatee should consume from all partitions [operations/puppet] - 10https://gerrit.wikimedia.org/r/155543 [14:24:14] (03PS2) 10Ottomata: kafkatee should consume from all partitions [operations/puppet] - 10https://gerrit.wikimedia.org/r/155543 [14:24:27] (03CR) 10Ottomata: [C: 032 V: 032] kafkatee should consume from all partitions [operations/puppet] - 10https://gerrit.wikimedia.org/r/155543 (owner: 10Ottomata) [14:25:15] PROBLEM - Unmerged changes on repository puppet on virt0 is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [14:30:13] (03PS16) 10Andrew Bogott: mailman: use a new default theme (prettier mailman) [operations/puppet] - 10https://gerrit.wikimedia.org/r/154964 (https://bugzilla.wikimedia.org/61283) (owner: 10John F. 
Lewis) [14:30:41] (03PS1) 10Hashar: contint: phpcs mw standard on labs slaves [operations/puppet] - 10https://gerrit.wikimedia.org/r/155544 (https://bugzilla.wikimedia.org/64858) [14:31:38] (03CR) 10Andrew Bogott: [C: 032] mailman: use a new default theme (prettier mailman) [operations/puppet] - 10https://gerrit.wikimedia.org/r/154964 (https://bugzilla.wikimedia.org/61283) (owner: 10John F. Lewis) [14:35:11] (03CR) 10Hashar: [C: 031] "Cherry picked on contint puppetmaster." [operations/puppet] - 10https://gerrit.wikimedia.org/r/155544 (https://bugzilla.wikimedia.org/64858) (owner: 10Hashar) [14:35:31] andrewbogott: thanks for the merge, how long do you think puppet would take to run? [14:36:08] Thehelpfulone: everything should be updated after 30 mins [14:36:38] !log Jenkins: updating mediawiki code sniffer repo bf82117..bc4e590 [14:36:45] Logged the message, Master [14:38:16] andrewbogott: Thanks! :) [14:50:14] PROBLEM - mailman list info on sodium is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string About Wikimedia-l not found on https://lists.wikimedia.org:443/mailman/listinfo/wikimedia-l - 15083 bytes in 0.107 second response time [14:53:11] (03PS1) 10Ottomata: Add cron job to drop old data in HDFS [operations/puppet] - 10https://gerrit.wikimedia.org/r/155549 [14:53:28] (03PS2) 10Ottomata: Add cron job to drop old data in HDFS [operations/puppet] - 10https://gerrit.wikimedia.org/r/155549 [14:53:55] !log Jenkins: updated PHP CodeSniffer MediaWiki standard on all slaves. [14:54:01] Logged the message, Master [14:56:24] PROBLEM - Puppet freshness on osmium is CRITICAL: Last successful Puppet run was Wed 20 Aug 2014 22:49:01 UTC [14:58:20] hm, mailman HTTP cirital? [14:58:24] Thehelpfulone ^^ [14:58:38] hmm? [14:58:47] I'm looking at icinga now [15:03:05] RECOVERY - Unmerged changes on repository puppet on virt0 is OK: No changes to merge. 
[15:03:07] andrewbogott: check Sodium's logs please (or anyone which access) [15:03:41] Mailman hit a bug and I don't understand why with regards to the patch deployed above [15:06:18] Still no jouncing. But nothing requested for SWAT this morning. [15:08:00] JohnLewis: someone merged a change to prettify mailman [15:08:12] seems like it broke the check [15:08:13] matanya: I know; it's my patch [15:08:24] but the site is ok [15:08:26] Mailman admin interfaces are no longer working [15:08:55] Thehelpfulone: can you log in ? [15:09:23] where I'm already logged in that's ok [15:09:24] https://lists.wikimedia.org/mailman/admin/wikimedia-l [15:09:28] Bug in Mailman version 2.1.13 [15:09:28] We're sorry, we hit a bug! [15:09:29] Please inform the webmaster for this site of this problem. Printing of traceback and other system information has been explicitly inhibited, but the webmaster can find this information in the Mailman error logs. [15:09:31] :S [15:09:52] * JohnLewis gets to test that patch on labs [15:11:32] Thehelpfulone: let's hope puppet works on labs this time :/ [15:13:41] JohnLewis: I'm back, looking... [15:13:58] andrewbogott: i see why the alert is raised [15:14:07] no "About Wikimedia-l" test on the page [15:14:21] i can adjust the check, would that be useful ? [15:14:41] matanya: not really because that won't fix the admin interface :) [15:14:53] but it will fix the nagios alert [15:15:02] matanya: feel free then I guess [15:15:09] Except the nagios alert is correct, isn't it? [15:15:13] Or are they somehow unrelated? [15:15:15] it is not [15:15:18] JohnLewis: what should I look for on sodium? [15:15:24] Ah, ok, then feel free to fix :) [15:15:33] the nagios alert checks to see the web ui is there [15:15:53] by searching for a string on the web page [15:15:59] JohnLewis: IOError: [Errno 2] No template file found: 'verify.txt' [15:15:59] which was removed [15:16:12] Hm? 
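(Editor's sketch.) The check discussed above decides OK or CRITICAL by looking for a fixed string in the page body, which is why it fired while the server was still answering HTTP 200. A minimal Python sketch of that logic follows; `check_page_body`, the state strings, and the two sample pages are assumptions made for illustration, not the actual Icinga plugin or the real listinfo markup.

```python
def check_page_body(status_code, body, expected):
    """Return (state, message) in the style of a monitoring check.

    Hypothetical helper for illustration only; not the real plugin.
    """
    if status_code != 200:
        return "CRITICAL", f"unexpected HTTP status {status_code}"
    if expected not in body:
        # The server answered fine, but the marker string is gone.
        return "CRITICAL", f"HTTP 200 OK - string {expected!r} not found"
    return "OK", f"HTTP 200 OK - {len(body)} characters"


# Made-up page bodies: one with the marker string, one without it.
old_page = "<h1>About Wikimedia-l</h1>"
new_page = "<h1>Wikimedia-l</h1>"

print(check_page_body(200, old_page, "About Wikimedia-l"))
print(check_page_body(200, new_page, "About Wikimedia-l"))
```

A check like this cannot tell a deliberate wording change from a broken page, so the marker string has to be kept in step with the theme.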
[15:16:44] andrewbogott: can you grep that in the templates directory and see where it exists? [15:16:59] I think it might be an issue it doesn't exist in some directories and I think I might know which ones [15:17:14] JohnLewis: which templates dir? [15:17:32] And, ok, just so we're on the same page... [15:17:33] /var/lib/mailman/ and there should be a 'templates' directory [15:17:40] grep there :) [15:17:44] we don't want wikimedia-l to be removed, right? so the fact that it's missing, that's the problem? [15:18:11] root@sodium:/var/lib/mailman/templates# grep -ir wikimedia-l * [15:18:12] root@sodium:/var/lib/mailman/templates# [15:18:33] (03PS1) 10Matanya: mailman: search for a new string on new web ui [operations/puppet] - 10https://gerrit.wikimedia.org/r/155563 [15:18:35] oh, sorry, verify.txt [15:18:58] no reference to that either. [15:20:03] andrewbogott: that patch should fix the nagios alert [15:20:19] andrewbogott: no where at all? [15:20:24] * JohnLewis checks old labs instance [15:20:42] root@sodium:/var/lib/mailman/templates# grep -ir verify.txt * [15:20:43] root@sodium:/var/lib/mailman/templates# find . -name verify.txt [15:20:43] root@sodium:/var/lib/mailman/templates# [15:23:18] verify.txt does not exist on labs either [15:23:48] Shall I just revert the theming patch? Or do you have some idea as to what's happening? [15:24:50] JohnLewis: the failures are all over. Like, most recently No template file found: 'postheld.txt' [15:25:06] No template file found: 'article.html' [15:25:09] I think I know what is wrong [15:25:10] etc [15:25:55] yeah I know what is wrong [15:26:08] andrewbogott: is sodium backed up anywhere? [15:26:15] ... [15:26:29] I would prefer you not ask things like that! [15:26:29] actually; any mailman thing will do actually [15:26:37] I don't know. [15:26:45] Um… mchenry used to host mailman I think. [15:26:52] I dont' know if it still exists. 
[15:27:08] this is why you guys need to check puppet stuff [15:27:38] I think the bit I added to puppet overwrites the whole directory removing all files.. [15:28:29] uh ohhh, what's happenin? [15:28:56] recurse remote, you mean? [15:28:56] andrewbogott: look at the puppet part I wrote and see if it does that, I don't think it did? [15:30:07] andrewbogott: listinfo.html needed to be added to the directories overwriting the listinfo, what happened here is it added the file and deleted *everything* else in the directory [15:30:36] docs say "Allows copying of a few files into a directory containing many unmanaged files without scanning all the local files. " which sounds like what you wanted. [15:30:40] Not that it seems to have done that... [15:30:51] Yes [15:30:58] and it didn't apparently. Hm. [15:31:16] At least we've identified the issue and what caused it. Now we need to resolve it manually and automatically. [15:31:59] lemme verify that that's happening... [15:32:05] please [15:32:16] JohnLewis: this is why you guys need to check puppet stuff <-- please don't blame others for your mistake :) [15:32:43] andrewbogott: i think this is because templates was previously a symlink [15:32:46] i think we're ok [15:32:56] ottomata: oh? What was it a link to? 
[15:32:58] matanya: :) [15:33:05] /etc/mailman [15:33:10] ls /etc/mailman/en [15:33:16] so, the puppet change did what it was supposed to do [15:33:25] but it changed /var/lib/mailman/templates to a directory [15:33:30] and then recurse copied the stuff from puppet there [15:33:36] all the old templates are still in /etc/mailman [15:33:47] ok, so a proper fix is to point that patch at /etc/mailman instead of /var/lib/mailman/templates [15:33:50] matanya: wasn't my mistake though; something was undocumented that I didn't know about or missed ;) [15:34:01] And switch things back to a symlink, presumably by hand [15:34:10] or via puppet, wouldn't hurt [15:34:13] ensure => link [15:34:14] andrewbogott / ottomata: Want to make a patch reversing it now? [15:34:20] yep, hang on [15:34:28] andrewbogott: i would use a link by puppet [15:34:51] verified: A local file in that dir is /not/ clobbered by recurse. [15:36:15] RECOVERY - mailman list info on sodium is OK: HTTP OK: HTTP/1.1 200 OK - 10511 bytes in 0.116 second response time [15:37:48] (03PS1) 10Andrew Bogott: /var/lib/mailman/templates is a link to /etc/mailman [operations/puppet] - 10https://gerrit.wikimedia.org/r/155564 [15:38:11] matanya, ottomata, happy with that? [15:38:58] (03CR) 10Matanya: [C: 031] /var/lib/mailman/templates is a link to /etc/mailman [operations/puppet] - 10https://gerrit.wikimedia.org/r/155564 (owner: 10Andrew Bogott) [15:39:12] (03CR) 10John F. Lewis: [C: 031] "Perfect fix to my stupidity :)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/155564 (owner: 10Andrew Bogott) [15:39:59] ottomata: out of interest; was this symlink documented anywhere? :) [15:40:14] i guess in ls -la [15:40:18] Yeah, given that that symlink wasn't puppetized, I don't think there was any way for you to know. [15:40:19] i think that's fine, andrewbogott, i'd add a comment explicitly saying what the intention here is. 
the recurse => remote isn't very obvious [15:40:28] JohnLewis: dunno, doubtful [15:40:44] the only reason I knew it was a symilnk was because you had me look into this the other day [15:40:56] and puppet logs :) [15:40:56] Aug 21 14:48:01 sodium puppet-agent[3598]: (/Stage[main]/Mailman::Web-ui/File[/var/lib/mailman/templates]/ensure) ensure changed 'link' to 'directory' [15:41:07] :p [15:41:09] ideally someone should have manually run puppet and watched the logs whne they merged that change [15:41:22] ottomata: recommented [15:41:24] (03PS2) 10Andrew Bogott: /var/lib/mailman/templates is a link to /etc/mailman [operations/puppet] - 10https://gerrit.wikimedia.org/r/155564 [15:41:33] ottomata: yeah, that would've been me. [15:41:59] s'ok ! looks good cept: [15:42:00] Recurse => remove [15:42:29] running in the puppet compiler would reveal that too [15:42:53] (03PS1) 10Reedy: Add/update symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/155566 [15:42:55] (03PS1) 10Reedy: testwiki to 1.24wmf18 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/155567 [15:42:57] (03PS1) 10Reedy: Wikipedias to 1.24wmf17 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/155568 [15:42:59] (03PS1) 10Reedy: group0 to 1.24wmf18 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/155569 [15:43:06] eek [15:43:12] andrewbogott: it is recurse -> remote [15:43:17] not remove [15:43:18] (03CR) 10Reedy: [C: 032] Add/update symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/155566 (owner: 10Reedy) [15:43:22] (03Merged) 10jenkins-bot: Add/update symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/155566 (owner: 10Reedy) [15:43:29] (03PS3) 10Andrew Bogott: /var/lib/mailman/templates is a link to /etc/mailman [operations/puppet] - 10https://gerrit.wikimedia.org/r/155564 [15:46:12] (03CR) 10Ottomata: [C: 032] /var/lib/mailman/templates is a link to /etc/mailman [operations/puppet] - 
10https://gerrit.wikimedia.org/r/155564 (owner: 10Andrew Bogott) [15:46:33] andrewbogott: want to merge and run puppet? (i made a copy of /etc/mailman in /tmp/ just in case!) [15:46:43] heh,I copied it too :) [15:46:52] I'll merge [15:48:46] (03PS3) 10Ottomata: Add cron job to drop old data in HDFS [operations/puppet] - 10https://gerrit.wikimedia.org/r/155549 [15:50:15] PROBLEM - mailman list info on sodium is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string About Wikimedia-l not found on https://lists.wikimedia.org:443/mailman/listinfo/wikimedia-l - 15083 bytes in 0.192 second response time [15:50:46] andrewbogott: if you can please merge my change to fix this ^ [15:51:11] matanya: which one? [15:51:20] mailman worked and now has gone back to the bug issue [15:51:26] https://gerrit.wikimedia.org/r/155563 [15:52:21] ottomata: ^ [15:53:04] PROBLEM - mailman_ctl on sodium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 38 (list), regex args /mailman/bin/mailmanctl [15:53:43] so… JohnLewis, Thehelpfulone, happy with how things look now? [15:54:24] PROBLEM - puppet last run on sodium is CRITICAL: CRITICAL: Puppet has 1 failures [15:54:37] (03CR) 10Andrew Bogott: [C: 032] mailman: search for a new string on new web ui [operations/puppet] - 10https://gerrit.wikimedia.org/r/155563 (owner: 10Matanya) [15:54:39] andrewbogott: yeah :) [15:54:48] andrewbogott: we only had to kill mailman to get it ;) [15:55:04] RECOVERY - mailman_ctl on sodium is OK: PROCS OK: 1 process with UID = 38 (list), regex args /mailman/bin/mailmanctl [15:55:24] RECOVERY - puppet last run on sodium is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [16:00:18] RECOVERY - mailman list info on sodium is OK: HTTP OK: HTTP/1.1 200 OK - 15083 bytes in 0.099 second response time [16:03:57] thanks andrewbogott [16:04:10] thank you, and ottomata. 
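(Editor's sketch.) The root cause above was visible in the puppet-agent log line quoted at 15:40:56, where `ensure` changed from 'link' to 'directory'. As a hedged illustration of "watching the logs" after a merge, the Python below picks such lines out of agent output. The regular expression is written against that single quoted line only, and `find_ensure_changes` is a hypothetical helper name, not part of Puppet.

```python
import re

# Pattern modelled on the one log line quoted in the channel; other
# puppet-agent message formats are not covered by this sketch.
ENSURE_CHANGE = re.compile(
    r"\((?P<resource>.+)/ensure\) ensure changed "
    r"'(?P<old>[^']+)' to '(?P<new>[^']+)'"
)


def find_ensure_changes(lines):
    """Return (resource, old, new) for each 'ensure changed' log line."""
    changes = []
    for line in lines:
        match = ENSURE_CHANGE.search(line)
        if match:
            changes.append(
                (match.group("resource"), match.group("old"), match.group("new"))
            )
    return changes


sample = [
    "Aug 21 14:48:01 sodium puppet-agent[3598]: "
    "(/Stage[main]/Mailman::Web-ui/File[/var/lib/mailman/templates]/ensure) "
    "ensure changed 'link' to 'directory'",
]

for resource, old, new in find_ensure_changes(sample):
    print(f"{resource}: {old} -> {new}")
```

A resource flipping from link to directory is the kind of change worth stopping on before it reaches production, whether it shows up in a manual agent run or in the catalog compiler mentioned earlier.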
[16:13:40] (03PS1) 10Mark Bergsma: Add access switch stack asw-d-codfw to RANCID [operations/puppet] - 10https://gerrit.wikimedia.org/r/155574 [16:15:05] JohnLewis: want to try again with the announcement email? ;) [16:15:18] Thehelpfulone: was it not queued? [16:15:26] nope didn't come through [16:15:27] Thehelpfulone: are you also going to privately email the list owners please? [16:15:39] Nemo_bis: yes, to every list [16:15:49] that's what the announcement email is [16:16:08] Thehelpfulone: every list members? :o [16:16:17] every list -owner email address [16:16:24] ok; owners is enough [16:16:56] sent [16:17:28] 'pending moderator approval' Thehelpfulone: just give me the password to the list :p [16:17:29] (03CR) 10Mark Bergsma: [C: 032] Add access switch stack asw-d-codfw to RANCID [operations/puppet] - 10https://gerrit.wikimedia.org/r/155574 (owner: 10Mark Bergsma) [16:17:51] you forgot to cc me ;) [16:17:59] on you did nvm :p [16:18:35] sent [16:19:05] JohnLewis: so now that it's updated, the list info pages should be able to be updated relatively easily right (I'm thinking of move of the translate link to the sidebar) [16:19:30] hmm, This is a closed translation request. [16:19:40] ok so maybe we remove that link altogether then? Nemo_bis [16:19:47] RECOVERY - Disk space on elastic1005 is OK: DISK OK [16:21:37] Thehelpfulone: dunno; it's your only attribution to translators isn't it [16:21:46] yeah [16:23:31] why did we close it Nemo_bis? are we actually done? [16:23:39] I thought we were going to do another sync in a few weeks? [16:23:44] ohnoes this discussion again [16:23:59] if you are, say so on talk and re-enable :) [16:24:42] heh [16:26:01] hmm which languages were they again [16:27:08] JohnLewis: got 2 of them [16:27:16] Nemo_bis: is there a list of the lang codes for the ones Vogone restricted to? 
Arabic, Catalan, Czech, Danish, German, English, Spanish, Estonian, Basque, Persian, Finnish, French, Hindi, Croatian, Hungarian, Interlingua, Indonesian, Italian, Japanese, Korean, Lithuanian, Norwegian Bokmål, Dutch, Polish, Portuguese, Brazilian Portuguese, Romanian, Russian, Slovenian, Serbian, Swedish, Tamil, Turkish, Ukrainian, Vietnamese, [16:27:16] Chinese (China), Simplified Chinese, Traditional Chinese and Chinese (Taiwan) [16:30:38] Thehelpfulone: probably in the json or db representation of the log [16:37:07] ganglia question: recently the elasticsearch cluster has seen choppy ganglia stats, like no stats at all http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Elasticsearch%20cluster%20eqiad&m=disk_free&r=4hr&s=by%20name&hc=3&mc=2&st=1408638854&g=load_report&z=large [16:37:44] it looks like some sort of packet loss where [16:37:57] where multicast data wouldn't make it to the aggregator from the nodes [16:38:12] did we see this before? [16:43:34] hm, godog, i've not seen that happen on the base ganglia metrics before, [16:43:47] i've had problems with custom stats disappearing from time to time [16:45:26] andrewbogott: so the one thing that's come up so far is that the templates aren't saved as UTF-8 https://lists.wikimedia.org/mailman/listinfo/wikimedia-ru [16:45:53] which is causing problems with Russian for example [16:46:00] Yeah, that seems serious. [16:46:07] do you know how we can re-save them as UTF-8? [16:46:12] ottomata: ye there's custom stats involved, it could be gmond too, I'll keep diggin thanks ! [16:46:31] Thehelpfulone: When you say 'the templates' you mean the new ones that were just added? [16:46:40] I think so yes [16:47:00] Most likely the files are correct and mailman just doesn't know how to interpret them [16:47:27] might be that you just need a flag or mark at the top of the files... [16:48:04] that template file has up top [16:49:06] Thehelpfulone: could it be that the new theme uses a new font? 
[16:49:36] yeah I added that andrewbogott but apparently just because it has that it doesn't mean the html documents /are/ actually UTF-8 [16:50:14] indeed it doesn't [16:56:47] PROBLEM - Puppet freshness on osmium is CRITICAL: Last successful Puppet run was Wed 20 Aug 2014 22:49:01 UTC [17:13:42] (03CR) 10Ottomata: "Ok, seems fine. Why does the CNAME not work though?" [operations/dns] - 10https://gerrit.wikimedia.org/r/154222 (https://bugzilla.wikimedia.org/44731) (owner: 10Jeremyb) [17:17:05] godog: you are assigned on this ticket [17:17:06] https://rt.wikimedia.org/Ticket/Display.html?id=8071 [17:17:18] are you planning on doing something with it? [17:18:22] ottomata: yep, likely next week tho [17:22:27] ok cool [17:22:28] no probs [17:24:06] (03PS1) 10Cmjohnson: Removing mw1130 from dsh files to replace disk and re-install [operations/puppet] - 10https://gerrit.wikimedia.org/r/155584 [17:28:55] !log removing mw1130 from pybal [17:29:01] Logged the message, Master [17:30:24] (03CR) 10Cmjohnson: [C: 032] Removing mw1130 from dsh files to replace disk and re-install [operations/puppet] - 10https://gerrit.wikimedia.org/r/155584 (owner: 10Cmjohnson) [17:39:07] PROBLEM - Disk space on ms1004 is CRITICAL: DISK CRITICAL - free space: / 712 MB (3% inode=94%): /var/lib/ureadahead/debugfs 712 MB (3% inode=94%): [18:13:09] (03PS2) 10Dzahn: remove blog.wikmedia.org related things [operations/puppet] - 10https://gerrit.wikimedia.org/r/153117 [18:15:35] (03PS3) 10Dzahn: remove blog.wikmedia.org related things [operations/puppet] - 10https://gerrit.wikimedia.org/r/153117 [18:17:22] no deploy? [18:19:52] (03CR) 10Ottomata: "Cool, cronjob will need removed from stat1003 too. I assume Tilman knows about this? 
I think he was the one who originally wanted those " [operations/puppet] - 10https://gerrit.wikimedia.org/r/153117 (owner: 10Dzahn) [18:21:53] (03CR) 10Dzahn: "well, i did not speak to Tilman about this specific patch, but i'm quite sure Tilman knows we don't host the blog anymore and either he is" [operations/puppet] - 10https://gerrit.wikimedia.org/r/153117 (owner: 10Dzahn) [18:22:44] (03CR) 10Dzahn: [C: 031] "maybe RobH can review this too being involved in a lot of blog things in the past" [operations/puppet] - 10https://gerrit.wikimedia.org/r/153117 (owner: 10Dzahn) [18:23:49] (03CR) 10Dzahn: "also see Change-Id: I682cb816e08d8" [operations/puppet] - 10https://gerrit.wikimedia.org/r/153117 (owner: 10Dzahn) [18:27:57] !log trying to recover from weird Elasticsearch upgrade failure by redoing the upgrade on one node while also blowing away the data directory during the upgrade. elastic1005, you are my first victem. [18:28:03] Logged the message, Master [18:28:11] !log *victim* [18:28:17] Logged the message, Master [18:32:47] (03CR) 10Anomie: [C: 04-1] "I see no evidence of support for this change at this time. See https://bugzilla.wikimedia.org/show_bug.cgi?id=67709#c8 for details." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/154408 (https://bugzilla.wikimedia.org/67709) (owner: 10Jforrester) [18:43:06] (03PS1) 10Aude: Use new Wikibase serialization format on test wikidata [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/155596 [18:45:11] robh: let's delete the old paging script for the usb device? https://gerrit.wikimedia.org/r/#/c/153227/ [18:46:30] (03CR) 10Dzahn: "Ori, should class { 'apache': be completely removed in this one?" [operations/puppet] - 10https://gerrit.wikimedia.org/r/153843 (owner: 10Dzahn) [18:57:47] PROBLEM - Puppet freshness on osmium is CRITICAL: Last successful Puppet run was Wed 20 Aug 2014 22:49:01 UTC [19:02:25] Reedy: ^d no deploy? [19:03:02] <^d> I wasn't asked about today. 
[19:03:03] <^d> Hmm [19:03:05] jouncebot: !next [19:03:12] ok [19:03:22] no hurry [19:03:38] jouncebot is broken for three days now [19:03:40] or four [19:04:20] schedule a meta deployment of jouncebot [19:04:49] <^d> !jouncebot schedule your own upgrade [19:05:40] Guessing based on the SAL no one has attempted my deploy ;) [19:06:05] RoanKattouw: You should really stop going in via fenari [19:06:09] Shouldn't you be using iron? :) [19:06:35] Yes, I should be :) [19:06:41] (03CR) 10Reedy: [C: 032] testwiki to 1.24wmf18 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/155567 (owner: 10Reedy) [19:06:45] (03Merged) 10jenkins-bot: testwiki to 1.24wmf18 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/155567 (owner: 10Reedy) [19:07:27] catrope pts/3 fenari.wikimedia 12Aug14 43:47m 0.84s 0.84s -bash [19:07:33] Ghost session... [19:07:34] hey mark, do you happen to know if anything sets x-forwarded-for headers other than nginx ssls? [19:07:41] Oh ugh [19:07:56] I guess that might be screen? [19:07:56] ottomata: well, lots of clients (other proxies out there) do [19:08:01] Or maybe just an unclean disconnect [19:08:22] there's a SCREEN -DR for you on fenari [19:08:35] 5 lots of bash [19:08:37] ssh to tin [19:08:56] hm, ok, but there shouldn't ever be an IP before an nginx SSL IP in x-forwarded in the varnish logs if the request went through the ssl terminator [19:08:57] right? [19:09:22] i.e. there shouldn't be anything between nginx and varnish that modifies x-forwaded for [19:09:28] reedy|webirc: Yeah I should stop using fenari as my bastion [19:09:41] :) [19:09:46] I'm sure you'll make mutante happy [19:09:47] i'm seeing if I can id SSL requests based on IP in x-forwaded-for [19:09:58] !log reedy Started scap: testwiki to 1.24wmf18 [19:10:03] Logged the message, Master [19:12:06] (03PS1) 10Ori.livneh: Add 'trebuchet' package provider. [operations/puppet] - 10https://gerrit.wikimedia.org/r/155603 [19:17:15] hey everyone. 
:) [19:17:16] was trying to upload a file to office wiki using Special:Upload. [19:17:18] got an error that said: "An unknown error occurred in storage backend "local-swift-eqiad"." [19:17:19] pm'ed marktraceurWMF but he asked me to holler here instead [19:17:21] what should i do differently? [19:17:22] would like to get meeting minutes out to folks... [19:17:24] hoping you could help me! thanks in advance! :) [19:17:38] mark, another q, I don't suppose you know of a way to get a list of all SSL terminator IPs, eh? [19:18:02] i think that'd include standalone nginx boxes...as well as varnish boxes that also run ssl (we have those, right?) [19:18:03] check pybal :) [19:18:09] yes [19:18:12] eqiad/esams are separate [19:18:15] oo [19:18:16] godog: About? [19:18:17] ulsfo does ssl on the varnish boxes [19:18:36] AnnaKoval: I've got a feeling it might've been broken a little while. Maybe [19:18:45] awesome, got it, got hosts there cool [19:19:17] (03CR) 10Dzahn: "where is role::download::mediawiki even used? Ariel?" [operations/puppet] - 10https://gerrit.wikimedia.org/r/153817 (owner: 10Dzahn) [19:19:37] reedy|webirc: sure, what's up? [19:19:39] reedy|webirc well that's a relief in that *i* didn't break it! :) [19:20:02] * godog reading backlog [19:20:11] godog: uploads on office wiki -> swift error [19:20:16] ottomata: in practice, the SSL proxy's IP can be in the middle of the list [19:20:40] ottomata: there can be external XFF-setters before SSL. But it should be the first IP in the XFF list that belongs to one of our networks, if that works. [19:20:44] do proxies insert themselves at the end or the beginning? [19:21:00] unfortunately the way XFF works it kinda retarded [19:21:18] proxies don't insert themselves, they append the client they forwarded it for, which is the previous proxy/client in the chain [19:21:26] reedy|webirc does that mean it'll likely stay broken a while longer? how can i upload files then if not with special:upload? 
[19:21:32] that'd be https://bugzilla.wikimedia.org/show_bug.cgi?id=69760 I think [19:22:17] godog: Shit error messages are shit :) [19:22:30] reedy|webirc: indeedly [19:22:36] a classic complex case would be a cell phone user with opera mini via https in europe, the XFF list would look like: , , , , [19:22:36] ah right right [19:22:47] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.0166666666667 [19:22:49] AnnaKoval: I presume you've tried a couple of times? [19:22:51] oh right the frontend/backend stuff [19:22:56] hm [19:22:57] something in swift is broken. but nothing new - only the last hours a lot of such problems. [19:22:57] yes [19:23:05] same error every time [19:23:11] godog: AnnaKoval , i just uploaded something and it worked for me [19:23:23] ok i'll try another browser [19:23:28] thanks mutante [19:23:29] https://office.wikimedia.org/wiki/File:Bugzilla-logo.png [19:23:30] AnnaKoval: What file type? [19:23:31] !log mw1019 returned [127]: bash: sync-common: command not found [19:23:37] .png [19:23:37] Logged the message, Master [19:23:44] well, i guess IF the ssl ip is in the list...then I can be pretty sure it was a proxied ssl request [19:23:44] !log mw1178 returned [255]: ssh: connect to host mw1178 port 22: Connection timed out [19:23:49] Logged the message, Master [19:23:57] ottomata: Or IPv6 [19:24:00] mutante: yep aaron and myself were looking at that and there were some timeouts of swift talking to memcache [19:24:11] bwah, right some nginxes proxy for that too [19:24:16] IIRC the SSL proxies and the IPv6 proxies are on the same boxes [19:24:16] hmmmm, that's ok in this case [19:24:29] yeah, that's ok [19:24:33] AnnaKoval: is it a big file or anything? [19:24:57] i'm trying to build logic to throw out internal requests to varnish UNLESS it was a real proxied request [19:25:03] 131 kb [19:25:03] which would include IPv6 [19:25:06] is that big?? 
[19:25:06] Looks like mw1019 is known/under work [19:25:11] ottomata: yeah but in the ulsfo case, the same IP address will be XFF for the varnish-front (not-ssl) or for SSL [19:25:17] I think? [19:25:21] AnnaKoval: Nope [19:25:25] oof, hm [19:25:30] thanks for your help reedy|webirc :) [19:25:32] that is a complication [19:25:36] Someone want to poke/kick mw1178? [19:25:44] ottomata: Oh so maybe all you need is to look for a non-WMF IP in the XFF then? [19:25:58] ottomata: what do you actually need to figure out, exactly, based on the header? [19:26:14] so [19:26:21] udp2log contains logs from nginx + varnish [19:26:27] kafka has only logs from varnish [19:26:51] you want to id requests in kafka that came in via our ssl terminators? [19:26:56] webstatscollector (which generates dumps.wikimedia.org pageview counts) de-duplicates the proxied requests by throwing out ALL requests where request IP is internal [19:27:15] !log restarted memcached on ms-fe1004 [19:27:20] Logged the message, Master [19:27:26] how do they define the "request IP"? the first address in XFF? [19:27:33] yes, if i wanted to run webstatscollector on the kafka stream, i still want to throw out all internal requests, but keep any of those that are real proxied requests [19:27:53] nono, the remote_addr [19:28:18] Not sure if helps: I have reported it one year ago (the bug was closed as fixed, cant find the link) the same swift problems when moving/deleting/uploading files. This is nothing new. [19:28:19] remote_addr at which stage of the request chain? [19:28:28] remote_addr at the varnish frontend? 
[19:28:33] in udp2log, any [19:28:37] udp2log contains both nginx and varnish logs [19:28:45] so, webstatscollector jsut discrads where remote_addr is internal [19:28:46] <3 thanks godog [19:28:54] remote_addr is going to have very different meanings if you're looking at different layers (nginx, varnishes, apaches) [19:28:57] which i believe would discard the proxied varnish logs [19:29:00] but keep the nginx ones [19:29:07] oh [19:29:10] at varnish frontend. [19:30:20] kafka has logs from varnish-frontend (and just uhh, 'varnish' for bits?) [19:30:21] is this point of this to filter ssl-terminated vs not? or to filter all truly-external traffic vs internally-generated (e.g. when something down in apache or deeper goes back up to the front and makes an API request)? [19:30:22] udp2log has that + nginx [19:31:05] the ultimate point right now is to make webstatscollector output the same stats, but without using the logs from nginx [19:31:49] !log disabled mw1178 in pybal [19:31:51] reedy|webirc: ^ [19:31:56] Logged the message, Master [19:32:06] mutante: thanks. is it proper deaded? Should I remove from dsh too? [19:32:24] dunno yet, cant login though, and also not on mgmt [19:33:03] so, bblack, filter truly external traffic [19:33:31] ACKNOWLEDGEMENT - Apache HTTP on mw1178 is CRITICAL: Connection timed out daniel_zahn disabled in pybal [19:34:15] (03PS1) 10Reedy: Remove mw1178 from mediawiki-installation, deaded [operations/puppet] - 10https://gerrit.wikimedia.org/r/155610 [19:34:26] mutante: ^^ there if you need it ;) [19:34:33] ottomata: if you just want to identify requests that hit the nginx SSL terminators, check X-Forwarded-Proto. We set that when it comes through nginx and wipe it first if an outsider set it. 
[19:34:43] reedy|webirc: i also dont know why mw1019 does not have sync-common [19:35:04] mutante: 09:58 godog: depool mw1019 from appservers, testing trusty+hhvm reinstall RT #8153 [19:35:07] I'm guessing it's broken :) [19:35:07] usually it turns out to be some kind of test box, but no ticket for it [19:35:11] hah [19:35:18] it's hhvm [19:35:22] if that matters [19:35:41] #{'host': 'mw1019.eqiad.wmnet', 'weight': 10, 'enabled': True } # HHVM, RT #8153 [19:36:02] not a big deal then [19:36:06] ack [19:36:44] AnnaKoval: no problem [19:36:59] oh it gets wiped by nginx, bblack? [19:37:20] ottomata: so, "truly external traffic" from varnish-frontend's perspective would be "Anything with an non-WMF REMOTE_ADDR, plus anything with X-F-P set to 'https'", *if* we assume that internal requests never use SSL [19:37:39] ottomata: it gets wiped by varnish if the remote addr doesn't belong to the list of our nginx proxies [19:37:44] reedy|webirc mutante yeah known issue :( will be fixed tomorrow or earlier cc: ori [19:38:07] so bblack, in other words [19:38:23] if remote_addr is internal AND XFF is set, then keep? 
[19:38:27] PROBLEM - ElasticSearch health check on elastic1007 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.139 [19:38:32] if for some reason there are internal SSL requests, then the simpler route to go would be "Anything with a non-WMF REMOTE_ADDR, plus anything with a non-WMF address as the very first entry in XFF (left-most)" [19:38:50] i think i'm ok with not worrying about that edge case [19:38:53] this doesn't ahve to be exact [19:38:59] (03CR) 10Dzahn: [C: 032] "dead, can't connect at all" [operations/puppet] - 10https://gerrit.wikimedia.org/r/155610 (owner: 10Reedy) [19:39:04] I have no idae if it's an edge case or not, someone else might :) [19:39:08] heh [19:39:24] but the second one, if you can do it, is completely reliable [19:39:31] right now we mainly want to use this as a comparison to the udp2log output, to make sure we are mostly the same with the kafka sream [19:39:42] "non-WMF REMOTE_ADDR *or* WMF REMOTE_ADDR + non-WMF first entry in XFF" [19:39:47] hm, checking left-most XFF should be easy [19:40:00] Anything known up with mw1017? [19:40:07] scap-rebuild-cdbs: 100% (ok: 225; fail: 2; left: 1) [19:40:11] testwiki hasn't rolled over yet [19:40:13] hmmm [19:40:27] the only case that breaks that rule would be if a stupid user decided to screw with you and manually specify a WMF-internal IP address in his XFF header and send that traffic to us [19:40:46] ok coooool [19:40:47] i think I can do that [19:40:48] thanks bblack [19:40:49] but people like that are crazy enough that it's rare they can operate a keyboard [19:40:56] hah [19:40:56] (03CR) 10Jeremyb: "The patch intends to maintain status quo for MX and change everything else to point to the main cluster. (in particular the goal is to mov" [operations/dns] - 10https://gerrit.wikimedia.org/r/154222 (https://bugzilla.wikimedia.org/44731) (owner: 10Jeremyb) [19:40:57] reedy|webirc: also hhvm? 
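Editor's sketch of the rule bblack states above ("non-WMF REMOTE_ADDR *or* WMF REMOTE_ADDR + non-WMF first entry in XFF"), written as a small Python filter. The network list is a hypothetical placeholder, since the log does not give the real WMF ranges, and the function names are illustrative only. This is not the webstatscollector code.

```python
import ipaddress

# ASSUMPTION: placeholder range only; the real list of WMF networks
# is not given in the log and would have to be supplied.
INTERNAL_NETWORKS = [ipaddress.ip_network("10.0.0.0/8")]


def is_internal(addr):
    """Return True if addr parses as an IP inside one of INTERNAL_NETWORKS."""
    try:
        ip = ipaddress.ip_address(addr.strip())
    except ValueError:
        # Unparsable values (empty string, "unknown", ...) are not internal.
        return False
    return any(ip in net for net in INTERNAL_NETWORKS)


def is_external_request(remote_addr, xff_header):
    """Keep a request if it is 'truly external' by the rule from the chat.

    1. A non-internal REMOTE_ADDR is always kept.
    2. An internal REMOTE_ADDR is kept only if the left-most X-Forwarded-For
       entry is present and is not internal (a proxied request).
    """
    if not is_internal(remote_addr):
        return True
    first = xff_header.split(",")[0].strip() if xff_header else ""
    if not first:
        return False
    return not is_internal(first)
```

As noted in the discussion, the rule only misclassifies a client that deliberately puts an internal address first in its own XFF header; such a request would be dropped as internal.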
[19:41:04] * aude gets confused [19:41:25] Special:Version says so [19:41:38] `php -v` says so too [19:41:40] scap-rebuild-cdbs is hanging [19:41:43] https://test.wikidata.org/wiki/Special:Version says no! [19:41:50] still broken for a week [19:42:07] anyway, shall poke at that another day [19:42:07] https://test.wikipedia.org/wiki/Special:Version [19:42:08] WFM [19:43:03] Oh [19:43:03] I'm tired [19:43:13] scap-rebuild-cdbs isn't wikiversions [19:43:27] this is true [19:43:30] ACKNOWLEDGEMENT - Host mw1178 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn RT #8173 [19:43:35] THAT'S BETTER [19:44:00] !log reedy Finished scap: testwiki to 1.24wmf18 (duration: 34m 01s) [19:44:05] Logged the message, Master [19:44:10] 19:43:59 1 hosts had sync_wikiversions errors [19:44:34] (03CR) 10Reedy: [C: 032] Wikipedias to 1.24wmf17 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/155568 (owner: 10Reedy) [19:44:38] (03Merged) 10jenkins-bot: Wikipedias to 1.24wmf17 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/155568 (owner: 10Reedy) [19:45:27] 3 Fatal error: Class 'LuaSandbox' not found in /usr/local/apache/common-local/php-1.24wmf17/extensions/Scribunto/engines/LuaSandbox/Engine.php on line 17 [19:45:32] Bets on it being one box? :) [19:45:56] Yup, 10.64.0.47 [19:46:09] oh look, mw1017 [19:46:09] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Wikipedias to 1.24wmf17 [19:46:15] Logged the message, Master [19:47:16] fyi marktraceurWMF and reedy|webirc and godog -- special:upload worked in firefox but still failed in chrome. appreciate your help. 
<3 [19:47:28] o_0 [19:47:52] (03CR) 10Reedy: [C: 032] group0 to 1.24wmf18 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/155569 (owner: 10Reedy) [19:47:56] (03Merged) 10jenkins-bot: group0 to 1.24wmf18 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/155569 (owner: 10Reedy) [19:49:49] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: group0 to 1.24wmf18 [19:49:52] (03PS1) 10Filippo Giunchedi: elasticsearch: restore 2s timeout for ganglia [operations/puppet] - 10https://gerrit.wikimedia.org/r/155614 [19:49:55] Logged the message, Master [19:49:59] ^d: ^ [19:50:38] * aude wants https://gerrit.wikimedia.org/r/#/c/155596/ deployed for test.wikidata [19:50:40] (03CR) 10Chad: [C: 031] elasticsearch: restore 2s timeout for ganglia [operations/puppet] - 10https://gerrit.wikimedia.org/r/155614 (owner: 10Filippo Giunchedi) [19:51:01] godog: admins complian about errors when trying to delete file [19:51:08] (03PS2) 10Reedy: Use new Wikibase serialization format on test wikidata [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/155596 (owner: 10Aude) [19:51:13] (03CR) 10Reedy: [C: 032] Use new Wikibase serialization format on test wikidata [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/155596 (owner: 10Aude) [19:51:14] yay, thanks [19:51:17] (03Merged) 10jenkins-bot: Use new Wikibase serialization format on test wikidata [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/155596 (owner: 10Aude) [19:51:41] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] elasticsearch: restore 2s timeout for ganglia [operations/puppet] - 10https://gerrit.wikimedia.org/r/155614 (owner: 10Filippo Giunchedi) [19:51:43] (03PS3) 10Reedy: Enable webfonts in English Wikisource [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/155206 (https://bugzilla.wikimedia.org/69655) (owner: 10KartikMistry) [19:51:53] matanya: hey! 
[19:51:54] (03CR) 10Reedy: [C: 032] Enable webfonts in English Wikisource [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/155206 (https://bugzilla.wikimedia.org/69655) (owner: 10KartikMistry) [19:51:58] (03Merged) 10jenkins-bot: Enable webfonts in English Wikisource [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/155206 (https://bugzilla.wikimedia.org/69655) (owner: 10KartikMistry) [19:52:03] matanya: same issue as the bug above? [19:53:11] (03PS3) 10Reedy: Add botadmin user group on fa.wikipedia [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/154126 (https://bugzilla.wikimedia.org/69411) (owner: 10Calak) [19:53:18] godog: unknown error: eqiad-swift-local [19:53:23] (03CR) 10Reedy: [C: 032] Add botadmin user group on fa.wikipedia [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/154126 (https://bugzilla.wikimedia.org/69411) (owner: 10Calak) [19:53:25] yeah [19:53:27] (03Merged) 10jenkins-bot: Add botadmin user group on fa.wikipedia [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/154126 (https://bugzilla.wikimedia.org/69411) (owner: 10Calak) [19:54:02] (03PS2) 10Reedy: Maintenance reports limit incremental increase. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/154986 (owner: 10Springle) [19:54:02] so there seem to be timeouts talking to memcached in swift proxies, aaron mentioned that could be the cause of the 401s [19:54:06] (03CR) 10Reedy: [C: 032] Maintenance reports limit incremental increase. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/154986 (owner: 10Springle) [19:54:10] (03Merged) 10jenkins-bot: Maintenance reports limit incremental increase. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/154986 (owner: 10Springle) [19:54:58] godog: that makes sense. 
since retrying works from time to time [19:55:30] yeah, still trying to understand what's going on exactly [19:55:46] !log reedy Synchronized wmf-config/: (no message) (duration: 00m 13s) [19:55:52] Logged the message, Master [19:55:53] (03PS2) 10Legoktm: Enable GlobalCssJs on all CentralAuth wikis minus loginwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/154432 (https://bugzilla.wikimedia.org/57891) [20:03:07] i'm out, back later [20:04:16] (03PS2) 10BryanDavis: Add 'trebuchet' package provider. [operations/puppet] - 10https://gerrit.wikimedia.org/r/155603 (https://bugzilla.wikimedia.org/59931) (owner: 10Ori.livneh) [20:10:48] (03PS1) 10Dzahn: fix mailman queue monitoring [operations/puppet] - 10https://gerrit.wikimedia.org/r/155620 [20:10:53] matanya: ^ [20:11:03] !log restarted swift-proxy on ms-fe1001 [20:11:10] Logged the message, Master [20:12:19] thanks mutante that makes sense [20:13:15] (03CR) 10Dzahn: "actually the standard is for nagios plugins to take args -w and -c , we just use $1 so far.. but oh well" (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/155620 (owner: 10Dzahn) [20:18:51] (03CR) 10Dzahn: [C: 032] "defining the checkcommand itself should not be necessary, instead handled by the nrpe check class" [operations/puppet] - 10https://gerrit.wikimedia.org/r/155620 (owner: 10Dzahn) [20:20:12] (03PS1) 10Jgreen: add SPF record for donate.wm.o and update IPs for *.donate.wm.o SPF record [operations/dns] - 10https://gerrit.wikimedia.org/r/155623 [20:20:30] matanya: and.. 
typo :p sigh [20:20:43] (03CR) 10jenkins-bot: [V: 04-1] add SPF record for donate.wm.o and update IPs for *.donate.wm.o SPF record [operations/dns] - 10https://gerrit.wikimedia.org/r/155623 (owner: 10Jgreen) [20:22:27] (03PS1) 10Dzahn: fix typo in mailman queue check path [operations/puppet] - 10https://gerrit.wikimedia.org/r/155624 [20:23:40] (03CR) 10Dzahn: [C: 032] fix typo in mailman queue check path [operations/puppet] - 10https://gerrit.wikimedia.org/r/155624 (owner: 10Dzahn) [20:27:33] (03PS2) 10Jgreen: update IPs for *.donate.wm.o SPF record [operations/dns] - 10https://gerrit.wikimedia.org/r/155623 [20:28:52] (03CR) 10Jgreen: [C: 032 V: 031] update IPs for *.donate.wm.o SPF record [operations/dns] - 10https://gerrit.wikimedia.org/r/155623 (owner: 10Jgreen) [20:33:54] (03PS1) 10Filippo Giunchedi: swift: increase max memcache connections [operations/puppet] - 10https://gerrit.wikimedia.org/r/155629 [20:36:50] anyone for a quick review of the above? [20:41:42] godog, are the memcache conns in ganglia? [20:42:00] and what's the current (default) value? [20:42:14] jeremyb: afaik not for the memcache that runs on swift, default is 2 [20:42:24] aha [20:42:49] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] swift: increase max memcache connections [operations/puppet] - 10https://gerrit.wikimedia.org/r/155629 (owner: 10Filippo Giunchedi) [20:42:51] for that matter how many workers? [20:43:07] I think defaults to number of CPUs [20:44:14] !log rolling restart of swift-proxy on ms-fe1* [20:44:20] Logged the message, Master [20:47:47] (03PS1) 10Dzahn: move mailman queue monitor script to /usr/local [operations/puppet] - 10https://gerrit.wikimedia.org/r/155632 [20:48:57] (03CR) 10Dzahn: [C: 032] move mailman queue monitor script to /usr/local [operations/puppet] - 10https://gerrit.wikimedia.org/r/155632 (owner: 10Dzahn) [20:52:34] (03PS3) 10Ori.livneh: Add 'trebuchet' package provider. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/155603 (https://bugzilla.wikimedia.org/59931) [20:55:23] PROBLEM - puppet last run on sodium is CRITICAL: CRITICAL: Epic puppet fail [20:56:01] Thehelpfulone, mutante: epic ^^ [20:56:23] RECOVERY - puppet last run on sodium is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [20:56:26] it's not true, i saw the run being finished [20:56:28] there [20:56:47] i dunno, but sometimes icinga-wm is a liar [20:57:11] ugh [20:57:21] why is code on pl.wp throwing "Uncaught exception: Error: Unknown dependency: mediawiki.util" [20:57:53] can somebody touch the file or something? [20:58:13] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [20:58:24] hah, icinga-wm still lying? [20:58:35] i bet it recovers soon :p [20:58:42] PROBLEM - Puppet freshness on osmium is CRITICAL: Last successful Puppet run was Wed 20 Aug 2014 22:49:01 UTC [20:58:58] now that one is probably true [20:59:04] hey we see some a lot of problems about broken js on de.wiki since ~40 min. No tool box on edit, no search autocomplete, etc. (everything that requires js, js-debugger also cries about TypeError: mw.config is null"; "TypeError: $.wikiEditor is undefined" etc.). [20:59:06] please touch/sync/whatever "resources/src/mediawiki/mediawiki.util.js" in core [20:59:23] if it doesn't help, i'll look into it some more [20:59:25] but i bet that will help [20:59:31] Reedy: [20:59:32] greg-g: Reedy ^ [20:59:47] Reedy was having IRC issues earlier, I'm not sure he's online [21:00:09] he just did the deployment [21:00:22] [13:03:07] i'm out, back later [21:00:22] [13:03:14] reedy|webirc (41085ae4@gateway/web/freenode/ip.65.8.90.228) left IRC. 
[21:00:23] I don't know, but I just saw another weird one (right after editing common.js): out of memory on fr.wikipedia, using Firefox [21:00:26] that was an hour ago [21:00:37] helderwiki: in LocalStorage? that's a known issue [21:00:43] helderwiki: that's a diffe- yes [21:00:45] then my browsers crashed, and then I disabled javascript and reverted my changes [21:00:46] greg-g is on vacation also, I am looking around [21:01:00] is it so hard to touch a file…? [21:01:01] no, not on local storage [21:01:10] hey, i'm here [21:01:12] touching it will make RL reload its caches and stuff [21:01:17] MatmaRex: what should i touch? [21:01:19] resources/src/mediawiki/mediawiki.util.js [21:01:24] oh hi ori, thank you [21:01:25] which branch? [21:01:39] the wikipedias one [21:02:01] wmf17 [21:02:41] !log ori Synchronized php-1.24wmf17/resources/src/mediawiki/mediawiki.util.js: Touch resources/src/mediawiki/mediawiki.util.js (duration: 00m 06s) [21:02:46] Logged the message, Master [21:02:53] (03CR) 10Jeremyb: "RT 8173" [operations/puppet] - 10https://gerrit.wikimedia.org/r/155610 (owner: 10Reedy) [21:03:03] legoktm: the full error message was just "Uncaught Exception: Out of Memory" [21:03:10] :o [21:03:14] so it's not https://bugzilla.wikimedia.org/show_bug.cgi?id=65364 ? [21:03:46] how's it looking now? [21:03:48] (03CR) 10Dzahn: "root@sodium:~# cat /etc/nagios/nrpe.d/check_mailman_queue.cfg" [operations/puppet] - 10https://gerrit.wikimedia.org/r/155632 (owner: 10Dzahn) [21:04:15] ori: i'm looking, was making sure to purge caches and stuff [21:04:25] we might have to wait five minutes?… eh [21:04:26] might need to wait 5 min... [21:04:45] nope, not 65364 (I'm getting used to that one =/ ) [21:06:26] matanya: ..and now it works https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=sodium&service=mailman_queue_size [21:09:48] ? 
[21:10:27] ugh [21:10:44] doesn't seem to have helped much :( [21:11:15] * ori reads backlog [21:11:31] MatmaRex: if at any point you think you have it figured out, ping [21:11:52] i am mostly getting "Unhandled Error: Cannot convert 'mw.config' to object" now [21:12:07] something, somehow, is unsetting mw.config after it has been set… [21:12:19] (03CR) 10BryanDavis: Add 'trebuchet' package provider. (032 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/155603 (https://bugzilla.wikimedia.org/59931) (owner: 10Ori.livneh) [21:12:30] (03PS1) 10Dzahn: lower TTL of svn.wm.org to 5M [operations/dns] - 10https://gerrit.wikimedia.org/r/155635 [21:15:41] (03CR) 10Dzahn: [C: 032] lower TTL of svn.wm.org to 5M [operations/dns] - 10https://gerrit.wikimedia.org/r/155635 (owner: 10Dzahn) [21:16:09] what was the wmf-config sync? [21:18:25] dewp is having issues too (reported in -commons) [21:19:17] what do we know so far? when was the earliest issue reported? [21:19:51] from -tech: [13:57:52] numerous users from the French Wikipedia are reporting their browser throwing a script error since ~10 minutes [21:19:56] timestamp is PDT [21:20:18] it works again now [21:20:40] people complained on #wikipedia-fr and Twitter. [21:20:58] ori: https://gerrit.wikimedia.org/r/#/c/155568/1/wikiversions.json [21:21:07] MatmaRex: if you are able to reproduce, can you try something like https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Object/watch ? 
[21:21:12] legoktm: the frwiki problem may be what I was talking about (or not) [21:21:24] I can't reproduce on dewp [21:23:35] (03CR) 10Dzahn: "yes, but that doesn't keep the compiler from throwing another warning line because of it" [operations/puppet] - 10https://gerrit.wikimedia.org/r/154373 (owner: 10Dzahn) [21:24:56] (03PS2) 10Dzahn: exim templates - deprecated variable syntax [operations/puppet] - 10https://gerrit.wikimedia.org/r/154371 [21:26:02] (03CR) 10Dzahn: [C: 031] "this is for ottomata" [operations/puppet] - 10https://gerrit.wikimedia.org/r/155452 (owner: 10Dzahn) [21:27:34] hi [21:27:45] hi [21:27:51] do you see anything in your javascript error console? [21:27:55] (03CR) 10Dzahn: "manual rebase? argg..so just tell me, should the template be inside the module" [operations/puppet] - 10https://gerrit.wikimedia.org/r/155111 (owner: 10Dzahn) [21:28:00] ehhhh [21:28:03] i think i've got it [21:28:09] there are module that depends on 'mediawiki' or 'jquery' [21:28:19] (03CR) 10CSteipp: [C: 031] OTRS - use ssl_ciphersuite [operations/puppet] - 10https://gerrit.wikimedia.org/r/153998 (owner: 10Dzahn) [21:28:45] :< [21:28:46] where? [21:28:53] https://pl.wikipedia.org/w/index.php?title=MediaWiki:Gadgets-definition&diff=40209245&oldid=40052673 [21:29:05] maybe elsewhere too… [21:29:52] I don't see any on https://fr.wikipedia.org/wiki/MediaWiki:Gadgets-definition [21:30:06] nothing on https://de.wikipedia.org/wiki/MediaWiki:Gadgets-definition either [21:30:08] "easy" way to check: go to https://bits.wikimedia.org/pl.wikipedia.org/load.php?debug=false&lang=pl&modules=startup&only=scripts&skin=vector&* (replace for other wikis) and search for "jquery" and "mediawiki" [21:30:23] should appear exactly once when the module itself is mentioned, never as a dependency [21:30:44] why is it a problem for a module to depend on mediawiki or jquery? 
[21:31:07] because krinkle changed some things recently and now it explodes :D [21:31:19] they seem to not have guards against being run more than once [21:31:42] because they're loaded with 'raw', not through the normal loader [21:31:46] so 'mediawiki' runs, then the module for mw.config runs and sets it, then 'mediawiki' runs again and replaces the global mediaWiki object [21:31:50] boom, no mw.config [21:31:51] maybe just use mwgrep to find any Gadget definitions with dependencies on jquery or mediawiki? [21:32:21] but I don't see anything on dewp or frwp [21:32:45] it was always said that you shouldn't add a dependency on them, but never enforced [21:32:59] is dewp and frwp still broken? [21:33:00] we have a test that checks for it, but that won't run on gadgets. [21:33:11] maxxl2: are things still broken? [21:33:21] yep - no change [21:33:33] ori: https://gerrit.wikimedia.org/r/#/c/152122/ is probably what broke it [21:33:35] hm [21:34:32] yes [21:34:34] * Mark "jquery" and "mediawiki" as Raw modules. While the startup [21:34:34] module had this already, these didn't. Without this, they'd [21:34:34] get the conditional wrap – which would be a problem since mediawiki.js [21:34:36] can't be conditional on 'window.mw' for that file defines that [21:34:38] namespace itself. [21:35:33] maxxl2: just to make totally sure...you're not using IE6 right? [21:35:47] not - it's ff31 [21:35:56] ok :P [21:37:37] solution is: empty cache and f5 [21:37:46] now it works again [21:37:47] and it starts working again? 
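Editor's sketch of the gadget check suggested above (find gadget definitions that depend on 'jquery' or 'mediawiki'). It assumes the usual MediaWiki:Gadgets-definition line shape, `* name[ResourceLoader|dependencies=a,b]|file.js`, and takes the page text as a string. The function name is hypothetical, and this is not how mwgrep works; it only illustrates the check.

```python
import re

# The two base modules that gadgets should never list as dependencies.
FORBIDDEN = {"jquery", "mediawiki"}

# Matches "* gadgetname[options]" at the start of a definition line;
# the bracketed options part is optional.
LINE_RE = re.compile(r"^\*\s*([^\[\]|\s]+)\s*(?:\[([^\]]*)\])?")


def gadgets_with_forbidden_deps(definition_text):
    """Map gadget name -> sorted list of forbidden dependencies it declares."""
    offenders = {}
    for line in definition_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # section headings, blank lines, prose
        name, options = match.group(1), match.group(2) or ""
        for option in options.split("|"):
            key, _, value = option.partition("=")
            if key.strip() != "dependencies":
                continue
            deps = {dep.strip() for dep in value.split(",")}
            bad = sorted(deps & FORBIDDEN)
            if bad:
                offenders[name] = bad
    return offenders
```

Only exact module names are flagged, so a dependency such as `mediawiki.util` or `jquery.ui` is left alone.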
[21:38:11] even autocomplete is back
[21:38:23] toolbars are back
[21:38:27] on de.wp we removed jquery and mediawiki module from CommonsDirekt MediaWiki:Gadgets-definition, there is still some popular gadget causing problems for some users
[21:41:06] https://de.wikipedia.org/w/index.php?title=MediaWiki%3AGadgets-definition&diff=133315241&oldid=133124742
[21:41:49] so for dewp it's probably just caching now
[21:42:58] no, it's not
[21:42:58] i'm working on a fix, sec
[21:42:58] ok
[21:47:50] monobook seems to have probs still
[21:49:33] RECOVERY - ElasticSearch health check on elastic1007 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 2017: active_shards: 6050: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0
[21:55:21] MatmaRex: +1 for https://gerrit.wikimedia.org/r/#/c/155643/ ?
[21:55:25] or legoktm
[21:55:38] * legoktm looks
[21:55:43] it's a hack, so I don't intend to try and get the patch for master merged
[21:55:50] just the wmf17 cherry-pick
[21:56:26] +1'd
[21:56:53] thanks
[21:57:42] !log performing elasticsearch upgrade on elastic1015
[21:57:48] Logged the message, Master
[21:58:08] !log ori Synchronized php-1.24wmf17/resources/src/mediawiki/mediawiki.js: I8d27442d1: Workaround for bug introduced by Icf6ede09b (duration: 00m 03s)
[21:58:13] thx - mate +1
[21:58:14] Logged the message, Master
[21:59:50] ugh.
[22:00:33] yes, ugly.
[22:03:02] I have a less-hacky fix
[22:03:20] less-hacky fixes > hacky fixes
[22:08:24] (CR) Dzahn: [C: 2] bugzilla - use ssl_ciphersuite to add HSTS [operations/puppet] - https://gerrit.wikimedia.org/r/154978 (owner: Dzahn)
[22:09:13] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
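A note on the gadget check discussed above (21:28 to 21:32): the idea of scanning a wiki's MediaWiki:Gadgets-definition page for gadgets that list 'jquery' or 'mediawiki' as a dependency could be sketched roughly as below. This is only an illustration, not mwgrep and not anything that was run in the channel. It assumes definition lines of the form `* name[ResourceLoader|dependencies=a, b]|file.js`; the function name `find_forbidden_dependencies` and the sample gadget names are hypothetical.

```python
import re

# Modules that, per the discussion above, should never be listed as a dependency.
FORBIDDEN = {"jquery", "mediawiki"}

# Assumed shape of a definition line (the bracketed options are optional here):
#   * name[ResourceLoader|dependencies=a, b|default]|file.js
LINE_RE = re.compile(r"^\*\s*(?P<name>[^\[|\s]+)\s*(?:\[(?P<opts>[^\]]*)\])?")


def find_forbidden_dependencies(definition_text):
    """Return {gadget_name: sorted list of forbidden dependencies}."""
    offenders = {}
    for line in definition_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match or not match.group("opts"):
            continue  # section heading, blank line, or gadget without options
        for option in match.group("opts").split("|"):
            key, _, value = option.partition("=")
            if key.strip() != "dependencies":
                continue
            # Exact-name comparison, so 'mediawiki.util' is not flagged.
            deps = {dep.strip() for dep in value.split(",")}
            bad = sorted(deps & FORBIDDEN)
            if bad:
                offenders[match.group("name")] = bad
    return offenders


# Hypothetical sample text, not taken from any real wiki.
SAMPLE = """\
== interface ==
* HypotheticalGadgetA[ResourceLoader|dependencies=jquery, mediawiki.util]|a.js
* HypotheticalGadgetB[ResourceLoader|dependencies=mediawiki.util]|b.js
* HypotheticalGadgetC|c.js
"""

if __name__ == "__main__":
    print(find_forbidden_dependencies(SAMPLE))
```

Only exact matches on the two base module names are reported, since depending on modules such as 'mediawiki.util' was not described as a problem in the discussion.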
[22:12:42] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0
[22:14:16] (CR) Ori.livneh: Add 'trebuchet' package provider. (2 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/155603 (https://bugzilla.wikimedia.org/59931) (owner: Ori.livneh)
[22:15:54] (PS1) Ori.livneh: Use Trebuchet package provider for RCStream [operations/puppet] - https://gerrit.wikimedia.org/r/155648
[22:16:29] (PS1) Dzahn: bugzilla - remove superfluous STS header setting [operations/puppet] - https://gerrit.wikimedia.org/r/155649
[22:16:59] mutante: thanks for the IRL ping btw
[22:17:46] ori: sure, np
[22:18:25] (CR) Dzahn: "resulting difference is only whitespace, but also a follow-up here: Change-Id: I3f9bf70655f90fac9" [operations/puppet] - https://gerrit.wikimedia.org/r/154978 (owner: Dzahn)
[22:19:00] (CR) BryanDavis: Add 'trebuchet' package provider. (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/155603 (https://bugzilla.wikimedia.org/59931) (owner: Ori.livneh)
[22:19:08] (PS2) Dzahn: bugzilla - remove superfluous STS header setting [operations/puppet] - https://gerrit.wikimedia.org/r/155649
[22:20:19] (CR) Dzahn: [C: 2] "Header add Strict-Transport-Security "max-age=31536000"" [operations/puppet] - https://gerrit.wikimedia.org/r/155649 (owner: Dzahn)
[22:22:22] (CR) Dzahn: "actually, it's also "Header set" vs. "Header add", but for that, see: Change-Id: I76180c650d1af64df5" [operations/puppet] - https://gerrit.wikimedia.org/r/154978 (owner: Dzahn)
[22:28:30] (CR) Dzahn: [C: 2] "http://puppet-compiler.wmflabs.org/241/change/153998/diff/iodine.wikimedia.org.diff.formatted" [operations/puppet] - https://gerrit.wikimedia.org/r/153998 (owner: Dzahn)
[22:31:08] (CR) MZMcBride: "Huzzah!" [operations/puppet] - https://gerrit.wikimedia.org/r/154964 (https://bugzilla.wikimedia.org/61283) (owner: John F. Lewis)
[22:32:13] PROBLEM - Number of mediawiki jobs running on tungsten is CRITICAL: CRITICAL: Anomaly detected: 36 data above and 0 below the confidence bounds
[22:32:13] PROBLEM - Number of mediawiki jobs queued on tungsten is CRITICAL: CRITICAL: Anomaly detected: 36 data above and 0 below the confidence bounds
[22:33:32] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:34:13] PROBLEM - Number of mediawiki jobs running on tungsten is CRITICAL: CRITICAL: Anomaly detected: 35 data above and 0 below the confidence bounds
[22:34:13] PROBLEM - Number of mediawiki jobs queued on tungsten is CRITICAL: CRITICAL: Anomaly detected: 35 data above and 0 below the confidence bounds
[22:34:22] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.003 second response time
[22:34:49] (CR) Dzahn: [C: 2] "http://puppet-compiler.wmflabs.org/242/change/153967/diff/ytterbium.wikimedia.org.diff.formatted" [operations/puppet] - https://gerrit.wikimedia.org/r/153967 (owner: Dzahn)
[22:36:09] (PS2) Dzahn: gerrit - use apache::site [operations/puppet] - https://gerrit.wikimedia.org/r/153849
[22:39:11] (CR) CSteipp: [C: 1] "In this configuration (with $wgUseGlobalSiteCssJs explicitly set to false), I'm ok with this going out. If anyone with edituserjs rights a" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/154432 (https://bugzilla.wikimedia.org/57891) (owner: Legoktm)
[22:40:31] The script error problem is back on frwiki
[22:40:35] Script : https://bits.wikimedia.org/fr.wikipedia.org/load.php?debug=false&lang=fr&modules=jquery%2Cmediawiki&only=scripts&skin=vector&version=20140821T215722Z:4
[22:41:37] (CR) Dzahn: "here's the puppet compiler output to proof it's not functional change: http://puppet-compiler.wmflabs.org/243/change/154368/diff/virt1000." [operations/puppet] - https://gerrit.wikimedia.org/r/154368 (owner: Chmarkine)
[22:42:19] (Abandoned) Jdlrobson: Get MediaWiki UI in front of people [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/154972 (owner: Jdlrobson)
[22:44:13] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet).
[22:45:06] Ash_Crow: could you clear your cache?
[22:46:39] (PS3) Dzahn: put svn.wikimedia.org behind misc. varnish [operations/puppet] - https://gerrit.wikimedia.org/r/155077
[22:47:00] (PS4) Dzahn: put svn.wikimedia.org behind misc. varnish [operations/puppet] - https://gerrit.wikimedia.org/r/155077
[22:47:09] Ash_Crow: For some (unknown) reason Firefox and Chrome froze after I updated addOnloadHook to jQuery on this edit: https://fr.wikipedia.org/w/index.php?diff=106712089
[22:47:18] so I just reverted it and give up for now
[22:47:30] (CR) Dzahn: "splitting this up into 2 changes, only adding varnish config first, then switching..then removing useless backend config" [operations/puppet] - https://gerrit.wikimedia.org/r/155077 (owner: Dzahn)
[22:47:59] some of those scripts must be doing something uncommon to break in a change like this =/
[22:51:46] (CR) Dzahn: [C: 2] put svn.wikimedia.org behind misc. varnish [operations/puppet] - https://gerrit.wikimedia.org/r/155077 (owner: Dzahn)
[22:59:42] PROBLEM - Puppet freshness on osmium is CRITICAL: Last successful Puppet run was Wed 20 Aug 2014 22:49:01 UTC
[23:00:05] (PS2) Dzahn: switch svn.wm.org over to misc-web-lb.eqiad [operations/dns] - https://gerrit.wikimedia.org/r/155078
[23:01:07] ACKNOWLEDGEMENT - Puppet freshness on osmium is CRITICAL: Last successful Puppet run was Wed 20 Aug 2014 22:49:01 UTC daniel_zahn HHVM - administratively disabled (Reason: reason not specified)
[23:02:18] osmium - 2 processes with command name 'hhvm'
[23:03:30] (CR) Dzahn: [C: 2] switch svn.wm.org over to misc-web-lb.eqiad [operations/dns] - https://gerrit.wikimedia.org/r/155078 (owner: Dzahn)
[23:04:45] eh, no.. grrr
[23:17:31] swat will happen in a bit, i'll do it
[23:22:50] ori: Thanks!
[23:41:56] (PS4) Ori.livneh: Add 'trebuchet' package provider and role. [operations/puppet] - https://gerrit.wikimedia.org/r/155603 (https://bugzilla.wikimedia.org/59931)
[23:42:32] James_F: any chance you could submit the submodule update for https://gerrit.wikimedia.org/r/#/c/155601/ ?
[23:42:39] (CR) jenkins-bot: [V: -1] Add 'trebuchet' package provider and role. [operations/puppet] - https://gerrit.wikimedia.org/r/155603 (https://bugzilla.wikimedia.org/59931) (owner: Ori.livneh)
[23:42:39] ori: Sure, no problem.
[23:42:42] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.00666666666667
[23:43:13] PROBLEM - puppet last run on cp3015 is CRITICAL: CRITICAL: Epic puppet fail
[23:43:48] (PS5) Ori.livneh: Add 'trebuchet' package provider and role. [operations/puppet] - https://gerrit.wikimedia.org/r/155603 (https://bugzilla.wikimedia.org/59931)
[23:44:18] (PS6) Ori.livneh: Add 'trebuchet' package provider and role. [operations/puppet] - https://gerrit.wikimedia.org/r/155603 (https://bugzilla.wikimedia.org/59931)
[23:46:21] (PS2) Ori.livneh: Use Trebuchet package provider for RCStream [operations/puppet] - https://gerrit.wikimedia.org/r/155648
[23:47:42] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0
[23:48:31] ori: Done as https://gerrit.wikimedia.org/r/155667
[23:49:07] (CR) Ori.livneh: "catalog compiler for rcs1001: http://puppet-compiler.wmflabs.org/244/change/155648/html/" [operations/puppet] - https://gerrit.wikimedia.org/r/155648 (owner: Ori.livneh)
[23:52:33] (CR) BryanDavis: "a couple of trivial nits inline, but overall this looks pretty awesome." (2 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/155603 (https://bugzilla.wikimedia.org/59931) (owner: Ori.livneh)
[23:57:12] !log ori Started scap: SWAT: d3de89777, 7abfe0d5e7, 8ec9853c32b, 476e9e90bd01
[23:57:17] Logged the message, Master