[00:45:03] PROBLEM - Puppet freshness on labsdb1005 is CRITICAL: Last successful Puppet run was Thu 31 Jul 2014 16:06:51 UTC
[01:44:03] PROBLEM - Puppet freshness on dataset1001 is CRITICAL: Last successful Puppet run was Fri 01 Aug 2014 19:17:00 UTC
[01:51:44] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:54:35] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 53293 bytes in 0.095 second response time
[02:17:49] !log LocalisationUpdate completed (1.24wmf15) at 2014-08-04 02:16:46+00:00
[02:17:59] Logged the message, Master
[02:29:02] !log LocalisationUpdate completed (1.24wmf16) at 2014-08-04 02:27:58+00:00
[02:29:08] Logged the message, Master
[02:46:03] PROBLEM - Puppet freshness on labsdb1005 is CRITICAL: Last successful Puppet run was Thu 31 Jul 2014 16:06:51 UTC
[03:12:09] !log LocalisationUpdate ResourceLoader cache refresh completed at Mon Aug 4 03:11:03 UTC 2014 (duration 11m 2s)
[03:12:15] Logged the message, Master
[03:24:13] greg-g: I'm going to start a list of "stuff that needs to get swat deployed after Wikimania]] on Deployments because I have somethings that need backports before they hit non-phase0 wikis, but aren't urgent enough for emergency deploys
[03:25:02] that should have been: Wikimania" on [[Deployments]]
[03:30:42] https://wikitech.wikimedia.org/wiki/Deployments#Week_of_August_11th
[03:30:47] I put odder's stuff in there too
[03:45:03] PROBLEM - Puppet freshness on dataset1001 is CRITICAL: Last successful Puppet run was Fri 01 Aug 2014 19:17:00 UTC
[04:06:24] legoktm: thanks sir. Helpful.
[04:43:28] (03PS2) 10MZMcBride: Grant reedy root [operations/puppet] - 10https://gerrit.wikimedia.org/r/122621 (owner: 10Reedy)
[04:47:03] PROBLEM - Puppet freshness on labsdb1005 is CRITICAL: Last successful Puppet run was Thu 31 Jul 2014 16:06:51 UTC
[05:46:03] PROBLEM - Puppet freshness on dataset1001 is CRITICAL: Last successful Puppet run was Fri 01 Aug 2014 19:17:00 UTC
[06:23:13] PROBLEM - puppet last run on mw1218 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:28:33] PROBLEM - puppet last run on lvs3001 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:43] PROBLEM - puppet last run on mw1099 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:54] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:35:43] PROBLEM - puppet last run on db1006 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:41:14] RECOVERY - puppet last run on mw1218 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[06:45:43] RECOVERY - puppet last run on lvs3001 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[06:46:43] RECOVERY - puppet last run on mw1099 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[06:48:03] PROBLEM - Puppet freshness on labsdb1005 is CRITICAL: Last successful Puppet run was Thu 31 Jul 2014 16:06:51 UTC
[06:48:04] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[06:53:43] RECOVERY - puppet last run on db1006 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[07:47:03] PROBLEM - Puppet freshness on dataset1001 is CRITICAL: Last successful Puppet run was Fri 01 Aug 2014 19:17:00 UTC
[08:49:03] PROBLEM - Puppet freshness on labsdb1005 is CRITICAL: Last successful Puppet run was Thu 31 Jul 2014 16:06:51 UTC
[08:55:54] PROBLEM - Disk space on elastic1016 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 19989 MB (3% inode=99%):
[08:56:53] (03CR) 10JanZerebecki: [C: 031] Set up redirects for toolserver.org [operations/puppet] - 10https://gerrit.wikimedia.org/r/151523 (https://bugzilla.wikimedia.org/60238) (owner: 10Tim Landscheidt)
[08:58:54] PROBLEM - Disk space on elastic1016 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 19997 MB (3% inode=99%):
[09:18:15] <_joe_> akosiaris: oh congratulations
[09:18:46] _joe_: :-)
[09:24:24] morning akosiaris
[09:24:57] could you maybe run this script on dataset to update a link :D https://gerrit.wikimedia.org/r/#/c/151134/
[09:30:38] Nemo_bis: done
[09:37:00] akosiaris: Hm, I'm still seeing the old HTML on https://dumps.wikimedia.org/other/pagecounts-raw/
[09:38:28] yes, I noticed that too, puppet has not run successfully on the host, investigating
[09:41:32] akosiaris: thank you very much for etherpad work
[09:42:51] SAL mentions no intended puppet disabling for test reasons
[09:46:06] matanya: you are welcome
[09:46:20] Nemo_bis: it is not disabled... Error: Command exceeded timeout
[09:48:05] PROBLEM - Puppet freshness on dataset1001 is CRITICAL: Last successful Puppet run was Fri 01 Aug 2014 19:17:00 UTC
[09:51:28] <_joe_> akosiaris: yummy
[09:59:12] (03CR) 10Zhuyifei1999: [C: 04-1] "Per Hoo. This change can't just go deployed and breaking a number of pages without consensus of doing so." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/150301 (https://bugzilla.wikimedia.org/68815) (owner: 10Reedy)
[10:05:53] (03CR) 10Nemo bis: "Zhuyifei1999, "break"? What would be broken? As said above, we need feature requests filed to know what https://www.mediawiki.org/wiki/Ext" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/150301 (https://bugzilla.wikimedia.org/68815) (owner: 10Reedy)
[10:06:33] PROBLEM - puppet last run on snapshot1003 is CRITICAL: CRITICAL: Puppet has 1 failures
[10:09:34] PROBLEM - puppet last run on snapshot1001 is CRITICAL: CRITICAL: Puppet has 1 failures
[10:11:34] PROBLEM - puppet last run on snapshot1004 is CRITICAL: CRITICAL: Puppet has 1 failures
[10:11:43] PROBLEM - puppet last run on snapshot1002 is CRITICAL: CRITICAL: Puppet has 1 failures
[10:13:33] PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: Puppet has 1 failures
[10:16:24] PROBLEM - puppet last run on stat1003 is CRITICAL: CRITICAL: Puppet has 1 failures
[10:20:32] (03CR) 10Tpt: [C: 04-1] "Wikibase doesn't currently cover all features of RelatedSites. I'll write something about missing features (and it would be nice to talk a" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/150301 (https://bugzilla.wikimedia.org/68815) (owner: 10Reedy)
[10:30:42] (03PS1) 10Giuseppe Lavagetto: apache: adding apache::mod_files [operations/puppet] - 10https://gerrit.wikimedia.org/r/151605
[10:30:44] (03PS1) 10Giuseppe Lavagetto: mediawiki: use apache::mod_files [operations/puppet] - 10https://gerrit.wikimedia.org/r/151606
[10:39:54] PROBLEM - Disk space on elastic1016 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 20200 MB (3% inode=99%):
[10:42:40] (03CR) 10Giuseppe Lavagetto: "See the dependent change to see this in action." [operations/puppet] - 10https://gerrit.wikimedia.org/r/151605 (owner: 10Giuseppe Lavagetto)
[10:43:15] (03CR) 10QChris: [WIP] Create icinga alerts if hourly HDFS webrequest imports have missing or duplicate data (037 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/151095 (owner: 10Ottomata)
[10:49:54] PROBLEM - Disk space on elastic1016 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 19803 MB (3% inode=99%):
[10:50:03] PROBLEM - Puppet freshness on labsdb1005 is CRITICAL: Last successful Puppet run was Thu 31 Jul 2014 16:06:51 UTC
[10:55:44] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.01
[11:04:37] ok why are my dataset and snapshots unhappy
[11:07:22] my that's a lot of access log
[11:07:33] activity must have seriously picked up
[11:07:43] RECOVERY - Puppet freshness on dataset1001 is OK: puppet ran at Mon Aug 4 11:07:35 UTC 2014
[11:10:44] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0
[11:22:47] (03PS1) 10Giuseppe Lavagetto: Add fix for hanging fastcgi connections [operations/debs/hhvm] - 10https://gerrit.wikimedia.org/r/151613
[11:23:16] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] Add fix for hanging fastcgi connections [operations/debs/hhvm] - 10https://gerrit.wikimedia.org/r/151613 (owner: 10Giuseppe Lavagetto)
[11:33:07] (03PS1) 10ArielGlenn: turn off dataset rsync to labs til labstore1003 mnt issue is fixed [operations/puppet] - 10https://gerrit.wikimedia.org/r/151614
[11:35:08] (03PS1) 10Gage: Deb for logstash-gelf.jar: liblogstash-gelf-java [operations/debs/logstash-gelf] - 10https://gerrit.wikimedia.org/r/151615
[11:35:32] (03CR) 10ArielGlenn: [C: 032] turn off dataset rsync to labs til labstore1003 mnt issue is fixed [operations/puppet] - 10https://gerrit.wikimedia.org/r/151614 (owner: 10ArielGlenn)
[11:38:15] (03CR) 10Hashar: contint: tie android SDK packages to Precise (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/151048 (owner: 10Hashar)
[11:38:23] (03PS2) 10Hashar: contint: tie android SDK packages to Precise [operations/puppet] - 10https://gerrit.wikimedia.org/r/151048
[11:41:54] PROBLEM - Disk space on elastic1016 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 20071 MB (3% inode=99%):
[11:48:54] PROBLEM - Disk space on elastic1016 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 20090 MB (3% inode=99%):
[11:51:25] (03CR) 10Danmichaelo: [C: 031] Tools: Install php5-imagick [operations/puppet] - 10https://gerrit.wikimedia.org/r/151551 (https://bugzilla.wikimedia.org/69078) (owner: 10Tim Landscheidt)
[11:51:41] (03PS1) 10ArielGlenn: datasets: don't mount labs fs if cron using it is not enabled [operations/puppet] - 10https://gerrit.wikimedia.org/r/151621
[11:54:38] (03CR) 10ArielGlenn: [C: 032] datasets: don't mount labs fs if cron using it is not enabled [operations/puppet] - 10https://gerrit.wikimedia.org/r/151621 (owner: 10ArielGlenn)
[12:02:22] (03PS1) 10ArielGlenn: datasets: really turn off labs cron instead of pretending to [operations/puppet] - 10https://gerrit.wikimedia.org/r/151622
[12:04:04] (03CR) 10ArielGlenn: [C: 032] datasets: really turn off labs cron instead of pretending to [operations/puppet] - 10https://gerrit.wikimedia.org/r/151622 (owner: 10ArielGlenn)
[12:07:15] (03PS1) 10ArielGlenn: datasets: remove labs cron file deps when cron itself is not enabled [operations/puppet] - 10https://gerrit.wikimedia.org/r/151623
[12:08:06] one day I'll get this done
[12:08:31] (03CR) 10Alexandros Kosiaris: [C: 032] contint: tie android SDK packages to Precise [operations/puppet] - 10https://gerrit.wikimedia.org/r/151048 (owner: 10Hashar)
[12:08:50] (03PS2) 10ArielGlenn: datasets: remove labs cron file deps when cron itself is not enabled [operations/puppet] - 10https://gerrit.wikimedia.org/r/151623
[12:08:54] rebase
[12:10:01] (03CR) 10ArielGlenn: [C: 032] datasets: remove labs cron file deps when cron itself is not enabled [operations/puppet] - 10https://gerrit.wikimedia.org/r/151623 (owner: 10ArielGlenn)
[12:38:26] (03CR) 10Alexandros Kosiaris: [C: 04-1] Update for kafka 0.8.1.1-2 packaging (031 comment) [operations/puppet/kafka] - 10https://gerrit.wikimedia.org/r/151193 (owner: 10Ottomata)
[12:44:55] (03PS1) 10ArielGlenn: datasets... still trying to kill labs cron with fire... grrrrr [operations/puppet] - 10https://gerrit.wikimedia.org/r/151627
[12:46:21] (03CR) 10ArielGlenn: [C: 032] datasets... still trying to kill labs cron with fire... grrrrr [operations/puppet] - 10https://gerrit.wikimedia.org/r/151627 (owner: 10ArielGlenn)
[12:47:54] RECOVERY - puppet last run on dataset1001 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[12:51:03] PROBLEM - Puppet freshness on labsdb1005 is CRITICAL: Last successful Puppet run was Thu 31 Jul 2014 16:06:51 UTC
[12:54:54] PROBLEM - Disk space on elastic1016 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 20148 MB (3% inode=99%):
[13:06:26] (03Restored) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/149387 (owner: 10Hashar)
[13:06:28] (03PS2) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/149387
[13:06:32] (03CR) 10jenkins-bot: [V: 04-1] Jenkins job validation (DO NOT SUBMIT) [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/149387 (owner: 10Hashar)
[13:07:12] (03CR) 10Hashar: "recheck" [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/149387 (owner: 10Hashar)
[13:08:43] RECOVERY - puppet last run on snapshot1001 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[13:08:43] RECOVERY - puppet last run on snapshot1002 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[13:10:33] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[13:10:34] RECOVERY - puppet last run on snapshot1004 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[13:14:24] RECOVERY - puppet last run on stat1003 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[13:14:43] RECOVERY - puppet last run on snapshot1003 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[13:17:34] !log stopped labs rsync job from dataset1001, mount of labstore1003 was borked, removed 90GB of stuff on /mnt/data (= /) filesystem, restarted nfsd on dataset1001, dumps back to going
[13:17:38] Logged the message, Master
[13:17:43] Coren, when you have a chance... ^^
[13:21:03] PROBLEM - Puppet freshness on db1007 is CRITICAL: Last successful Puppet run was Mon 04 Aug 2014 11:19:56 UTC
[13:45:44] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.0233333333333
[13:51:53] (03CR) 10Ottomata: Update for kafka 0.8.1.1-2 packaging (031 comment) [operations/puppet/kafka] - 10https://gerrit.wikimedia.org/r/151193 (owner: 10Ottomata)
[14:00:34] RECOVERY - Puppet freshness on db1007 is OK: puppet ran at Mon Aug 4 14:00:31 UTC 2014
[14:08:45] (03PS1) 10Phuedx: Enable GuidedTour extension on tewiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/151639 (https://bugzilla.wikimedia.org/69103)
[14:09:13] PROBLEM - puppetmaster backend https on palladium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:10:04] RECOVERY - puppetmaster backend https on palladium is OK: HTTP OK: Status line output matched 400 - 335 bytes in 0.034 second response time
[14:10:44] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0
[14:20:59] (03PS4) 10Giuseppe Lavagetto: hhvm: provide hhvm-api-$VERSION [operations/debs/hhvm] - 10https://gerrit.wikimedia.org/r/150845
[14:27:33] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] hhvm: provide hhvm-api-$VERSION [operations/debs/hhvm] - 10https://gerrit.wikimedia.org/r/150845 (owner: 10Giuseppe Lavagetto)
[14:32:09] chasemp: yt?
[14:42:45] (03PS1) 10Giuseppe Lavagetto: lintian fixes [operations/debs/hhvm] - 10https://gerrit.wikimedia.org/r/151644
[14:52:03] PROBLEM - Puppet freshness on labsdb1005 is CRITICAL: Last successful Puppet run was Thu 31 Jul 2014 16:06:51 UTC
[14:53:38] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] lintian fixes [operations/debs/hhvm] - 10https://gerrit.wikimedia.org/r/151644 (owner: 10Giuseppe Lavagetto)
[15:02:25] (03PS1) 10Giuseppe Lavagetto: New version of the package [operations/debs/hhvm] - 10https://gerrit.wikimedia.org/r/151648
[15:02:41] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] New version of the package [operations/debs/hhvm] - 10https://gerrit.wikimedia.org/r/151648 (owner: 10Giuseppe Lavagetto)
[15:03:59] (03CR) 10Ottomata: "Disagree. group membership not only gives access to systems, but is also used for file access. Sure, I could access anything as root, bu" [operations/puppet] - 10https://gerrit.wikimedia.org/r/150560 (owner: 10Ottomata)
[15:06:21] anybody avail for a short user account brain bounce?
[15:06:41] godog: yt?
[15:07:45] akosiaris: ? :)
[15:08:02] yup
[15:08:22] ok so
[15:08:30] there is a historical 'stats' user and group
[15:09:12] we previously used this in hadoop to limit access to webrrequest data in hdfs
[15:09:14] files were owned
[15:09:21] hdfs:stats 640
[15:09:38] so you had to be in the stats group ON analytics1010 to access that data in HDFS (from any Hadoop client)
[15:09:45] (e.g. stat1002)
[15:10:04] now that we have the new admin module in puppet
[15:10:16] and we recently reinstalled the cluster
[15:10:19] i want to fix this up and make it better
[15:10:30] currently, there exists an analytics-users data.yaml group
[15:10:40] this group gives access to relevant analytics nodes
[15:11:06] i think that there should be a secondary group
[15:11:12] that gives access to the data
[15:11:17] they would be different groups
[15:11:34] as in the future we will likely want to grant access to hadoop but not all out access to the data
[15:11:49] there is also currently a statistics-privatedata-users group
[15:12:02] this group gives access to stat1002 and sampled udp2log webrequest data there
[15:12:16] so
[15:12:17] perhaps
[15:12:29] analytics-users: gives access to hadoop and relevant analytics nodes
[15:12:45] analytics-privatedata-users: gives access to same nodes, AND privatedata files are group-owned as this
[15:12:54] whatcha think?
[15:14:16] this is a little annoying, because when users submit RTs for hadoop-webrequest log access
[15:14:44] they will need to be added to both of those groups, AND some hadoop commands will need to be run (mk hdfs home dir, chown it properly, etc.)
[15:14:52] so you want to create a group that is going to be a subset of the larger group of analytics user in order to control access to data
[15:15:03] (03PS1) 10Giuseppe Lavagetto: fix changelog entry missing year [operations/debs/hhvm] - 10https://gerrit.wikimedia.org/r/151650
[15:15:13] yes.
[15:15:34] and why add them to both groups ?
[15:15:39] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] "Thanks a lot emacs-debian-changelog..." [operations/debs/hhvm] - 10https://gerrit.wikimedia.org/r/151650 (owner: 10Giuseppe Lavagetto)
[15:15:44] and not just "upgrade" them ?
[15:15:49] i suppose they wouldn't need to be added to the first group: analytics-users
[15:15:49] hm
[15:16:06] hm
[15:16:09] so the two groups are no longer superset/subset ?
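The hdfs:stats 640 scheme described above is plain POSIX owner/group semantics, which HDFS mirrors: on the cluster it would be applied with `hdfs dfs -chown hdfs:stats <path>` and `hdfs dfs -chmod 640 <path>`. A minimal local sketch of the same mode check (the temp file is a hypothetical stand-in for an HDFS data file; the hdfs:stats ownership itself needs root and a Hadoop cluster, so it is only shown in comments):

```shell
#!/bin/sh
# Local illustration of the 640 model from the discussion: owner can
# read/write, members of the file's group can read, everyone else gets
# nothing. HDFS evaluates the same owner/group/mode triple against the
# reader's group list. On the cluster the equivalent would be:
#   hdfs dfs -chown hdfs:stats /path/to/webrequest/data   (hypothetical path)
#   hdfs dfs -chmod 640 /path/to/webrequest/data
f=$(mktemp)                    # stand-in for an HDFS data file
chmod 640 "$f"                 # rw for owner, r for group, nothing for others
perms=$(stat -c '%a' "$f")     # GNU stat: print the octal mode
echo "mode: $perms"
rm -f "$f"
```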
[15:16:29] well, both groups should be included in all the same places
[15:16:37] its just file permissions that will matter
[15:16:54] (03CR) 10Ori.livneh: "I thought about this a lot this weekend and did some digging of some of the old documentation around the Debian Apache 2 file layout, and " [operations/puppet] - 10https://gerrit.wikimedia.org/r/151605 (owner: 10Giuseppe Lavagetto)
[15:16:55] also, we don't currently have anyone in the first group
[15:17:08] hadoop is currently only being used for webrequest data
[15:17:14] so, everyone who wants access wants that
[15:17:33] are there plans for other uses too ?
[15:17:59] yea, aside from more generic use cases
[15:18:07] we plan to have sanitized and aggregated data sets
[15:18:11] that are less sensitive
[15:18:42] the data we have right now is raw logs from varnishkafka
[15:19:15] (03CR) 10Giuseppe Lavagetto: apache: adding apache::mod_files (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/151605 (owner: 10Giuseppe Lavagetto)
[15:19:19] but analytics devs and researchers are working on processes to transform that data into something more easily useable and less private
[15:19:21] that's the goal anyway
[15:19:47] some of that cleaned up data will be exported out to other places for easier access: files, myslq maybe, who knows
[15:19:53] but that is a long time down the road
[15:20:20] (03PS1) 10Ori.livneh: Re-implement apache::mod_conf as custom type [operations/puppet] - 10https://gerrit.wikimedia.org/r/151652
[15:21:05] well both approaches have some merit. If you have the two groups separate, you need to be explicit about which groups you give access to each time and kind of duplicate that effort every time but since it is explicit, it is self documented
[15:21:41] the reverse approach of having the two group being superset/subset means less work every time but less documentation and possibly flexibility
[15:21:55] is subset supported by admin module?
[15:22:14] not automatically, right?
[15:22:17] no... it is not even a posix thing I just invented it
[15:22:22] haha, right
[15:22:27] we'd just have to maintain the list that way
[15:23:15] I think that being separate groups sounds more flexible
[15:23:29] so, users in privatedata would not necessarily be in -users?
[15:23:34] yes
[15:23:46] guess that's fine...
[15:23:46] hm
[15:23:59] ok
[15:24:24] it has some extra maintenance cost but somehow I feel it says better why a user account exists in a system
[15:24:59] why being a list of things in this case
[15:25:44] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.0166666666667
[15:26:29] aye hm
[15:34:03] PROBLEM - Puppet freshness on db1011 is CRITICAL: Last successful Puppet run was Mon 04 Aug 2014 13:33:39 UTC
[15:34:10] (03PS1) 10Ottomata: Create analytics-privatedata-users group [operations/puppet] - 10https://gerrit.wikimedia.org/r/151657
[15:34:50] (03CR) 10jenkins-bot: [V: 04-1] Create analytics-privatedata-users group [operations/puppet] - 10https://gerrit.wikimedia.org/r/151657 (owner: 10Ottomata)
[15:35:34] akosiaris: https://gerrit.wikimedia.org/r/#/c/151657/
[15:35:35] brb
[15:36:08] (03CR) 10Tim Landscheidt: wmflib: add funcs requires_realm() and requires_ubuntu() (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/148422 (owner: 10Ori.livneh)
[15:39:47] I'm running a service on labs and it's gone down 3 times in the last 3 hours- any tips on diagnosing why?
[15:39:54] up until now has been very stable :(
[15:40:18] mvolz: #wikimedia-labs would be a good place to ask
[15:40:26] whoops
[15:40:28] wrong room
[15:40:29] :)
[15:40:31] sorry :)
[15:40:33] np
[15:43:56] (03PS1) 10Alexandros Kosiaris: Moving labsdb1006, labsdb1007 to RAID5 [operations/puppet] - 10https://gerrit.wikimedia.org/r/151661
[15:45:12] (03CR) 10Phuedx: [C: 04-1] "Enabling this extension is currently awaiting approval on tewiki. See the bug for details." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/151639 (https://bugzilla.wikimedia.org/69103) (owner: 10Phuedx)
[15:45:34] (03CR) 10Alexandros Kosiaris: [C: 032] Moving labsdb1006, labsdb1007 to RAID5 [operations/puppet] - 10https://gerrit.wikimedia.org/r/151661 (owner: 10Alexandros Kosiaris)
[15:48:51] (03PS2) 10Ottomata: Create analytics-privatedata-users group [operations/puppet] - 10https://gerrit.wikimedia.org/r/151657
[15:49:32] (03CR) 10jenkins-bot: [V: 04-1] Create analytics-privatedata-users group [operations/puppet] - 10https://gerrit.wikimedia.org/r/151657 (owner: 10Ottomata)
[15:53:31] (03PS3) 10Ottomata: Create analytics-privatedata-users group [operations/puppet] - 10https://gerrit.wikimedia.org/r/151657
[15:53:34] RECOVERY - Puppet freshness on db1011 is OK: puppet ran at Mon Aug 4 15:53:32 UTC 2014
[15:54:11] (03CR) 10jenkins-bot: [V: 04-1] Create analytics-privatedata-users group [operations/puppet] - 10https://gerrit.wikimedia.org/r/151657 (owner: 10Ottomata)
[15:55:03] (03PS4) 10Ottomata: Create analytics-privatedata-users group [operations/puppet] - 10https://gerrit.wikimedia.org/r/151657
[16:00:21] (03CR) 10Ori.livneh: Nutcracker: move declaration to role::mediawiki; parametrize (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/149800 (owner: 10Ori.livneh)
[16:02:13] (03CR) 10Ori.livneh: Nutcracker: move declaration to role::mediawiki; parametrize (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/149800 (owner: 10Ori.livneh)
[16:02:21] (03PS5) 10Ori.livneh: Nutcracker: move declaration to role::mediawiki; parametrize [operations/puppet] - 10https://gerrit.wikimedia.org/r/149800
[16:06:57] (03CR) 10Ottomata: Update for kafka 0.8.1.1-2 packaging (031 comment) [operations/puppet/kafka] - 10https://gerrit.wikimedia.org/r/151193 (owner: 10Ottomata)
[16:07:22] (03PS2) 10Ottomata: Update for kafka 0.8.1.1-2 packaging [operations/puppet/kafka] - 10https://gerrit.wikimedia.org/r/151193
[16:31:35] (03CR) 10Ottomata: [WIP] Create icinga alerts if hourly HDFS webrequest imports have missing or duplicate data (037 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/151095 (owner: 10Ottomata)
[16:35:12] (03CR) 10Vogone: [C: 031] "It seems like centralauth-merge is assigned to "*" by default, anyway. So I'd like to see this merged and centralauth-rename/centralauth-m" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/139655 (owner: 10Gerrit Patch Uploader)
[16:36:11] (03CR) 10Ori.livneh: "cherry-picked on labs; it applied correctly" [operations/puppet] - 10https://gerrit.wikimedia.org/r/149800 (owner: 10Ori.livneh)
[16:45:44] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0
[16:53:03] PROBLEM - Puppet freshness on labsdb1005 is CRITICAL: Last successful Puppet run was Thu 31 Jul 2014 16:06:51 UTC
[16:58:42] (03PS1) 10BryanDavis: hhvm: Enable admin server [operations/puppet] - 10https://gerrit.wikimedia.org/r/151675
[17:02:18] (03CR) 10BryanDavis: "Merging shouldn't hurt anything, but it would probably be more useful to wait for Giuseppe to tell me where he'd like the associated Apach" (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/151675 (owner: 10BryanDavis)
[17:02:51] ottomata|lunch: about now man :)
[17:07:23] (still chowing a bit)
[17:07:33] chasemp: https://gerrit.wikimedia.org/r/#/c/151657/
[17:07:54] had a discussion with akosiaris, there are kind of 2 levels of users in hadoop, and we need both
[17:07:57] ah I'm rush in gerrit :)
[17:08:01] ah!
[17:08:04] because it's fun to confuse everyone even me!
[17:08:19] heh
[17:08:26] so, yeah, they both grant access to hadoop
[17:08:42] but the privatedata files will be only group-readable by the analytics-privatedata-users grou
[17:08:48] (unless you have a better name)
[17:09:15] it's ugly but sane I guess
[17:11:37] not being a hadoop genius I think this is a reasonable approach :)
[17:11:43] what prompted this since there seems to be no normal analytics-users?
[17:12:05] re: https://gerrit.wikimedia.org/r/#/c/150560/
[17:12:35] I think you may be right, my commentary was mostly for the case of three classes: 1. a, b, c 2. a, b 3. a
[17:12:50] in that kind of case 1 should not be a member of 3 as there are no rights being assinged
[17:13:09] like someone in something-admins shouldn't be in something-users unless there are some non-overlapping grants
[17:13:24] but if hdfs is the exception, cool with me, but I do think it's the exception and should not be the rule
[17:13:25] chasemp, right now we only use hadoop for private data crunching
[17:13:35] but the intention is to generate santized and aggregate datasets for more general use
[17:13:44] so we want to be able to restrict access to the raw data files
[17:14:00] for analytics-admins, its a little different
[17:14:06] analytics-admins get access to more nodes
[17:14:09] (not just hadoop client nodes)
[17:14:13] ah, sounds good, maybe more verbose descriptions of role intentions? you can do multiline descriptions fyi
[17:14:14] and also sudo to certain users
[17:14:19] ok cool
[17:14:20] the yaml syntax is flexible
[17:14:29] yeah ok, i'll do that for both in just a bit...
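The multiline-description suggestion above would look roughly like this in the admin module's data.yaml. This is a hypothetical sketch: the field names and layout are an assumption based on what the log mentions (groups, members, descriptions), not the module's actual schema, and the gid is a placeholder.

```yaml
# Hypothetical data.yaml fragment; exact schema of the operations/puppet
# admin module is assumed, not confirmed by the log.
groups:
  analytics-privatedata-users:
    gid: 700            # placeholder, not a real allocation
    description: >
      Grants access to the Hadoop client nodes and, via group ownership
      of the raw webrequest files in HDFS (mode 640), to the private
      data itself. Kept separate from analytics-users rather than as a
      superset/subset; membership is maintained explicitly per user.
    members: []
```

YAML's folded block scalar (`>`) is what makes the multiline description convenient here: it joins the wrapped lines into a single string.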
[17:19:13] (03PS6) 10Ottomata: [WIP] Create icinga alerts if hourly HDFS webrequest imports have missing or duplicate data [operations/puppet] - 10https://gerrit.wikimedia.org/r/151095
[17:22:17] (03PS5) 10Ottomata: Create analytics-privatedata-users group [operations/puppet] - 10https://gerrit.wikimedia.org/r/151657
[17:27:29] (03PS2) 10Ottomata: Create analytics-admins group with qchris as member [operations/puppet] - 10https://gerrit.wikimedia.org/r/150560
[17:27:41] ok chasemp, i think for the analytics-admins case, you are right, i don't need to be in that group
[17:27:49] but I should be in the analytics-privatedata-users group
[17:27:59] anyway, cool, both changes amended with better descriptions
[17:30:19] akosiaris: aside from that typo (fixed), are you ok with the kafka puppet change?
[17:30:29] https://gerrit.wikimedia.org/r/#/c/151193/
[17:32:36] that makes sense to me man
[17:33:24] PROBLEM - puppet last run on mw1053 is CRITICAL: CRITICAL: Puppet has 1 failures
[17:34:15] ottomata: could you make a note in the privatedata group just that even ops needs to be a member here to access it?
[17:34:35] generally the assumption shouuld be ops can see all / do all so noting exceptions would be good for us non analytics savy folks
[17:34:43] hmmmmm, could put root in that group
[17:36:39] hmmm...kinda meh on that idea but not sure either way really, maybe in general it should be explicit for even ops to get private data?
[17:37:07] I'm really open to either way but I like the idea of the only ppl being the ppl on that list I guess
[17:37:19] and getting root not being a 'side door' but idk
[17:38:36] well, same same, right, ideally if you have root access you'd be able to access anything
[17:38:50] and technically, you could (just edit sudoers, or usermod ...)
[17:39:04] you have a point, but that should leave a trail..?
[17:39:04] idk man
[17:39:15] actually, i do think root shoudl ahve access to this anyway, just like it does on stat1002 with the sampled data
[17:39:24] ha, actually
[17:39:27] you could if you were root
[17:39:29] sudo -u hdfs
[17:39:32] is how I usually do it
[17:39:33] heh
[17:39:39] (hdfs is the hdfs super usr)
[17:39:51] sure but leaves a trail right? of a superuser accessing privatedata
[17:39:55] unsure if matters
[17:40:12] but in terms of truly privatedata I've gone with the assumption that administrative users access
[17:40:22] should always result in a log somewhere
[17:40:22] hmmmm, you know, with that as a though, i think its ok not to give all ops group membership.
[17:40:24] sudo -u hdfs makes sense
[17:40:29] agreed
[17:40:38] its equivalent to just sudo normally
[17:40:49] and we don't need root in there either
[17:40:58] although...
[17:40:59] HM
[17:41:01] ha
[17:41:19] so, one of the use cases for this data is being able to process is and look for say, DoS attack IPs or whatever
[17:41:34] and, it will be easier to interact with this data if ops folks do so just like other analytics folks
[17:41:38] e.g.
[17:41:40] hive --database wmf_raw
[17:41:40] vs
[17:41:45] sudo -u hdfs hive database wmf_raw
[17:41:54] so hm
[17:42:05] maybe it is good to have opsen in privatedata-users?"
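The two access paths weighed above (direct reads via group membership, versus going through the hdfs superuser with sudo, which leaves an audit trail in the system logs) can be sketched as a simple group-membership check. The group name and both hive invocations come straight from the log; the `in_group` helper itself is hypothetical glue around the standard `id` utility.

```shell
#!/bin/sh
# Sketch of the access decision from the discussion. `id -nG USER`
# prints the user's group names; membership in
# analytics-privatedata-users would allow direct reads of the 640
# group-owned HDFS files, otherwise an op falls back to the audited
# superuser path (sudo invocations are logged by the system).
in_group() {
  # true if user $1 belongs to group $2
  id -nG "$1" | tr ' ' '\n' | grep -qx "$2"
}

if in_group "$(id -un)" analytics-privatedata-users; then
  echo "direct: hive --database wmf_raw"
else
  echo "fallback: sudo -u hdfs hive --database wmf_raw"
fi
```

Which branch prints depends on the machine running it; the point is only that the check, not root, is what gates the data.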
[17:44:09] ah just noticed you continued the thought on :)
[17:44:13] ha
[17:44:33] well that makes sense too actually either way :)
[17:45:18] honestly this kind of thing depends on frequency and sanity of access
[17:45:28] if it's a once a year kind of thing sudo -u hdfs is cool with me you know
[17:45:40] I've not ever used it in the time I've been here but maybe others have
[17:45:48] kind of a tossup I leave to your sensabilities
[17:46:31] (03PS2) 10ArielGlenn: data retention audit script for logs, /root and /home dirs [operations/software] - 10https://gerrit.wikimedia.org/r/141473
[17:46:36] (03CR) 10jenkins-bot: [V: 04-1] data retention audit script for logs, /root and /home dirs [operations/software] - 10https://gerrit.wikimedia.org/r/141473 (owner: 10ArielGlenn)
[17:47:15] hm, ok, chasemp, let's not bother with it now, as it is just another extra administrative thing to do
[17:47:20] if it becomse a thing, we can always add people alter
[17:47:39] sounds good
[17:48:49] ok, gimme your +1s if you got 'em :)
[17:49:49] (03CR) 10Rush: [C: 031] Create analytics-privatedata-users group [operations/puppet] - 10https://gerrit.wikimedia.org/r/151657 (owner: 10Ottomata)
[17:50:25] (03CR) 10Rush: [C: 031] Create analytics-admins group with qchris as member [operations/puppet] - 10https://gerrit.wikimedia.org/r/150560 (owner: 10Ottomata)
[17:50:44] (03PS3) 10ArielGlenn: data retention audit script for logs, /root and /home dirs [operations/software] - 10https://gerrit.wikimedia.org/r/141473
[17:50:47] (03CR) 10jenkins-bot: [V: 04-1] data retention audit script for logs, /root and /home dirs [operations/software] - 10https://gerrit.wikimedia.org/r/141473 (owner: 10ArielGlenn)
[17:57:43] (03CR) 10Alexandros Kosiaris: [C: 031] Update for kafka 0.8.1.1-2 packaging [operations/puppet/kafka] - 10https://gerrit.wikimedia.org/r/151193 (owner: 10Ottomata)
[17:58:05] thanks
[18:02:45] (03CR) 10Ori.livneh: [C: 032] hhvm: Enable admin server [operations/puppet] - 10https://gerrit.wikimedia.org/r/151675 (owner: 10BryanDavis)
[18:19:52] (03PS1) 10Dr0ptp4kt: Log Internet.org via header in X-Analytics when appropriate [operations/puppet] - 10https://gerrit.wikimedia.org/r/151687
[18:21:08] (03CR) 10Dr0ptp4kt: [C: 04-1] "Do not merge yet. Discussing if this is the okay with Analytics." [operations/puppet] - 10https://gerrit.wikimedia.org/r/151687 (owner: 10Dr0ptp4kt)
[18:21:55] (03PS2) 10Dr0ptp4kt: WIP: Log Internet.org via header in X-Analytics when appropriate [operations/puppet] - 10https://gerrit.wikimedia.org/r/151687
[18:34:03] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]
[18:42:20] ottomata: what you want to know about passive checks ?
[18:45:07] ah, just trying to understand how they work, i guess: so
[18:45:13] hm, actually
[18:45:23] i'm about to do kafka upgrade with gage, need to think about this more
[18:45:30] will ask you more later akosiaris, thanks :)
[18:45:42] later being tomorrow for me :-)
[18:45:46] sure
[18:45:52] ja i'm not going to get to it today
[18:45:53] ok.. have a nice upgrade
[18:47:03] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[18:47:03] PROBLEM - Ubuntu mirror in sync with upstream on carbon is CRITICAL: /srv/ubuntu/project/trace/carbon.wikimedia.org is over 12 hours old.
[18:47:38] (03PS1) 10Yurik: Remove X-CS pre-deliver filtering [operations/puppet] - 10https://gerrit.wikimedia.org/r/151693
[18:48:04] RECOVERY - Ubuntu mirror in sync with upstream on carbon is OK: /srv/ubuntu/project/trace/carbon.wikimedia.org is over 0 hours old.
[18:48:32] bblack, https://gerrit.wikimedia.org/r/151693 [18:48:37] should make you happy [18:48:53] (03PS2) 10QChris: Remove X-CS pre-deliver filtering [operations/puppet] - 10https://gerrit.wikimedia.org/r/151693 (https://bugzilla.wikimedia.org/69112) (owner: 10Yurik) [18:49:31] greg-g: https://gerrit.wikimedia.org/r/#/c/151691/ probably qualifies as an emergency deploy [18:50:10] (03CR) 10Ottomata: [C: 032 V: 032] Update for kafka 0.8.1.1-2 packaging [operations/puppet/kafka] - 10https://gerrit.wikimedia.org/r/151193 (owner: 10Ottomata) [18:51:02] (03CR) 10BBlack: [C: 032 V: 032] "Awesome" [operations/puppet] - 10https://gerrit.wikimedia.org/r/151693 (https://bugzilla.wikimedia.org/69112) (owner: 10Yurik) [18:51:11] bblack, thx! [18:51:16] yurikR1: btw, can we also now kill that other if-check with the comment: Assuming 3 month caching, the !~ "zero=" check can be removed after 6/20/2014. [18:51:25] (03PS1) 10Ottomata: Update kafka submodule for 0.8.1.1 upgrade [operations/puppet] - 10https://gerrit.wikimedia.org/r/151694 [18:51:26] bblack, yep [18:52:48] (03CR) 10Rush: "is this meant to merge today? unsure of timeline" [operations/puppet] - 10https://gerrit.wikimedia.org/r/150263 (owner: 10Andrew Bogott) [18:54:03] PROBLEM - Puppet freshness on labsdb1005 is CRITICAL: Last successful Puppet run was Thu 31 Jul 2014 16:06:51 UTC [18:54:56] (03PS1) 10Yurik: Remove obsolete zero= check [operations/puppet] - 10https://gerrit.wikimedia.org/r/151695 [18:54:59] bblack, ^ [18:56:02] (03CR) 10BBlack: [C: 032 V: 032] Remove obsolete zero= check [operations/puppet] - 10https://gerrit.wikimedia.org/r/151695 (owner: 10Yurik) [18:56:06] yurikR1: thanks! 
[18:56:10] np [18:57:07] !log beginning kafka upgrade: disabling puppet on brokers [18:57:11] Logged the message, Master [18:58:19] legoktm: thanks, on it [18:59:01] (03PS2) 10Ottomata: Update kafka submodule for 0.8.1.1 upgrade [operations/puppet] - 10https://gerrit.wikimedia.org/r/151694 [18:59:07] (03CR) 10Ottomata: [C: 032 V: 032] Update kafka submodule for 0.8.1.1 upgrade [operations/puppet] - 10https://gerrit.wikimedia.org/r/151694 (owner: 10Ottomata) [19:02:46] !log maxsem Synchronized php-1.24wmf16/includes/User.php: https://gerrit.wikimedia.org/r/#/c/151691/ (duration: 00m 06s) [19:02:51] Logged the message, Master [19:03:03] PROBLEM - Disk space on elastic1016 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 19980 MB (3% inode=99%): [19:07:55] !log starting upgrade of kafka cluster [19:08:00] Logged the message, Master [19:08:23] PROBLEM - Kafka Broker Server on analytics1012 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args kafka.Kafka /etc/kafka/server.properties [19:08:55] sorry about the page, i failed to tell icinga about maintenence [19:09:05] i was about to say someone forgot to tell icinga ;] [19:09:06] ok [19:09:23] RECOVERY - Kafka Broker Server on analytics1012 is OK: PROCS OK: 1 process with command name java, args kafka.Kafka /etc/kafka/server.properties [19:09:27] (03CR) 10Yurik: "those are .erb files - templates. 
We should be able to have one common file included from both text & zero" (032 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/151687 (owner: 10Dr0ptp4kt) [19:10:54] looks like we have a regression in /usr/sbin/kafka, list-topic is not supported [19:10:57] oop [19:11:23] PROBLEM - puppetmaster https on palladium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:12:13] RECOVERY - puppetmaster https on palladium is OK: HTTP OK: Status line output matched 400 - 335 bytes in 0.044 second response time [19:12:33] PROBLEM - Kafka Broker Under Replicated Partitions on analytics1018 is CRITICAL: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value CRITICAL: 13.0 [19:12:33] PROBLEM - Kafka Broker Under Replicated Partitions on analytics1021 is CRITICAL: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value CRITICAL: 14.0 [19:17:24] ACKNOWLEDGEMENT - Kafka Broker Under Replicated Partitions on analytics1018 is CRITICAL: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value CRITICAL: 13.0 ottomata Doing kafka upgrade to 0.8.1.1 [19:17:46] ACKNOWLEDGEMENT - Kafka Broker Under Replicated Partitions on analytics1021 is CRITICAL: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value CRITICAL: 14.0 ottomata Doing kafka upgrade to 0.8.1.1 [19:22:18] (03PS3) 10Dr0ptp4kt: WIP: Log Internet.org via header in X-Analytics when appropriate [operations/puppet] - 10https://gerrit.wikimedia.org/r/151687 [19:23:28] (03PS4) 10Dr0ptp4kt: WIP: Log Internet.org via header in X-Analytics when appropriate [operations/puppet] - 10https://gerrit.wikimedia.org/r/151687 [19:28:35] (03CR) 10Dr0ptp4kt: "@BBlack, mind commenting on the feasibility of avoiding the && here in PS2? Is that safe and performant? 
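The acknowledged alerts above show the expected pattern during a rolling broker upgrade: UnderReplicatedPartitions spikes (13, 14) while a restarted broker catches up from its leaders, then drops back to zero. A hedged sketch of the gate this implies, with the metric value passed in directly rather than read from JMX, and the function name made up for illustration:

```python
def safe_to_proceed(under_replicated_partitions):
    """During a rolling Kafka upgrade, only move on to the next broker
    once the cluster reports zero under-replicated partitions, i.e. the
    just-restarted broker has fully caught up with the partition leaders.
    Restarting another broker while this is nonzero risks losing the
    only in-sync replica for some partitions."""
    return under_replicated_partitions == 0
```

In practice the value would come from the same `kafka.server.ReplicaManager.UnderReplicatedPartitions` metric that icinga is alerting on here.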
Note, in PS4 I put this as an inc" (032 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/151687 (owner: 10Dr0ptp4kt) [19:32:37] (03CR) 10Yurik: "adam, not a significant perf impact, just cleaner - why have a redundant check if it slows down understanding? Shorter & more readable cod" [operations/puppet] - 10https://gerrit.wikimedia.org/r/151687 (owner: 10Dr0ptp4kt) [19:36:04] PROBLEM - Disk space on elastic1016 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 20148 MB (3% inode=99%): [19:37:45] (03CR) 10Dr0ptp4kt: "Fair enough, if we're comfortable with its truthiness (and that's good enough in other parts of the code, it seems). I'll update it." [operations/puppet] - 10https://gerrit.wikimedia.org/r/151687 (owner: 10Dr0ptp4kt) [19:38:28] (03PS5) 10Dr0ptp4kt: WIP: Log Internet.org via header in X-Analytics when appropriate [operations/puppet] - 10https://gerrit.wikimedia.org/r/151687 [19:38:57] yurikR1: i think in the past the reason we were concerned about the value being /present/ at all was that we were doing somewhat more expensive regex checks. [19:39:17] that is, looking at stuff like user-agent as in mobile-frontend.inc.vcl.erb [19:39:54] yurikR1: but in any case we'll need for bblack and qchris to verify https://gerrit.wikimedia.org/r/151687 looks safe and efficient enough [19:40:55] dr0ptp4kt, technically, regex check would also have the same speed, as it would need to check internally for the value's presence [19:41:23] yurikR1: yeah, one would hope! [19:41:57] yurikR1: although i don't know the compile step behavior (varnish compile down nor regex compile) [19:42:12] dr0ptp4kt, but all that is besides the point - i'm not sure we should have the via= value :) [19:42:20] apergos: ping [19:42:26] yurikR1: what would be the alternative? [19:42:34] hoo: very gone [19:42:37] what's up? [19:43:04] apergos: json dumps for Wikidata failed today [19:43:18] is that due to troubles with the file systems maybe? 
[19:43:20] dataset1001 ran out of space on / [19:43:23] (mounted nfs) [19:43:25] ah :S [19:43:30] due to the same issue [19:43:35] ouch [19:43:38] was writing to local fs instead of the nfs mount [19:43:45] that's also why it failed puppet I suppose? [19:43:49] so I would guess that is what got you [19:43:55] ok, I see [19:43:56] oh yes, the puppet failed run was exactly that [19:44:09] snaps all had failed puppet for a while too [19:44:20] all from one stinkin nfs mount [19:45:30] apergos: if puppet is ok can you log it? A.Ko. fought with it a bit this morning :) [19:45:45] oh I logged what I did [19:45:55] and a kos knows as well [19:46:12] I'll just update my bug then [19:46:13] apergos: Any chance we can re-start the wikidata json if stuff is alright now? [19:46:23] PROBLEM - Kafka Broker Server on analytics1012 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args kafka.Kafka /etc/kafka/server.properties [19:46:27] heh heh I figured that is what it was coming down to [19:46:34] * hoo hides [19:46:35] tomorrow afternoon my time [19:46:40] ok [19:46:48] you coming to London, btw? [19:46:52] no sadly [19:46:57] :/ Ok [19:47:02] but the whole shop can't go anyways, someone has to mind the store [19:47:12] Yeah, that's true [19:47:18] <_joe_> this kafka broker thing paged me 3 times [19:47:24] <_joe_> is someone looking at it? [19:47:30] still paging? [19:47:45] ottomata: ^^ [19:48:04] Ok, leaving now (am already in London and didn't plan to stay around much longer) [19:48:13] yes [19:48:14] would be super awesome to get stuff restarted [19:48:16] still looking [19:48:20] _joe_ [19:48:23] i do not get pages for this!
[19:48:23] RECOVERY - Kafka Broker Server on analytics1012 is OK: PROCS OK: 1 process with command name java, args kafka.Kafka /etc/kafka/server.properties [19:48:24] very sorry [19:48:28] ahhh [19:48:35] ah i meant to do maintanence for that, sorry [19:48:41] if you can turn off paging til it's fixed up [19:48:53] i didn't finish clicking the submit button on icinga [19:48:59] ah :-D [19:49:05] >click< just do it [19:49:27] I get the pages too but I'm already in here [19:49:36] in a little while though I won't be, so.... [19:50:09] ok, i think i've got maintenance schedule for this [19:50:35] yurikR1: ok i just saw your email (trying to keep email closed, alas). will think about it, and we can discuss further tomorrow. [19:51:25] great [19:55:03] PROBLEM - Disk space on elastic1016 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 20090 MB (3% inode=99%): [19:59:13] PROBLEM - check google safe browsing for wikinews.org on google is CRITICAL: Connection timed out [19:59:33] PROBLEM - puppet last run on analytics1012 is CRITICAL: CRITICAL: Puppet has 1 failures [20:00:05] RECOVERY - check google safe browsing for wikinews.org on google is OK: HTTP OK: HTTP/1.1 200 OK - 3913 bytes in 0.096 second response time [20:02:03] PROBLEM - Disk space on elastic1016 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 20141 MB (3% inode=99%): [20:04:33] (03PS2) 10Giuseppe Lavagetto: Apache config for Wikipedia using mod_proxy_fcgi [operations/puppet] - 10https://gerrit.wikimedia.org/r/147441 (owner: 10Reedy) [20:05:15] (03CR) 10Giuseppe Lavagetto: [C: 031] "noop on precise, will allow moving to hhvm testwiki." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/147441 (owner: 10Reedy) [20:05:40] (03CR) 10Reedy: [C: 04-1] Apache config for Wikipedia using mod_proxy_fcgi (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/147441 (owner: 10Reedy) [20:05:46] _joe_: brackets don't match [20:05:53] (zh(-(hans|hant|cn|hk|sg|tw))|sr(-(ec|el)) [20:06:12] <_joe_> Reedy: ok [20:06:14] needs an extra ) [20:06:42] ori noticed/fixed it in some other commit I think [20:07:07] and retry=0, i think [20:07:55] https://github.com/wikimedia/operations-puppet/commits/production?author=atdt [20:08:01] was it in the beta cluster config iirc? [20:08:48] https://github.com/wikimedia/operations-apache-config/commits/betacluster [20:09:08] yeah, remove leading ( [20:09:51] i'll amend [20:10:10] I was just about to :) [20:11:03] PROBLEM - Disk space on elastic1016 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 19903 MB (3% inode=99%): [20:12:29] (03PS3) 10Ori.livneh: Apache config for Wikipedia using mod_proxy_fcgi [operations/puppet] - 10https://gerrit.wikimedia.org/r/147441 (owner: 10Reedy) [20:13:13] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0] [20:16:44] PROBLEM - Kafka Broker Messages In on analytics1012 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 0.0 [20:16:44] PROBLEM - Kafka Broker Replica Lag on analytics1012 is CRITICAL: kafka.server.ReplicaFetcherManager.Replica-MaxLag.Value CRITICAL: 17204457.0 [20:16:59] woo [20:17:11] nice [20:17:12] that's fine [20:17:15] should have downtimed that too [20:17:17] Reedy: +1? [20:17:22] hm, doesn't icinga have the concept of dependent services? 
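The bracket mismatch Reedy spotted in `(zh(-(hans|hant|cn|hk|sg|tw))|sr(-(ec|el))` can be confirmed mechanically, and ori's fix (removing the leading `(`) checked the same way. A small sketch of the check; it is deliberately crude and ignores escaped parens and character classes, neither of which appears in this Apache language-variant pattern:

```python
def open_parens(pattern):
    """Return how many '(' remain unclosed at the end of `pattern`
    (0 means the parentheses balance)."""
    depth = 0
    for ch in pattern:
        if ch == "(":
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
    return depth

broken = "(zh(-(hans|hant|cn|hk|sg|tw))|sr(-(ec|el))"
fixed = broken[1:]   # ori's fix: drop the leading "("
```

Reedy's alternative fix (appending an extra `)`) would balance it too; which one is right depends on what the surrounding Apache config expects to capture.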
[20:17:33] RECOVERY - puppet last run on analytics1012 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [20:18:18] ACKNOWLEDGEMENT - Kafka Broker Messages In on analytics1012 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 0.0 ottomata Doing upgrade [20:18:18] ACKNOWLEDGEMENT - Kafka Broker Replica Lag on analytics1012 is CRITICAL: kafka.server.ReplicaFetcherManager.Replica-MaxLag.Value CRITICAL: 17204457.0 ottomata Doing upgrade [20:25:07] (03CR) 10Ori.livneh: [C: 032] Apache config for Wikipedia using mod_proxy_fcgi [operations/puppet] - 10https://gerrit.wikimedia.org/r/147441 (owner: 10Reedy) [20:25:13] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [20:27:19] pssh [20:27:33] RECOVERY - Kafka Broker Under Replicated Partitions on analytics1018 is OK: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value OKAY: 0.0 [20:27:50] ACKNOWLEDGEMENT - Kafka Broker Messages In on analytics1012 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 0.0 ottomata Doing upgrade [20:27:50] ACKNOWLEDGEMENT - Kafka Broker Replica Lag on analytics1012 is CRITICAL: kafka.server.ReplicaFetcherManager.Replica-MaxLag.Value CRITICAL: 7493438.0 ottomata Doing upgrade [20:28:44] RECOVERY - Kafka Broker Replica Lag on analytics1012 is OK: kafka.server.ReplicaFetcherManager.Replica-MaxLag.Value OKAY: 0.0 [20:30:33] RECOVERY - Kafka Broker Under Replicated Partitions on analytics1021 is OK: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value OKAY: 0.0 [20:31:44] RECOVERY - Kafka Broker Messages In on analytics1012 is OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate OKAY: 4957.74601915 [20:37:37] !log stopping puppet on analytics1027 to temporarily disable camus cron job [20:37:42] Logged the message, Master [20:38:01] (03PS1) 10Ori.livneh: HHVM: add ::hhvm::status 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/151772 [20:49:03] PROBLEM - Puppet freshness on analytics1022 is CRITICAL: Last successful Puppet run was Mon 04 Aug 2014 18:48:31 UTC [20:55:03] PROBLEM - Puppet freshness on labsdb1005 is CRITICAL: Last successful Puppet run was Thu 31 Jul 2014 16:06:51 UTC [20:57:03] PROBLEM - Puppet freshness on analytics1021 is CRITICAL: Last successful Puppet run was Mon 04 Aug 2014 18:56:13 UTC [20:58:02] manybubbles: yt? [20:58:21] ottomata: yeah! I'm here for another half our or so [20:58:33] ok [20:58:33] so [20:58:40] elastic1016 is lowish on disk space [20:58:47] commonswiki_file is taking up a tone of space there [20:58:56] there are 8 shards of it on 1016 [20:58:59] vs 2 everywhere else [20:59:19] ton* [20:59:24] can/should I try moving one off? [21:00:17] maybe a couple of them should move to 1017? [21:00:50] naw, 1017 doesn't have that much free space either [21:01:04] 1002 has a bunch of free space [21:02:33] RECOVERY - Puppet freshness on analytics1022 is OK: puppet ran at Mon Aug 4 21:02:30 UTC 2014 [21:02:45] manybubbles: thoughts? 
[21:03:26] ottomata: you can sure try to move the shards off of it [21:03:33] the best thing to do would be to swap with a small shard [21:03:48] because elasticsearch will want to try to balance the number of shards [21:03:53] hm ok [21:03:56] ottomata: https://github.com/elasticsearch/elasticsearch/issues/7155 will help [21:04:07] ottomata: another thing you can do is fiddle with the disk threshold [21:04:32] ottomata: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-allocation.html [21:06:33] PROBLEM - Kafka Broker Under Replicated Partitions on analytics1012 is CRITICAL: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value CRITICAL: 12.0 [21:11:03] PROBLEM - Disk space on elastic1016 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 19891 MB (3% inode=99%): [21:14:42] (03CR) 10Bsimmers: [C: 031] "lgtm" [operations/puppet] - 10https://gerrit.wikimedia.org/r/151772 (owner: 10Ori.livneh) [21:15:04] PROBLEM - Disk space on elastic1016 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 20115 MB (3% inode=99%): [21:19:55] manybubbles: there are only 2 of any given shard on elastic1002 [21:20:05] even though the total number of shards is about right [21:20:10] it just has a lot more variety of shards [21:20:16] of indexes* [21:20:38] ottomata: I imagine it is because elasticsearch just doesn't consider shard size in balancing yet [21:21:16] aye, but i mean, i'm looking for a shard to consider swapping [21:21:22] and there's only 2 of each shard on 1002 [21:21:25] so...hard to pick! [21:21:37] ottomata: just grab a small one [21:23:46] this one is tiny!
[21:23:46] "average_size": 4.675518721342087e-05, [21:23:46] "shards": 1, [21:23:46] "index": "xhwikibooks_content_1403850510" [21:23:47] hah [21:25:41] ok manybubbles, moving a shard off of 1002 onto 1016, and a commonswiki shard from 1016 to 1002 [21:26:14] ottomata: I'm going to log out now because of airplane [21:26:20] less than required [15.0%] free disk on node, [21:26:21] !!! [21:26:21] ok [21:26:40] ok, i guess i'm going to just move the 1016 commonswikifile one first [21:26:42] wish me luck [21:26:44] have a good flight! [21:27:03] ottomata: it'll only move stuff off the node if it drops below some other threshold - like 5% [21:27:07] you can change the threshold [21:27:12] but now I'll log out [21:27:14] ok [21:27:33] RECOVERY - Kafka Broker Under Replicated Partitions on analytics1012 is OK: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value OKAY: 0.0 [21:29:04] RECOVERY - Disk space on elastic1016 is OK: DISK OK [21:30:46] :) [21:31:51] (03PS1) 10Matanya: Add subpages to main namespace on FDC wiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/151781 [21:33:23] RECOVERY - Puppet freshness on analytics1021 is OK: puppet ran at Mon Aug 4 21:33:11 UTC 2014 [21:35:35] come on jenkins [21:36:33] PROBLEM - Kafka Broker Under Replicated Partitions on analytics1012 is CRITICAL: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value CRITICAL: 16.0 [21:41:17] matanya: might be broken [21:41:20] jenkins I mean [21:41:32] (03CR) 10Hashar: "recheck" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/151781 (owner: 10Matanya) [21:41:52] hashar: when you tell me things are broken, it is most likely jenkins or zuul :D [21:42:13] thanks for letting me know [21:42:39] and fixing [21:43:02] matanya: yeah there is an ongoing crazy bug with zuul/jenkins/gearman i have not figured out :( [21:43:19] yet [21:46:57] !log all kafka brokers upgraded to 0.8.1.1 and data replicated: done [21:47:03] Logged the message, Master [21:47:04]
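The shard swap ottomata performs here (a big commonswiki_file shard off full elastic1016, a tiny shard the other way so Elasticsearch's count-based balancing doesn't undo the move) maps onto the `_cluster/reroute` API. A sketch of the request body; the shard numbers are chosen arbitrarily since the log doesn't name them, while node and index names come from the log:

```python
import json

def reroute_swap(big_index, big_shard, small_index, small_shard,
                 full_node, free_node):
    """Build a _cluster/reroute body that moves a large shard off the
    full node and a small shard onto it, keeping per-node shard counts
    unchanged so the balancer has no reason to move anything back."""
    return {
        "commands": [
            {"move": {"index": big_index, "shard": big_shard,
                      "from_node": full_node, "to_node": free_node}},
            {"move": {"index": small_index, "shard": small_shard,
                      "from_node": free_node, "to_node": full_node}},
        ]
    }

# Hypothetical shard numbers; POST this to http://<host>:9200/_cluster/reroute
body = json.dumps(reroute_swap("commonswiki_file", 3,
                               "xhwikibooks_content_1403850510", 0,
                               "elastic1016", "elastic1002"))
```

The "less than required [15.0%] free disk on node" message ottomata hits is the disk-based allocation watermark manybubbles pointed at; it blocks allocating *to* a low-disk node but only evacuates it below a second, lower threshold.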
yay! [21:47:40] !log reenabling puppet on analytics1027 [21:47:44] Logged the message, Master [21:48:33] RECOVERY - Kafka Broker Under Replicated Partitions on analytics1012 is OK: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value OKAY: 0.0 [22:06:13] (03PS1) 10Matanya: otrs: qualify vars [operations/puppet] - 10https://gerrit.wikimedia.org/r/151787 [22:11:23] PROBLEM - Host ms-be1012 is DOWN: PING CRITICAL - Packet loss = 100% [22:34:40] (03PS1) 10Yurik: Revert some analytics-related X-CS zero code [operations/puppet] - 10https://gerrit.wikimedia.org/r/151794 [22:36:23] (03Abandoned) 10Yurik: Revert some analytics-related X-CS zero code [operations/puppet] - 10https://gerrit.wikimedia.org/r/151794 (owner: 10Yurik) [22:36:39] (03PS1) 10Yurik: Revert some analytics-related X-CS zero code [operations/puppet] - 10https://gerrit.wikimedia.org/r/151795 [22:44:09] anyone alive? preferably someone who knows swift :P [22:44:19] we have a report that some files are missing [22:44:37] e.g. https://commons.wikimedia.org/wiki/File:A35_towards_Dorchester_-_geograph.org.uk_-_1176376.jpg [22:44:44] Are we going to play Cluedo? [22:45:45] ms-be1012 is down [22:45:58] I'm not sure if/how much that may be responsible [22:46:42] sjoerddebruin: sure. Magic killed ms-be1012 in the datacenter with a failure. [22:48:04] :D [22:50:34] ugh [22:50:43] MaxSem, let's page some opsen eh? [22:54:31] (03PS2) 10Yurik: Revert some analytics-related X-CS zero code [operations/puppet] - 10https://gerrit.wikimedia.org/r/151795 [22:55:02] bblack, i know you will love this patch ^ [22:55:06] (filippo paged) [22:56:03] PROBLEM - Puppet freshness on labsdb1005 is CRITICAL: Last successful Puppet run was Thu 31 Jul 2014 16:06:51 UTC [22:56:07] I'm around, but I know very little about swift specifics [22:56:10] springle, awake yet?
:) [22:56:41] checking console of dead host ms-be1012 :) [22:57:59] !log rebooting ms-be1012 [22:58:05] Logged the message, Master [23:00:00] (03CR) 10BBlack: [C: 032] ":P" [operations/puppet] - 10https://gerrit.wikimedia.org/r/151795 (owner: 10Yurik) [23:00:04] bblack, when you have a chance, re ^^ patch, apparently it's not as rosy as we wanted... analytics is running super advanced code from their own laptops and do not have the cycles to update it to handle proxies & https, so for now it only handles language & subdomain... in other words bleh! [23:00:12] thx [23:00:13] :) [23:00:26] bblack, thanks for poking, texted sean as well [23:00:47] example of currently failing image: https://commons.wikimedia.org/wiki/File:A35_towards_Dorchester_-_geograph.org.uk_-_1176376.jpg [23:00:55] https://upload.wikimedia.org/wikipedia/commons/f/f4/A35_towards_Dorchester_-_geograph.org.uk_-_1176376.jpg [23:01:03] RECOVERY - Host ms-be1012 is UP: PING OK - Packet loss = 0%, RTA = 0.66 ms [23:04:05] mh... is transwiki import from same wiki working? [23:04:31] (like putting the current wiki in $wgImportSources and then pulling from it) [23:07:30] ms-be1012 had lots of panic-like reports in syslog prior to dying. looks like XFS bugs being triggered from swift. e.g. search for "BUG: soft lockup" in /var/log/syslog [23:08:13] doesn't seem to have magically fixed the broken image URL, but it is processing traffic [23:08:26] 1008 had something similar, judging by SAL [23:10:03] mhm, the image is still missing [23:10:44] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.03 [23:11:35] I presume that the files aren't stored on just one server? [23:11:51] But presumably it's going to need a swift person to poke at it...
[23:12:01] ie godog I guess [23:15:44] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0 [23:15:47] ok, no word from filippo or springle , time to ping godog [23:26:36] MaxSem, do we know how pervasive these issues are? [23:27:06] I have only this report in #-tech [23:27:31] 6 images [23:28:13] ok, interestingly half of these are working again [23:29:01] it's remotely possible that varnish cached the 404 for a while and ms-be1012 was the issue [23:29:17] but I donno yet, I'm still digging around in logs for things I don't fully understand :) [23:29:42] I've tried ?action=purge on those URLs that are still failing but to no avail [23:30:41] it's hard to say from the report if these are even necessarily due to the same root cause [23:31:38] bblack, I'll hop on a plane soon -- once you're done with the investigation would you mind sending a quick recap to ops@ so others can follow up [23:31:44] yeah [23:33:46] added URLs here: http://etherpad.wikimedia.org/p/swiftfail [23:35:39] on that one URL from IRC above, as far as I've gotten is varnish handing off to ms-fe100x, and ms-fe1001 has logs showing the 404 and mentioning the address of mw1153, and that host seems to be an image processor [23:35:56] and mw1153's syslog is full of oomkills of image convertors. It looks like that's just normal operation there? [23:45:03] OK, based on Dispenser's comment I'd be careful interpreting this entirely as a current outage. the ms-be1012 reboot seems to have fixed some of these, but others may be more longstanding [23:45:11] Eloquence: I run a thumbnail cache populating script (initially to speed up WikiMiniAtlas thumbnail view). Right now its grabbing 64px thumbnails for an unannounced project. 
Ought to be done in November because somebody thought 2-3 requests/second was fast enough [23:45:21] So he's running some kind of batch processing script that's hitting those 404s [23:45:44] ok [23:45:57] that said, further investigation of the impact of the ms-be1012 outage is probably warranted [23:46:12] it worries me if a single node failure causes 404s :( [23:46:56] I'm still reading up on swift and trying to understand how to trace a request properly, so no intelligent comment here :) [23:55:02] correction on earlier comment: no file from Dispenser's list is working completely. the ones where the File: page is working still lead to 404s [23:55:30] it's not clear to me that there's any correlation between ms-be1012 and these issues at all - it may just be part of longer-standing corruption issues
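The triage bblack is doing by hand against the etherpad list (which reported files recovered after the ms-be1012 reboot, which still 404) is easy to script. A sketch with the HTTP fetch injectable so the sorting logic stands on its own; a real sweep would pass a urllib- or curl-based fetcher, and the sample paths below are stand-ins for the etherpad entries:

```python
def triage(urls, status_of):
    """Split a list of upload URLs into (recovered, still_failing),
    given a callable mapping url -> HTTP status code. Sorting keeps
    the output stable for pasting back into the etherpad."""
    failing = sorted(u for u in urls if status_of(u) == 404)
    recovered = sorted(u for u in urls if status_of(u) != 404)
    return recovered, failing

# Toy statuses standing in for the etherpad list:
fake = {"f/f4/A35_towards_Dorchester.jpg": 404,
        "a/ab/Some_recovered_file.jpg": 200}
recovered, failing = triage(fake, fake.get)
```

Note the caveat from the discussion: varnish may cache a 404 for a while, so a sweep run right after the reboot can undercount recoveries unless the URLs are purged first.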