[00:00:04] RoanKattouw ostriches Krenair: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151230T0000). [00:00:54] nothing to swat [00:02:51] andrewbogott: so....40 minutes later [00:03:06] I think 10k is ok I guess but we need to figure out why we are thrashing it [00:03:17] as long as there aren't more one-time assignment ops than this [00:03:26] some truncated query seems livable to track down [00:03:34] yeah [00:03:59] I’ll log a bug about that [00:04:19] and then I’m going to try to get CI working again, and then quit for the day. Tomorrow is untangle-all-those-overlapping-accounts day [00:04:29] aka ‘wmf holiday' [00:04:54] well, can you try to help untangle grid engine with us [00:05:10] we should call releng I guess for CI things if it's critical but atm gridengine is dead and no clue [00:05:16] 6operations, 10Wikimedia-Apache-configuration, 7HHVM: Transition to HHVM broke old links to wiki.phtml - https://phabricator.wikimedia.org/T122629#1909365 (10MaxSem) 3NEW [00:06:22] yeah, I’ll try to catch up [00:08:21] !log restarting nova-compute on labvirt1002 [00:08:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:10:22] 6operations, 10Wikimedia-Apache-configuration, 7HHVM: Transition to HHVM broke old links to wiki.phtml - https://phabricator.wikimedia.org/T122629#1909388 (10Krenair) Yeah, IMHO these should be permanent redirects [00:13:58] 6operations, 10Wikimedia-Apache-configuration, 7HHVM: Transition to HHVM broke old links to wiki.phtml - https://phabricator.wikimedia.org/T122629#1909389 (10Rillke) https://www.google.com/?q=wiki.phtml+site:commons.wikimedia.org approx. 1.010 results. 
[00:15:14] RECOVERY - SSH on technetium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [00:15:53] (03PS4) 10Andrew Bogott: openldap: Drop the novaadmin query limit from 'unlimited' [puppet] - 10https://gerrit.wikimedia.org/r/261588 [00:18:07] (03CR) 10Andrew Bogott: [C: 032] openldap: Drop the novaadmin query limit from 'unlimited' [puppet] - 10https://gerrit.wikimedia.org/r/261588 (owner: 10Andrew Bogott) [00:19:01] chasemp, YuviPanda, I’m going to enable puppet on the ldap boxes, which will result in #comment lines being added to a config which will result in restarts of ldap. Just so you know :) [00:19:13] andrewbogott: can we not do that now? [00:19:21] sure :) [00:19:25] I just hate leaving puppet off [00:19:35] once gridengine works :) [00:36:08] Don't know if this is related at all to the failures you're dealing with - [00:36:12] https://www.irccloud.com/pastebin/nKBIDhMt/ [00:37:23] YuviPanda: ^ no hurry and all that, let me know whenever [00:38:14] madhuvishy: yeah, unrelated. should clear up after an apt-get hopefully [00:38:26] mm hmm, okay [00:39:52] PROBLEM - SSH on technetium is CRITICAL: Server answer [00:41:26] okay, this keeps coming up [00:41:32] What does 'Server answer' mean? [00:42:02] RECOVERY - SSH on technetium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [00:43:41] (03PS1) 10Mark Bergsma: Enable RPS on eth0 on labstores [puppet] - 10https://gerrit.wikimedia.org/r/261598 [00:46:28] (03PS2) 10Mark Bergsma: Enable RPS on eth0 on labstores [puppet] - 10https://gerrit.wikimedia.org/r/261598 [01:14:46] Krenair: looks like it may be a truncated attempt to output something more -- https://github.com/nagios-plugins/nagios-plugins/blob/master/plugins/check_ssh.c#L234 [01:15:05] Maybe the bot is splitting on : ? 
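If the relay bot really does split plugin output on ':', the truncation is easy to reproduce with plain parameter expansion. The message text below is an illustration, not check_ssh's actual output (which, per the linked check_ssh.c, continues after "Server answer" with more detail):

```shell
# A plugin line can carry extra detail after a colon (the banner text here
# is an assumption, e.g. an unexpected SSH server greeting):
full='Server answer: 220 unexpected banner'

# A bot that keeps only the text before the first ':' leaves just the
# prefix, matching the bare "Server answer" seen in channel:
echo "${full%%:*}"    # prints: Server answer
```

`${full%%:*}` strips the longest suffix starting at the first colon, which is exactly the behavior a naive split-on-colon would produce.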
[01:24:49] thanks bd808 [01:24:52] maybe [01:38:47] PROBLEM - puppet last run on mw2007 is CRITICAL: CRITICAL: puppet fail [01:41:42] (03PS1) 10Yuvipanda: gridengine: Add berkley db commandline utilities to master [puppet] - 10https://gerrit.wikimedia.org/r/261607 [01:42:58] (03PS2) 10Yuvipanda: gridengine: Add berkley db commandline utilities to master [puppet] - 10https://gerrit.wikimedia.org/r/261607 [01:43:24] (03CR) 10coren: [C: 031] "You hope you never have to use 'em, they are godsent when you do." [puppet] - 10https://gerrit.wikimedia.org/r/261607 (owner: 10Yuvipanda) [01:46:19] PROBLEM - tools-home on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 20 seconds [01:48:03] RECOVERY - tools-home on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 973165 bytes in 12.728 second response time [01:48:28] PROBLEM - SSH on technetium is CRITICAL: Server answer [01:48:38] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [24.0] [01:53:53] (03CR) 10Yuvipanda: [C: 032] gridengine: Add berkley db commandline utilities to master [puppet] - 10https://gerrit.wikimedia.org/r/261607 (owner: 10Yuvipanda) [01:55:32] nfs is-fine-ish [01:55:35] we're doing recovery [01:56:37] RECOVERY - SSH on technetium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [02:04:21] PROBLEM - NFS read/writeable on labs instances on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:05:35] RECOVERY - NFS read/writeable on labs instances on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 1.046 second response time [02:06:23] RECOVERY - puppet last run on mw2007 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [02:13:30] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). 
[02:14:19] !log restbase: canary deploy of 7db8e216 (small bug fix & a security fix) to restbase1001 [02:14:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:14:30] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [02:15:30] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [02:17:17] !log restbase: starting full deploy of 7db8e216 (small bug fix & a security fix) to production cluster [02:17:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:25:20] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.9) (duration: 09m 50s) [02:25:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:26:49] !log restbase: finished full deploy of 7db8e216 (small bug fix & a security fix) to production cluster [02:26:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:32:16] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Dec 30 02:32:15 UTC 2015 (duration 6m 55s) [02:32:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:52:13] PROBLEM - tools-home on tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 383 bytes in 0.006 second response time [02:56:37] (03PS1) 10Dzahn: toollabs: disable paging for tools-home/NFS [puppet] - 10https://gerrit.wikimedia.org/r/261610 [02:58:11] PROBLEM - SSH on technetium is CRITICAL: Server answer [03:00:23] (03CR) 10Yuvipanda: [C: 031] toollabs: disable paging for tools-home/NFS [puppet] - 10https://gerrit.wikimedia.org/r/261610 (owner: 10Dzahn) [03:00:58] (03CR) 10Dzahn: [C: 032] toollabs: disable paging for tools-home/NFS [puppet] - 10https://gerrit.wikimedia.org/r/261610 (owner: 10Dzahn) [03:02:05] RECOVERY - Unmerged changes on repository puppet on 
palladium is OK: No changes to merge. [03:02:55] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [03:03:56] RECOVERY - SSH on technetium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [03:24:04] PROBLEM - SSH on technetium is CRITICAL: Server answer [03:24:48] RECOVERY - tools-home on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 958126 bytes in 4.488 second response time [03:28:14] RECOVERY - SSH on technetium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [03:49:51] PROBLEM - SSH on technetium is CRITICAL: Server answer [03:51:52] RECOVERY - SSH on technetium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [03:58:01] PROBLEM - SSH on technetium is CRITICAL: Server answer [04:04:43] RECOVERY - SSH on technetium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [04:44:20] PROBLEM - SSH on technetium is CRITICAL: Server answer [04:48:29] RECOVERY - SSH on technetium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [04:53:30] RECOVERY - NTP on technetium is OK: NTP OK: Offset -0.004868984222 secs [04:53:50] RECOVERY - dhclient process on technetium is OK: PROCS OK: 0 processes with command name dhclient [04:54:10] RECOVERY - RAID on technetium is OK: OK: no RAID installed [04:54:21] RECOVERY - salt-minion processes on technetium is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [04:54:39] RECOVERY - DPKG on technetium is OK: All packages OK [04:54:59] RECOVERY - Check size of conntrack table on technetium is OK: OK: nf_conntrack is 0 % full [04:56:10] RECOVERY - puppet last run on technetium is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [06:31:18] PROBLEM - puppet last run on cp2013 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:58] PROBLEM - puppet last run on cp4016 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:58] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:17] PROBLEM 
- puppet last run on mw2023 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:18] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:59] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 2 failures [06:32:59] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:58] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:58] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures [06:35:17] PROBLEM - puppet last run on db2056 is CRITICAL: CRITICAL: Puppet has 1 failures [06:35:27] PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet has 1 failures [06:35:27] PROBLEM - puppet last run on mw2120 is CRITICAL: CRITICAL: puppet fail [06:56:58] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:07] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [06:57:18] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [06:57:19] RECOVERY - puppet last run on cp4016 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [06:57:39] RECOVERY - puppet last run on db2056 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:48] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:48] RECOVERY - puppet last run on mw2023 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:57:49] RECOVERY - puppet last run on mw2120 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [06:58:07] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:18] RECOVERY - puppet last run on cp2013 is OK: 
OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:27] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:28] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:01:43] !log setting dbstore1001 to read_only, converting ruwiki.recentchanges back to InnoDB [08:01:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:06:54] there is some kind of interaction between toku, database dumps (and multisource replication?), that makes an insertion there fail the first time, then it succeeds after being read, and auto-generates a duplicate key error [09:05:16] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 631 [09:15:16] RECOVERY - check_mysql on db1008 is OK: Uptime: 751102 Threads: 86 Questions: 30429633 Slow queries: 9095 Opens: 58689 Flush tables: 2 Open tables: 418 Queries per second avg: 40.513 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [09:43:56] 6operations, 10Salt: Move salt master to separate host from puppet master - https://phabricator.wikimedia.org/T115287#1909731 (10ArielGlenn) salt updated on deployment-prep except for deployment-restbase01 which is running sid. I haven't built sid packages and don't plan to. After the update, one host is no... 
[10:00:17] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 616 [10:20:18] RECOVERY - check_mysql on db1008 is OK: Uptime: 755002 Threads: 92 Questions: 30508861 Slow queries: 9247 Opens: 58692 Flush tables: 2 Open tables: 419 Queries per second avg: 40.408 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 60 [10:30:18] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 660 [10:35:10] RECOVERY - check_mysql on db1008 is OK: Uptime: 755902 Threads: 93 Questions: 30532287 Slow queries: 9269 Opens: 58692 Flush tables: 2 Open tables: 419 Queries per second avg: 40.391 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [11:21:19] PROBLEM - puppet last run on mw1166 is CRITICAL: CRITICAL: Puppet has 1 failures [11:46:43] RECOVERY - puppet last run on mw1166 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [11:52:10] (03PS1) 10Nemo bis: [English Planet] Add Greg Sabino Mullane [puppet] - 10https://gerrit.wikimedia.org/r/261626 [11:53:27] (03PS2) 10Nemo bis: [English Planet] Add Greg Sabino Mullane [puppet] - 10https://gerrit.wikimedia.org/r/261626 [12:18:42] 6operations, 10Salt: Move salt master to separate host from puppet master - https://phabricator.wikimedia.org/T115287#1909799 (10ArielGlenn) wikidata-stats.wikidata-dev.eqiad.wmflabs: Minion did not return. [No response] tools-worker-1002.tools.eqiad.wmflabs: Minion did not return. [No response] >>... [12:29:36] (03CR) 10Faidon Liambotis: [C: 031] "Yeah, this makes sense. Note that labstores have 1Gbps NICs (bnx2) and we've never tried RPS/RSS there before." [puppet] - 10https://gerrit.wikimedia.org/r/261598 (owner: 10Mark Bergsma) [12:37:42] (03CR) 10Mark Bergsma: "Yeah, alternatively we could try running irqbalance indeed." 
[puppet] - 10https://gerrit.wikimedia.org/r/261598 (owner: 10Mark Bergsma) [12:45:36] 6operations: Linking a bn.wikipedia.org button to G+ page. - https://phabricator.wikimedia.org/T109810#1909805 (10Aklapper) >>! In T109810#1572602, @Jalexander wrote: > but let me check with the lawyers first. @JAlexander: Did that happen? Any outcome? [12:49:42] 6operations, 7Easy, 5Patch-For-Review: server admin log should include year in date (again) - https://phabricator.wikimedia.org/T85803#1909811 (10Aklapper) @Elee: Any news here? Are you still working on this (as you're set as assignee)? [12:54:11] (03PS3) 10Mark Bergsma: Enable RPS on eth0 on labstores [puppet] - 10https://gerrit.wikimedia.org/r/261598 [12:55:45] (03CR) 10Mark Bergsma: Enable RPS on eth0 on labstores (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/261598 (owner: 10Mark Bergsma) [13:14:23] !log labstore1001: apt-get install irqbalance [13:14:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:17:07] (03CR) 10Mark Bergsma: "I've just installed irqbalance, which is by no means optimal but better than nothing for now." [puppet] - 10https://gerrit.wikimedia.org/r/261598 (owner: 10Mark Bergsma) [13:18:48] PROBLEM - puppet last run on cp3010 is CRITICAL: CRITICAL: puppet fail [13:27:56] so all that really did is move all network interrupts off cpu#0 to a different cpu [13:28:02] which is only very marginally better ;) [13:44:44] (03CR) 10coren: [C: 031] "Very much sane." 
[puppet] - 10https://gerrit.wikimedia.org/r/261598 (owner: 10Mark Bergsma) [13:45:19] RECOVERY - puppet last run on cp3010 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [14:06:28] PROBLEM - puppet last run on mw1183 is CRITICAL: CRITICAL: Puppet has 1 failures [14:25:05] (03CR) 10Luke081515: [C: 031] Enable global AbuseFilter at French Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257868 (https://phabricator.wikimedia.org/T120568) (owner: 10Glaisher) [14:34:37] RECOVERY - puppet last run on mw1183 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [14:45:08] PROBLEM - Outgoing network saturation on labstore1001 is CRITICAL: CRITICAL: 13.64% of data above the critical threshold [100000000.0] [14:54:52] Coren, mark, is ^ your patch? [14:55:31] andrewbogott: I've only +1'ed it - I don't think Mark merged yet. [14:55:45] oh, good point. [14:56:03] andrewbogott: But me and valhallasw`cloud are looking at a lot of huge gridengine logs that live on NFS, it may be our fault. 
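For context on the RPS/irqbalance discussion above: the kernel-side mechanism behind "Enable RPS on eth0" can be sketched as below. This is the standard sysfs interface from the kernel's network scaling documentation, not the contents of the Gerrit change, and the interface name and CPU mask are assumptions:

```shell
# Receive Packet Steering distributes softirq packet processing across
# CPUs by writing a hex CPU bitmask into each RX queue's rps_cpus file.
# Values below are illustrative only.
IFACE=eth0
MASK=fe   # CPUs 1-7; leaves CPU 0 free (cf. "move interrupts off cpu#0")
for q in /sys/class/net/"$IFACE"/queues/rx-*/rps_cpus; do
    echo "$MASK" > "$q"
done
```

irqbalance, by contrast, only redistributes hardware interrupts between CPUs; RPS additionally spreads the per-packet processing work, which is why a 1Gbps single-queue NIC (bnx2) can still benefit from it.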
[14:57:25] !log restarting puppet on serpens; openldap will restart but config should not change [14:57:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:02:47] RECOVERY - Outgoing network saturation on labstore1001 is OK: OK: Less than 10.00% above the threshold [75000000.0] [15:05:47] !log restarting puppet on seaborgium; openldap will restart but config should not change [15:05:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:14:27] 6operations, 10Traffic, 6Zero, 5Patch-For-Review: Merge mobile cache into text cache - https://phabricator.wikimedia.org/T109286#1909891 (10Ottomata) [15:17:25] 6operations, 10Traffic, 6Zero, 5Patch-For-Review: Merge mobile cache into text cache - https://phabricator.wikimedia.org/T109286#1591622 (10Ottomata) [15:32:57] akosiaris: Hi Alex, do you know how to model dependencies between mediawiki extensions in jenkins? In https://gerrit.wikimedia.org/r/#/c/259167/7 we add Wikidata support for math. But Jenkins seems to be unaware of Wikidata. [15:42:24] 6operations, 10Traffic, 6Zero, 5Patch-For-Review: Merge mobile cache into text cache - https://phabricator.wikimedia.org/T109286#1909963 (10Ottomata) [15:43:39] 6operations, 10Traffic, 6Zero, 5Patch-For-Review: Merge mobile cache into text cache - https://phabricator.wikimedia.org/T109286#1591624 (10Ottomata) Ok, I think we are ready on the Analytics side. We'll need to do some things right after this change is made, so some planning is in order over in https://p... [16:05:01] jouncebot, next [16:05:01] In 0 hour(s) and 54 minute(s): Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151230T1700) [16:07:03] _joe_, can we do https://gerrit.wikimedia.org/r/260593 ? 
[16:08:10] there's a) a freeze, and b) people are supposed to be off until monday, so what do you think :) [16:18:56] I wasn't sure how much that affected the puppet changes [17:00:04] _joe_: Dear anthropoid, the time has come. Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151230T1700). [17:12:08] Coren: would you expect a ‘find /srv’ on labstore to complete… ever? [17:12:11] I’m starting to wonder [17:12:28] andrewbogott: Err, yes, very many hours later. [17:12:54] andrewbogott: /srv has hundreds of millions of files [17:12:57] What I need is a way to find all homedirs for a given user [17:13:13] my current plan is find . -type d -name "home" > allhomes.txt [17:13:20] andrewbogott: Heh. Much simpler: [17:13:24] and then grep allhomes for the actual user’s home (to avoid running find ever again) [17:13:49] cd srv;echo */*/home/ [17:14:13] Glob is much smarter about it, because it knows to not try to traverse [17:14:16] it’ll always be that exact depth? [17:14:37] Yeah, because /srv/project/foo/home or /srv/other/foo/home [17:14:40] hm... [17:14:43] ok, that’s easier then :) [17:16:26] hm, that ‘echo’ succeeds whether or not there’s a match [17:16:28] weird [17:19:46] PROBLEM - Host mr1-codfw.oob is DOWN: CRITICAL - Network Unreachable (mr1-codfw.oob.wikimedia.org) [17:20:56] PROBLEM - Router interfaces on mr1-codfw is CRITICAL: CRITICAL: host 208.80.153.196, interfaces up: 33, down: 1, dormant: 0, excluded: 0, unused: 0BRge-0/0/7: down - Transit: CyrusOne OOB (IP-000008-01) {#1099} [1Gbps Cu]BR [17:21:03] andrewbogott: Not sure what you mean? globs will give you all the matches you want, or expand to themselves when there are none. That's what globs do. :-) [17:21:45] right, it’s just funny that if the glob doesn’t match anything it echoes the pattern instead. 
[17:21:49] # echo */*/home/lissacoffey [17:21:49] */*/home/lissacoffey [17:22:01] not a problem, just surprises me [17:22:28] That's why code that relies on going through a glob always has a test. Like: [17:23:01] for home in */*/home/the_user;do if [ -d "$home" ]; then do_something_to $home; fi; done [17:23:33] In your case, I expect do_something_to is akin to chown the_user $home :-) [17:30:09] Coren: before I break things, mind a quick look at /srv/chownhome.sh ? [17:30:27] andrewbogott: You'll have to paste it for me. [17:30:31] dpaste* [17:30:33] ah, yes, sorry :) [17:30:46] https://dpaste.de/XNZM [17:31:53] andrewbogott: Intended to run on labstore*? [17:32:08] yeah, on labstore1001 for each of the affected users [17:32:11] (there are lots :( ) [17:32:17] https://dpaste.de/HeRV [17:32:34] Removed the unneeded echo and added the much-important useldap [17:32:46] Otherwise, chown has no idea who you're talking about. [17:33:05] Aw, I like the echo [17:33:33] oh, I see [17:34:01] Wouldn't have broken things in this specific case, but never trust expansion of filenames; they always contain spaces at the worst of times. :-) [17:34:24] The glob expands correctly to tokens. [17:35:07] yep, ok [17:35:40] (Also, every use of $homedir should be quoted to "$homedir" for that reason, but again this is not an issue in this specific case). [17:36:27] ok, here goes... [17:37:50] well, that was a letdown, apparently only 7 of those 150 users had ever logged in [17:39:26] Heh. "Sorry the task ended up being fairly easy?" :-) [17:46:30] andrewbogott: the task can probably also be un-'security'-ed? [17:54:06] valhallasw`cloud: I don’t know. Publishing the task will encourage people to go looking for overlaps. I don’t think there are any, but nevertheless... 
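Putting the pieces of the exchange above together, the glob-plus-existence-test pattern looks roughly like this. The directory layout is a toy mock-up for illustration (on labstore the real depth is /srv/&lt;share&gt;/&lt;project&gt;/home/), and the action is an echo rather than the actual chown:

```shell
# Toy layout mimicking the fixed-depth homedir structure under /srv.
base=$(mktemp -d)
mkdir -p "$base/project/tools/home/lissacoffey" \
         "$base/others/maps/home/lissacoffey"
cd "$base"

# A fixed-depth glob never traverses deeper than it spells out, so it is
# vastly cheaper than find(1) over hundreds of millions of files. But an
# unmatched glob expands to the pattern itself (in bash, `shopt -s
# nullglob` changes that), so every candidate needs a -d test:
for home in */*/home/lissacoffey; do
    if [ -d "$home" ]; then
        # Quote "$home" everywhere: filenames can contain spaces.
        echo "would chown: $home"
    fi
done
```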
[17:54:19] (Of course, now I’m talking about it in a public channel…) [17:54:46] yeah, and I just posted a followup task without security tag [17:55:34] I dunno, there's more effective ways to wreak havoc in labs than this ;-) [17:56:59] valhallasw`cloud: yeah, ok. [18:16:17] RECOVERY - Router interfaces on mr1-codfw is OK: OK: host 208.80.153.196, interfaces up: 35, down: 0, dormant: 0, excluded: 0, unused: 0 [18:20:27] RECOVERY - Host mr1-codfw.oob is UP: PING OK - Packet loss = 0%, RTA = 34.78 ms [18:39:57] PROBLEM - HHVM rendering on mw1123 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:40:37] PROBLEM - Apache HTTP on mw1123 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:41:18] PROBLEM - SSH on mw1123 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:41:18] PROBLEM - salt-minion processes on mw1123 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:41:37] PROBLEM - HHVM processes on mw1123 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:41:58] PROBLEM - configured eth on mw1123 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:42:08] PROBLEM - Disk space on mw1123 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:42:29] PROBLEM - RAID on mw1123 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:42:38] PROBLEM - nutcracker port on mw1123 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:42:48] PROBLEM - nutcracker process on mw1123 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:42:58] PROBLEM - Check size of conntrack table on mw1123 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:43:08] PROBLEM - puppet last run on mw1123 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:43:17] PROBLEM - DPKG on mw1123 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:46:27] PROBLEM - dhclient process on mw1123 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[18:46:45] (03PS1) 10Halfak: Sets ORES redis cache_maxmemory => '2G' [puppet] - 10https://gerrit.wikimedia.org/r/261642 [18:47:50] (03CR) 10Halfak: "See https://phabricator.wikimedia.org/T122666" [puppet] - 10https://gerrit.wikimedia.org/r/261642 (owner: 10Halfak) [18:50:57] RECOVERY - nutcracker port on mw1123 is OK: TCP OK - 0.000 second response time on port 11212 [18:50:58] RECOVERY - nutcracker process on mw1123 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [18:51:08] RECOVERY - Check size of conntrack table on mw1123 is OK: OK: nf_conntrack is 0 % full [18:51:27] RECOVERY - DPKG on mw1123 is OK: All packages OK [18:51:28] RECOVERY - SSH on mw1123 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [18:51:28] RECOVERY - salt-minion processes on mw1123 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [18:51:47] RECOVERY - HHVM processes on mw1123 is OK: PROCS OK: 6 processes with command name hhvm [18:52:08] RECOVERY - configured eth on mw1123 is OK: OK - interfaces up [18:52:18] RECOVERY - Disk space on mw1123 is OK: DISK OK [18:52:28] RECOVERY - dhclient process on mw1123 is OK: PROCS OK: 0 processes with command name dhclient [18:52:47] RECOVERY - RAID on mw1123 is OK: OK: no RAID installed [19:06:48] RECOVERY - puppet last run on mw1123 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [20:09:03] 6operations, 10Traffic: Varnish apparently unconditionally varies on cookie value - https://phabricator.wikimedia.org/T122673#1910240 (10GWicke) 3NEW [20:09:19] 6operations, 10Traffic: Varnish apparently unconditionally varies on cookie value - https://phabricator.wikimedia.org/T122673#1910250 (10GWicke) [20:10:00] 6operations, 10Traffic: Varnish apparently unconditionally varies on cookie value - https://phabricator.wikimedia.org/T122673#1910240 (10GWicke) [20:10:57] 6operations, 10Traffic: Varnish apparently unconditionally varies on session cookies - 
https://phabricator.wikimedia.org/T122673#1910254 (10GWicke) [20:16:27] 6operations, 10Traffic: Varnish apparently unconditionally varies on session cookies - https://phabricator.wikimedia.org/T122673#1910265 (10GWicke) This vary behavior seems to be hardcoded in [evaluate_cookie](https://github.com/wikimedia/operations-puppet/blob/650721dba65c57ac6edc77ff2a55f155a78ba32d/template... [20:27:46] (03PS1) 10GWicke: Varnish: Don't disable caching for authenticated REST API requests [puppet] - 10https://gerrit.wikimedia.org/r/261662 (https://phabricator.wikimedia.org/T122673) [20:28:37] 6operations: Translate extension seemingly broken / partially installed - https://phabricator.wikimedia.org/T122675#1910285 (10coren) 3NEW [20:28:44] (03PS2) 10GWicke: Varnish: Don't disable caching for authenticated REST API requests [puppet] - 10https://gerrit.wikimedia.org/r/261662 (https://phabricator.wikimedia.org/T122673) [20:28:49] 6operations: Translate extension seemingly broken / partially installed on wikimedia2017wiki - https://phabricator.wikimedia.org/T122675#1910292 (10coren) [20:33:10] 6operations, 10Traffic, 5Patch-For-Review: Varnish apparently unconditionally varies on session cookies - https://phabricator.wikimedia.org/T122673#1910297 (10GWicke) The patch above adds another exception that prevents the no-cache override from applying to /api/rest_v1/. It's not really a complete solution... [20:34:23] 6operations, 10Traffic, 5Patch-For-Review, 7Performance: Varnish apparently unconditionally varies on session cookies - https://phabricator.wikimedia.org/T122673#1910298 (10GWicke) [20:34:24] (03PS1) 10Yuvipanda: redis: Set vm_overcommit = 1 for all redises [puppet] - 10https://gerrit.wikimedia.org/r/261663 [20:34:36] ori: ^ are you around? [20:34:46] yes [20:34:48] what's up? [20:35:33] ori: see patch. 
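Background on the vm_overcommit patch just submitted: Redis persistence works by fork()ing a copy-on-write child that writes the dataset out, and under the kernel's default overcommit accounting that fork can be refused for a large dataset. A sketch of the underlying sysctl (standard Linux semantics; the actual rollout is via puppet and the sysctl.d file name below is an assumption):

```shell
# vm.overcommit_memory: 0 = heuristic, 1 = always allow, 2 = strict.
# Under 0 or 2, fork() of a large Redis process can fail with ENOMEM,
# even though copy-on-write means the child touches very little memory.
cat /proc/sys/vm/overcommit_memory   # inspect the current policy

# Apply at runtime (needs root):
sysctl -w vm.overcommit_memory=1
# Persist across reboots:
echo 'vm.overcommit_memory = 1' > /etc/sysctl.d/60-redis-overcommit.conf
```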
[20:35:45] the ores redis just started puking because of lack of vm_overcommit [20:35:55] then I realized we'll have to add that to literally all our roles [20:36:09] (03PS3) 10GWicke: Varnish: Don't disable caching for authenticated REST API requests [puppet] - 10https://gerrit.wikimedia.org/r/261662 (https://phabricator.wikimedia.org/T122673) [20:36:11] since they all use persistence [20:36:27] (03CR) 10Ori.livneh: [C: 031] redis: Set vm_overcommit = 1 for all redises [puppet] - 10https://gerrit.wikimedia.org/r/261663 (owner: 10Yuvipanda) [20:36:37] ori: thanks [20:37:38] (03CR) 10Yuvipanda: [C: 032] redis: Set vm_overcommit = 1 for all redises [puppet] - 10https://gerrit.wikimedia.org/r/261663 (owner: 10Yuvipanda) [20:45:24] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Test multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#1910301 (10Eevans) [21:22:34] PROBLEM - puppet last run on bromine is CRITICAL: CRITICAL: Puppet has 1 failures [21:29:16] (03PS1) 10Ori.livneh: redis: small lint fix [puppet] - 10https://gerrit.wikimedia.org/r/261724 [21:29:30] (03CR) 10Ori.livneh: [C: 032 V: 032] redis: small lint fix [puppet] - 10https://gerrit.wikimedia.org/r/261724 (owner: 10Ori.livneh) [21:50:14] RECOVERY - puppet last run on bromine is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:02:22] o/ apergos [22:02:41] Do you know where I would report an issue with pagecounts dumps? [22:02:52] E.g. pagecounts-20150509-060000.gz has compression errors [22:05:28] halfak: there is a report already about invalid bzip2 [22:06:05] Hmm.. This is gzip [22:08:39] halfak: everything goes in phab :) [22:09:49] Reedy, yes, but what would be the project. Who owns that? [22:10:31] Datasets-General-or-Unknown [22:11:20] Sure it's not Dumps-Generation? [22:12:54] It's not an xml/mysql dump is it? [22:14:11] Nope. It's a pageview dump. Also hosted on dumps.wikimedia.org. 
[22:14:53] Use General then [22:15:07] Aha! It's Datasets-Webstatscollector [22:18:06] andre__: Should a #phabricator upstream task contain the project phabricator? [22:18:12] or should I remove it? [22:33:58] Luke081515: Depends on each specific task. No "general" rule. [22:34:40] Luke081515: if we won't solve it / work around it's upstream only. If we might have a local "solution" it might be both. If it's just our config it's only #phabricator. [22:35:02] ok, thanks [22:48:25] PROBLEM - puppet last run on lvs3003 is CRITICAL: CRITICAL: puppet fail [22:49:32] at midnight o clock I admit I was out [23:02:00] 6operations, 10DBA: Revision 186704908 on en.wikipedia.org, Fatal exception: unknown "cluster16" - https://phabricator.wikimedia.org/T26675#1910616 (10Krenair) a:5Betacommand>3Krenair ```krenair@tin:~$ mwscript eval.php enwiki > echo ExternalStore::insertToDefault( gzdeflate( "SYSADMIN NOTE: Text of this r... [23:03:23] 6operations, 10DBA: Revision 186704908 on en.wikipedia.org, Fatal exception: unknown "cluster16" - https://phabricator.wikimedia.org/T26675#1910618 (10Krenair) I guess we should update rev_len (currently 78946) and rev_sha1 (currently blank) as well? [23:16:19] RECOVERY - puppet last run on lvs3003 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [23:20:38] 6operations, 10DBA: Revision 186704908 on en.wikipedia.org, Fatal exception: unknown "cluster16" - https://phabricator.wikimedia.org/T26675#1910638 (10Betacommand) Yeah, sorry for the delay in getting back to this, I have a dump from a few months after this, but it doesn't look like the revision is in it.
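On the pagecounts file with compression errors reported above: `gzip -t` tests archive integrity without writing the decompressed data anywhere, which is a cheap way to confirm (or rule out) such a report before filing it. A sketch against a deliberately truncated file:

```shell
workdir=$(mktemp -d) && cd "$workdir"

# Build a small gzip file, then truncate it to simulate corruption.
printf 'pagecounts test data\n' | gzip > sample.gz
head -c 10 sample.gz > truncated.gz

# -t exits non-zero on a damaged archive:
gzip -t sample.gz && echo 'sample.gz: OK'
gzip -t truncated.gz 2>/dev/null || echo 'truncated.gz: compression errors'
```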