[00:04:02] !log krinkle synchronized php-1.23wmf3/resources 'I8704a6620ece44d'
[00:04:22] Logged the message, Master
[00:04:54] Krinkle: ?
[00:05:18] Krinkle: You're in my deploy window ...
[00:05:48] :(
[00:05:58] bad timo
[00:06:10] yo dawg, i heard you like deploys, so i put a deploy in your deploy window so you can deploy while i deploy.
[00:06:36] gods, that's not even funny. i should go sleep.
[00:06:47] but that joke always SEEMS to make sense
[00:06:58] Sorry, forgot to look at the calendar. I intended to do this a few hours back but had it open still.
[00:07:33] Krinkle: ping me next time ;)
[00:08:43] !log catrope synchronized php-1.23wmf3/extensions/VisualEditor 'add more EventLogging events (https://gerrit.wikimedia.org/r/#/c/94092/ )'
[00:09:00] Logged the message, Master
[00:09:15] wow, that's a lot
[00:09:15] (PS2) Dr0ptp4kt: Load ZeroRatedMobileAccess only where currently supported. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/94250
[00:09:38] It's deceptive because every flavor of saveError is a different event
[00:09:39] ^MaxSem
[00:09:45] er ^^
[00:09:52] Half of that list is just breaking down why exactly the user's save failed
[00:10:27] dr0ptp4kt, still 'wikipedia' instead of 'wiki'
[00:10:28] (deploy done)
[00:10:31] RoanKattouw: I like
[00:10:39] MaxSem, ah, ok
[00:10:43] MaxSem: But it *is* 'wikipedia' instead of 'wiki' , isn't it?
[00:10:43] hang on
[00:11:00] That's what $site is set to at least
[00:11:15] I just read through the relevant code in CommonSettings.php with dr0ptp4kt looking over my shoulder about an hour ago
[00:12:49] (PS1) QChris: Turn off geowiki monitoring [operations/puppet] - https://gerrit.wikimedia.org/r/94290
[00:12:50] (PS1) QChris: Move geowiki's name for research MySQL config into separate variable [operations/puppet] - https://gerrit.wikimedia.org/r/94291
[00:12:51] (PS1) QChris: Move geowiki's name for globaldev MySQL config into separate variable [operations/puppet] - https://gerrit.wikimedia.org/r/94292
[00:12:52] (PS1) QChris: Split geowiki paths in base path and scripts [operations/puppet] - https://gerrit.wikimedia.org/r/94293
[00:12:53] (PS1) QChris: Move geowiki scripts into geowiki's scripts subdirectory [operations/puppet] - https://gerrit.wikimedia.org/r/94294
[00:12:54] (PS1) QChris: Rename geowiki backups to logs, as they are only logs [operations/puppet] - https://gerrit.wikimedia.org/r/94295
[00:12:55] (PS1) QChris: Move geowiki data checkout into geowiki's base directory [operations/puppet] - https://gerrit.wikimedia.org/r/94296
[00:12:56] (PS1) QChris: Split geowiki data repository into private and public parts [operations/puppet] - https://gerrit.wikimedia.org/r/94297
[00:12:57] (PS1) QChris: Turn on generating geowiki's limn files [operations/puppet] - https://gerrit.wikimedia.org/r/94298
[00:12:58] (PS1) QChris: Rsync geowiki's bare data-private repository to statistics' webservers [operations/puppet] - https://gerrit.wikimedia.org/r/94299
[00:12:59] (PS1) QChris: Checkout geowiki's data-private repo also on statistics' webservers [operations/puppet] - https://gerrit.wikimedia.org/r/94300
[00:15:23] (PS1) Chad: Move autosetuprebase to where it will actually do something useful [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/94301
[00:15:32] <^d> Reedy: Lol ^
[00:17:48] (PS1) Chad: Use descriptive heredoc [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/94303
[00:21:43] !log csteipp synchronized php-1.23wmf3/includes 'bug 55332'
[00:21:55] Logged the message, Master
[00:21:58] !log Reloading zuul to enable Iaceb016cf7df20
[00:22:11] Logged the message, Master
[00:27:42] PROBLEM - Host analytics1012 is DOWN: PING CRITICAL - Packet loss = 100%
[00:29:03] PROBLEM - MySQL Slave Running on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:30:52] RECOVERY - MySQL Slave Running on db1047 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error:
[00:34:19] Hm.. Ganglia's dashboards seem to be broken
[00:34:19] https://ganglia.wikimedia.org/latest/tasseo.php?view_name=Navigation+Timing#
[00:34:23] Uncaught TypeError: Cannot read property 'length' of null
[00:35:12]
[00:36:02] PROBLEM - Host analytics1011 is DOWN: PING CRITICAL - Packet loss = 100%
[00:38:02] RECOVERY - Host analytics1011 is UP: PING OK - Packet loss = 0%, RTA = 0.60 ms
[00:39:23] PROBLEM - Host analytics1014 is DOWN: PING CRITICAL - Packet loss = 100%
[00:41:22] RECOVERY - Host analytics1014 is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms
[00:44:42] PROBLEM - Host analytics1014 is DOWN: PING CRITICAL - Packet loss = 100%
[00:45:22] RECOVERY - Host analytics1014 is UP: PING OK - Packet loss = 0%, RTA = 1.41 ms
[00:48:56] (PS1) Dzahn: fix pdf servers in dsh groups [operations/puppet] - https://gerrit.wikimedia.org/r/94307
[00:50:26] ottomata: upgrading analytics?
[00:55:45] What does the "sd" in sdtpa stand for?
[00:56:52] switch&data
[00:57:06] vendor-airport
[00:57:34] k
[00:57:46] I figured that from esams and equid, but couldn't find out which one it stood for
[00:57:49] Krinkle: https://en.wikipedia.org/wiki/Switch_and_Data
[00:57:54] acquired by Equinix
[00:58:03] which is the eq in eqiad
[00:58:17] yeah
[00:59:42] PROBLEM - MySQL InnoDB on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:59:42] PROBLEM - MySQL Recent Restart on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:00:20] mutante: Hm.. So what does "pm" stand for?
[01:00:24] <^d> power medium
[01:00:30] :D
[01:00:39] important facts are important
[01:00:46] street name?
[01:00:52] company name
[01:01:02] PROBLEM - MySQL Idle Transactions on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:01:06] [00:57:03] vendor-airport
[01:01:07] put it in wikidata :p
[01:01:18] ulsfo
[01:01:21] united layer-sfo
[01:01:27] knams
[01:01:28] Reedy: Wait... pmtpa != sdtpa?
[01:01:33] <^d> Same building.
[01:01:35] kennisnet - esams
[01:01:37] different floor
[01:01:38] <^d> Different providers.
[01:01:38] Different company
[01:01:41] different floor
[01:01:43] heh
[01:01:56] same fibre that can be cut by a lawnmower
[01:02:01] haha
[01:02:02] PROBLEM - MySQL Slave Running on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:02:28] <^d> yaseo -> crappy colo in seoul
[01:02:33] RECOVERY - MySQL InnoDB on db1047 is OK: OK longest blocking idle transaction sleeps for 0 seconds
[01:02:52] RECOVERY - MySQL Idle Transactions on db1047 is OK: OK longest blocking idle transaction sleeps for 0 seconds
[01:02:52] RECOVERY - MySQL Slave Running on db1047 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error:
[01:03:42] <^d> Reedy: Can I haz merge?
[01:04:16] seo is not airport ?
[01:04:30] ICN
[01:04:32] RECOVERY - MySQL Recent Restart on db1047 is OK: OK 2540455 seconds since restart
[01:04:50] Krinkle: This is all documented on Meta-Wiki, BTW.
[01:04:53] <^d> Yeah probably should've been icn in hindsight :)
[01:05:01] Elsie: Not on wikitech, that's where I was looking
[01:05:12] at least switch&data and power medium were nowhere mentioned.
[01:05:16] https://meta.wikimedia.org/wiki/Wikimedia_servers#Hosting
[01:05:28] <^d> We should document the servers on wikitech.
[01:05:43] mutante: https://en.wikipedia.org/wiki/S%C3%A9gu%C3%A9la_Airport is not particularly close to Seoul
[01:05:53] ^d: That's been tried. It failed.
[01:06:10] PROBLEM - MySQL Slave Running on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:07:01] PROBLEM - MySQL Recent Restart on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:07:10] PROBLEM - Host analytics1011 is DOWN: PING CRITICAL - Packet loss = 100%
[01:07:26] Elsie: ^d : wait. https://wikitech.wikimedia.org/w/index.php?limit=50&tagfilter=&title=Special%3AContributions&contribs=user&target=Dzahn&namespace=&tagfilter=&year=2013&month=10
[01:07:50] RECOVERY - MySQL Recent Restart on db1047 is OK: OK 2540653 seconds since restart
[01:07:59] RoanKattouw: yea, indeed. didn't follow the convention
[01:08:23] https://wikitech.wikimedia.org/wiki/Special:Log/delete/RobH
[01:08:35] Though I thought there were more deletions...
[01:09:04] (PS3) Dr0ptp4kt: Load ZeroRatedMobileAccess only where currently supported. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/94250
[01:09:40] RECOVERY - Host analytics1011 is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms
[01:09:44] Elsie: it failed because we had no good system for doing so
[01:10:00] RECOVERY - MySQL Slave Running on db1047 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error:
[01:10:04] I thought there was some other inventory system that was being used.
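The vendor-airport naming explained in the exchange above can be sketched in a few lines of Python. This is only an illustration: the lookup tables contain just the vendor prefixes spelled out in the chat, the airport names are the usual IATA meanings of those codes, and the function name `decode_site` is hypothetical rather than part of any real tool.

```python
# Two-letter vendor prefixes, as spelled out in the conversation.
VENDORS = {
    "sd": "Switch and Data",
    "pm": "Power Medium",
    "eq": "Equinix",
    "ul": "United Layer",
}

# Three-letter IATA airport codes used as location suffixes.
AIRPORTS = {
    "tpa": "Tampa",
    "iad": "Washington Dulles",
    "sfo": "San Francisco",
}


def decode_site(code):
    """Split a five-letter site code into (vendor, location).

    Parts that are not in the tables come back as None instead of
    being guessed, which is what happens for names that break the
    convention.
    """
    code = code.lower()
    if len(code) != 5:
        return (None, None)
    prefix, suffix = code[:2], code[2:]
    return (VENDORS.get(prefix), AIRPORTS.get(suffix))
```

With these tables, `decode_site("sdtpa")` and `decode_site("pmtpa")` resolve to the same location with different vendors, while `decode_site("yaseo")` yields `(None, None)` because "seo" is not the airport code (ICN would have been).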
[01:10:08] PowerRack or something?
[01:10:13] we're using racktables and we all hate t
[01:10:15] *it
[01:10:19] Ah, that was it.
[01:10:21] PROBLEM - Host analytics1013 is DOWN: PING CRITICAL - Packet loss = 100%
[01:10:28] So back to the wiki? :-)
[01:10:43] I'd move it there, but I don't have time to set up a reasonable SMW system for it
[01:10:59] I think others are looking at some other systems
[01:11:14] i agree we shouldn't delete pages
[01:11:23] they should be moved to something like archive
[01:11:26] if there is the need to
[01:11:38] we have an archive namespace, I think
[01:11:42] but there are concerns that they show up in searches
[01:11:55] which i think is a good thing
[01:11:57] and that namespace is left out of the default search
[01:12:10] as long as you can tell it's just history vs. current info
[01:12:10] RECOVERY - Host analytics1013 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[01:12:12] (CR) Dr0ptp4kt: [C: -2] "Let's discuss whether 'wikipedia' or 'wiki' is more correct." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/94250 (owner: Dr0ptp4kt)
[01:12:18] Ryan_Lane: sounds good, yea
[01:14:17] greg-g: for the code deploy dashboard...
[01:14:40] trebuchet writes its deployment info into redis
[01:14:58] if we change its schema some we can track each deployment separately by tag
[01:15:19] as well as the deployment message that went along with it
[01:15:26] then a dashboard could just read from redis
[01:16:02] currently each deployment for a repo overwrites the data from the last, to make things simpler
[01:16:30] hm. or does it? did I change that
[01:16:35] I need to document the schema being used
[01:17:05] (CR) Chad: "It's wiki, not wikipedia." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/94250 (owner: Dr0ptp4kt)
[01:17:23] if we are not going to use wikitech, and we don't want google docs, and we dont want bugzilla and RT isn't enough and we cant agree on using dsh groups..
i'm at a loss here how to track this
[01:17:27] bbl
[01:17:51] hm, I should also put the schema/attribute mapping in a pillar so that it can be changed without needing to modify the code everywhere
[01:18:07] mutante: how to track what?
[01:18:10] servers?
[01:19:02] yes, and specifically the ones left in tampa and which can be really decom
[01:19:09] ah
[01:19:14] salt grains? :)
[01:19:26] you can add/remove them via a command
[01:20:02] then you can also target them using salt
[01:20:07] it wouldnt cover pdf and hardy and
[01:20:09] (CR) Ori.livneh: [C: 2] Update static-current symlinks to 1.23wmf3 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/94277 (owner: Ori.livneh)
[01:20:12] ugh
[01:20:13] right
[01:20:29] I forgot we still have hardy systems
[01:20:51] well, this is not that outdated now https://wikitech.wikimedia.org/wiki/Tampa_cluster#Misc._Services_Pending_Migration
[01:20:58] but i need to get food
[01:22:17] <^d> Nobody ever e-mailed ops about formey decom.
[01:22:28] <^d> I remember someone saying that the other day.
[01:22:33] <^d> *ops list
[01:26:00] PROBLEM - Host analytics1014 is DOWN: PING CRITICAL - Packet loss = 100%
[01:26:42] (CR) Catrope: [C: 2] Move autosetuprebase to where it will actually do something useful [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/94301 (owner: Chad)
[01:26:53] (Merged) jenkins-bot: Move autosetuprebase to where it will actually do something useful [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/94301 (owner: Chad)
[01:27:20] RECOVERY - Host analytics1014 is UP: PING OK - Packet loss = 0%, RTA = 1.19 ms
[01:33:01] !log catrope synchronized wmf-config/InitialiseSettings.php 'fix logo path for wikimania2014wiki'
[01:33:01] !log adding temporary index to S5 wikidatawiki.wb_terms for slow queries
[01:33:20] Logged the message, Master
[01:33:36] Logged the message, Master
[01:33:56] !log catrope synchronized docroot/bits/static-current/ 'update static-current symlinks to wmf3'
[01:34:15] Logged the message, Master
[01:34:36] RoanKattouw: oops, thanks
[01:35:26] ori-l: Not your fault, +2ers are responsible for insta-deploying in mw-config
[01:35:49] ori-l: Wait, you were the +2er.
Sorry, it was your fault ;)
[01:46:05] ori updated common to I1e6c5a3b8
[01:46:27] (^ testing)
[01:53:15] <^d> !log tin: set git config branch.*.rebase true for all deployed branches
[01:53:24] <^d> RoanKattouw_away: So not just fixed in future, fixed now ^
[01:53:31] Logged the message, Master
[01:54:00] !log ori updated common to I1e6c5a3b8: Fix how 'current' branch is determined in updateBitsBranchPointers
[01:54:19] Logged the message, Master
[01:54:31] (^ also a test; I'll remove from the SAL)
[02:03:13] (PS1) Ori.livneh: Revert "Update RC2UDP config to use $wgRCFeeds" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/94312
[02:03:45] (Abandoned) Chad: Revert "Update RC2UDP config to use $wgRCFeeds" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/94312 (owner: Ori.livneh)
[02:04:56] (PS1) Chad: Don't enable RC to UDP feeds for labs [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/94313
[02:06:33] (PS2) Chad: Don't enable RC to UDP feeds for labs [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/94313
[02:06:45] (CR) Chad: [C: 2 V: 2] Don't enable RC to UDP feeds for labs [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/94313 (owner: Chad)
[02:09:47] !log LocalisationUpdate completed (1.23wmf2) at Fri Nov 8 02:09:47 UTC 2013
[02:10:03] Logged the message, Master
[02:14:41] Oh
[02:14:47] Does labs set something else to false?
[02:17:42] PROBLEM - Host analytics1013 is DOWN: PING CRITICAL - Packet loss = 100%
[02:19:02] RECOVERY - Host analytics1013 is UP: PING OK - Packet loss = 0%, RTA = 0.41 ms
[02:23:40] !log deployed Parsoid 67fca5bdc7
[02:23:58] Logged the message, Master
[02:25:34] <^d> Reedy: It left $wgRC2UDPPrefix at the default of false.
[02:26:15] <^d> You changed the trigger to be a new global $wmgUseRC2UDP, and didn't set it to false for labs.
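The regression ^d describes above can be modelled with a small Python sketch. It is not the actual PHP in mediawiki-config: the `resolve` helper, the settings tables, the realm names and the assumption that the new flag defaults to true are all hypothetical; only the flag name `wmgUseRC2UDP` is taken from the chat.

```python
def resolve(settings, name, realm):
    """Return the realm-specific value of a setting, else its default."""
    values = settings[name]
    return values.get(realm, values["default"])


# Assumed state after the trigger moved to the new flag: it is on by
# default and labs has no override, so labs inherits the default.
BEFORE_FIX = {"wmgUseRC2UDP": {"default": True}}

# Assumed state after "Don't enable RC to UDP feeds for labs".
AFTER_FIX = {"wmgUseRC2UDP": {"default": True, "labs": False}}


def feed_enabled(settings, realm):
    """Whether the RC-to-UDP feed would be switched on for a realm."""
    return bool(resolve(settings, "wmgUseRC2UDP", realm))
```

Under these assumptions `feed_enabled(BEFORE_FIX, "labs")` is true, which is the unwanted behaviour, and `feed_enabled(AFTER_FIX, "labs")` is false while production keeps the default.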
[02:29:00] !log LocalisationUpdate completed (1.23wmf3) at Fri Nov 8 02:28:59 UTC 2013
[02:29:16] Logged the message, Master
[02:31:39] ^d: shouldn't beta labs be incapable of pushing to production udp in the first place?
[02:31:48] <^d> One would think.
[02:33:26] There's still that bug about beta labs sending emails to the new projects list
[02:33:50] https://bugzilla.wikimedia.org/show_bug.cgi?id=48786
[02:33:58] is the sender address different ?
[02:34:19] the list admin could just block that sender
[02:34:23] in recipient filters
[02:34:57] ah, already has that comment on it
[02:35:52] "it is just about adapting the notifyNewProjects to
[02:35:53] have it using a different email."
[02:36:52] PROBLEM - Host analytics1014 is DOWN: PING CRITICAL - Packet loss = 100%
[02:38:22] RECOVERY - Host analytics1014 is UP: PING OK - Packet loss = 0%, RTA = 3.50 ms
[02:49:02] PROBLEM - Host analytics1011 is DOWN: CRITICAL - Plugin timed out after 15 seconds
[02:50:02] RECOVERY - Host analytics1011 is UP: PING OK - Packet loss = 0%, RTA = 0.39 ms
[02:57:42] PROBLEM - Host analytics1014 is DOWN: PING CRITICAL - Packet loss = 100%
[02:59:22] RECOVERY - Host analytics1014 is UP: PING OK - Packet loss = 0%, RTA = 0.47 ms
[03:02:12] PROBLEM - Host analytics1014 is DOWN: PING CRITICAL - Packet loss = 100%
[03:03:22] RECOVERY - Host analytics1014 is UP: PING OK - Packet loss = 0%, RTA = 1.42 ms
[03:06:03] RECOVERY - check_job_queue on fenari is OK: JOBQUEUE OK - all job queues below 10,000
[03:09:13] PROBLEM - check_job_queue on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:17:21] !log LocalisationUpdate ResourceLoader cache refresh completed at Fri Nov 8 03:17:20 UTC 2013
[03:17:37] Logged the message, Master
[03:30:14] PROBLEM - Host analytics1013 is DOWN: PING CRITICAL - Packet loss = 100%
[03:32:13] RECOVERY - Host analytics1013 is UP: PING OK - Packet loss = 0%, RTA = 0.46 ms
[03:32:13] PROBLEM - Host analytics1011 is DOWN: PING CRITICAL - Packet loss = 100%
[03:34:03] RECOVERY - Host analytics1011 is UP: PING OK - Packet loss = 0%, RTA = 0.75 ms
[03:41:54] PROBLEM - Host analytics1014 is DOWN: PING CRITICAL - Packet loss = 100%
[03:43:23] RECOVERY - Host analytics1014 is UP: PING OK - Packet loss = 0%, RTA = 0.75 ms
[04:06:41] (CR) Chad: "Meh, don't have to symlink. But with all the cleanups it'll be way easier to just drop these in as-is." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/93622 (owner: Chad)
[04:08:38] AaronSchulz: ping
[04:25:38] (CR) Yurik: [C: 1] Add an extra header for cache variance of W0 banners for proxies.
[operations/puppet] - https://gerrit.wikimedia.org/r/88261 (owner: Dr0ptp4kt)
[04:34:54] (PS1) Chad: nostalgiawiki gets Cirrus [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/94316
[04:36:35] (CR) Chad: [C: 2] nostalgiawiki gets Cirrus [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/94316 (owner: Chad)
[04:39:16] (Merged) jenkins-bot: nostalgiawiki gets Cirrus [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/94316 (owner: Chad)
[04:40:36] !log demon synchronized cirrus.dblist
[04:40:51] Logged the message, Master
[04:42:46] PROBLEM - Host analytics1013 is DOWN: PING CRITICAL - Packet loss = 100%
[04:45:16] RECOVERY - Host analytics1013 is UP: PING OK - Packet loss = 0%, RTA = 0.39 ms
[04:45:26] PROBLEM - Host analytics1014 is DOWN: PING CRITICAL - Packet loss = 100%
[04:46:26] RECOVERY - Host analytics1014 is UP: PING OK - Packet loss = 0%, RTA = 0.48 ms
[04:56:36] PROBLEM - SSH on lvs1001 is CRITICAL: Server answer:
[04:57:36] RECOVERY - SSH on lvs1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0)
[05:03:06] PROBLEM - Host analytics1014 is DOWN: PING CRITICAL - Packet loss = 100%
[05:04:26] RECOVERY - Host analytics1014 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms
[05:14:39] PROBLEM - Host analytics1011 is DOWN: PING CRITICAL - Packet loss = 100%
[05:16:09] RECOVERY - Host analytics1011 is UP: PING OK - Packet loss = 0%, RTA = 0.64 ms
[05:59:19] PROBLEM - SSH on lvs1001 is CRITICAL: Server answer:
[06:03:19] RECOVERY - SSH on lvs1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0)
[06:10:00] PROBLEM - Host analytics1014 is DOWN: PING CRITICAL - Packet loss = 100%
[06:10:30] RECOVERY - Host analytics1014 is UP: PING OK - Packet loss = 0%, RTA = 0.40 ms
[06:24:28] !log analytics1012 down, power mgmt firmware initialization error, opened ticket #6238
[06:24:50] Logged the message, Master
[06:32:10] PROBLEM - Host analytics1014 is DOWN: PING CRITICAL - Packet
loss = 100%
[06:32:30] RECOVERY - Host analytics1014 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms
[06:40:22] (PS1) Ori.livneh: Add a githook for logging repo modification on tin [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/94319
[06:43:36] (CR) Ori.livneh: [C: 2] Add a githook for logging repo modification on tin [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/94319 (owner: Ori.livneh)
[06:45:02] !log ori updated /a/common to I70534e64e: nostalgiawiki gets Cirrus
[06:45:18] Logged the message, Master
[06:45:43] hmmm. not quite right.
[06:51:39] (PS1) Ori.livneh: Remove '--first-parent' arg from rev-list invocation in logmsg-git-hook [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/94320
[06:51:40] PROBLEM - Host analytics1011 is DOWN: PING CRITICAL - Packet loss = 100%
[06:51:58] (CR) Ori.livneh: [C: 2] Remove '--first-parent' arg from rev-list invocation in logmsg-git-hook [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/94320 (owner: Ori.livneh)
[06:52:39] !log ori updated /a/common to I3691bbf3a: Remove '--first-parent' arg from rev-list invocation in logmsg-git-hook
[06:52:59] Logged the message, Master
[06:53:10] RECOVERY - Host analytics1011 is UP: PING OK - Packet loss = 0%, RTA = 0.64 ms
[06:54:48] !log ori synchronized logmsg-git-hook 'logmsg-git-hook is meant only for tin, but syncing it for consistency'
[06:55:04] Logged the message, Master
[06:55:49] ori-l: is it reasonable to wrap the sha with {{Gerrit|sha}} for easy clicking on the SAL?
[06:56:23] I thought about that, but remember the message goes to IRC & Twitter, too
[06:56:34] and on Wikitech we have a Lua template that does the trick
[06:56:35] yeah...
[06:56:46] oh?
[06:57:12] the one I botched by mucking up adminbot's hash regexp a while ago
[06:57:42] oh right, so it's known that SAL doesn't show your last tests as linkified
[06:58:32] i thought it would, tbh
[06:58:39] let me look at the Lua template, sec
[07:01:34] brr, I misremembered
[07:01:44] you're right, the ones that are linkified come with {{Gerrit}}
[07:01:54] i guess we should follow the convention of annotating for wikitext
[07:01:59] so I'll take the suggestion
[07:02:26] wait,
[07:03:17] * greg-g waits
[07:05:11] PROBLEM - Host analytics1011 is DOWN: PING CRITICAL - Packet loss = 100%
[07:05:57] (PS1) Ori.livneh: logmsg-git-hook: wrap Change-Ids in {{Gerrit|...}} [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/94321
[07:06:16] (CR) Ori.livneh: [C: 2] logmsg-git-hook: wrap Change-Ids in {{Gerrit|...}} [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/94321 (owner: Ori.livneh)
[07:06:16] RECOVERY - Host analytics1011 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms
[07:06:16] PROBLEM - puppet disabled on analytics1011 is CRITICAL: Connection refused by host
[07:06:45] :
[07:06:46] :)
[07:07:16] oohhhhh
[07:07:16] RECOVERY - puppet disabled on analytics1011 is OK: OK
[07:07:27] I missed a '}'
[07:07:34] god damn it
[07:07:45] obviously shouldn't be self merging
[07:07:49] :P
[07:08:29] (PS1) Ori.livneh: Fix typo in logmsg-git-hook [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/94322
[07:08:43] (CR) Ori.livneh: [C: 2] Fix typo in logmsg-git-hook [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/94322 (owner: Ori.livneh)
[07:09:02] !log ori updated /a/common to {{Gerrit|I027abe363}}: Fix typo in logmsg-git-hook
[07:09:06] wee
[07:09:16] Logged the message, Master
[07:09:38] (PS1) ArielGlenn: add back thm1/2 to dhcp [operations/puppet] - https://gerrit.wikimedia.org/r/94323
[07:10:05] that should come in handy, I hope
[07:11:24] incremental improvements, always good
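The wrapping of Change-Ids in {{Gerrit|...}} discussed above could look roughly like the following Python sketch. It is not the code of logmsg-git-hook, which the log does not show: the regular expression (an "I" followed by 7 to 40 hex digits) and the function name `wrap_change_ids` are assumptions made for illustration.

```python
import re

# Assumed shape of an abbreviated or full Change-Id: 'I' followed by
# 7 to 40 lowercase hex digits. The lookbehind skips ids that already
# sit inside a {{Gerrit|...}} wrapper, so the function is idempotent.
CHANGE_ID = re.compile(r"(?<!\{\{Gerrit\|)\bI[0-9a-f]{7,40}\b")


def wrap_change_ids(message):
    """Wrap every bare Change-Id in a log message in {{Gerrit|...}}."""
    return CHANGE_ID.sub(lambda m: "{{Gerrit|" + m.group(0) + "}}", message)
```

Building the replacement from one opening and one closing string keeps the braces balanced, which is the kind of slip ("I missed a '}'") that the follow-up typo fix in the log addressed.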
[07:11:51] yes, commit + N typo fixes :)
[07:11:52] (CR) ArielGlenn: [C: 2] add back thm1/2 to dhcp [operations/puppet] - https://gerrit.wikimedia.org/r/94323 (owner: ArielGlenn)
[07:11:58] very incremental
[07:12:09] hey, some of us take smaller steps than others
[07:12:13] its ok
[07:25:06] PROBLEM - Host analytics1014 is DOWN: PING CRITICAL - Packet loss = 100%
[07:25:26] RECOVERY - Host analytics1014 is UP: PING OK - Packet loss = 0%, RTA = 7.74 ms
[08:15:02] PROBLEM - Host analytics1014 is DOWN: PING CRITICAL - Packet loss = 100%
[08:16:32] RECOVERY - Host analytics1014 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[08:21:40] morning
[08:22:02] PROBLEM - Host analytics1011 is DOWN: PING CRITICAL - Packet loss = 100%
[08:22:32] morning!
[08:23:04] what's up with analytics?
[08:23:22] RECOVERY - Host analytics1011 is UP: PING OK - Packet loss = 0%, RTA = 0.73 ms
[09:12:01] (PS5) Faidon Liambotis: Add all Asian countries in the list [operations/dns] - https://gerrit.wikimedia.org/r/80974
[09:12:02] (PS1) Faidon Liambotis: Switch traffic back to ulsfo [operations/dns] - https://gerrit.wikimedia.org/r/94325
[09:12:44] (CR) Faidon Liambotis: [C: 2] Switch traffic back to ulsfo [operations/dns] - https://gerrit.wikimedia.org/r/94325 (owner: Faidon Liambotis)
[09:15:36] (CR) Faidon Liambotis: [C: 2] Add all Asian countries in the list [operations/dns] - https://gerrit.wikimedia.org/r/80974 (owner: Faidon Liambotis)
[09:39:07] PROBLEM - Host analytics1014 is DOWN: PING CRITICAL - Packet loss = 100%
[09:42:27] RECOVERY - Host analytics1014 is UP: PING OK - Packet loss = 0%, RTA = 0.58 ms
[09:43:37] PROBLEM - search indices - check lucene status page on search1003 is CRITICAL: Connection timed out
[09:44:27] RECOVERY - search indices - check lucene status page on search1003 is OK: HTTP OK: HTTP/1.1 200 OK - 269 bytes in 0.001 second response time
[09:44:27] PROBLEM - Host analytics1014 is DOWN: PING CRITICAL - Packet loss =
100%
[09:46:27] RECOVERY - Host analytics1014 is UP: PING OK - Packet loss = 0%, RTA = 2.63 ms
[09:49:57] PROBLEM - Host analytics1011 is DOWN: PING CRITICAL - Packet loss = 100%
[09:51:27] RECOVERY - Host analytics1011 is UP: PING OK - Packet loss = 0%, RTA = 0.66 ms
[09:54:57] PROBLEM - Host analytics1014 is DOWN: PING CRITICAL - Packet loss = 100%
[09:59:27] RECOVERY - Host analytics1014 is UP: PING OK - Packet loss = 0%, RTA = 2.30 ms
[10:27:14] PROBLEM - Host analytics1014 is DOWN: PING CRITICAL - Packet loss = 100%
[10:28:32] RECOVERY - Host analytics1014 is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms
[10:43:57] (PS4) Mark Bergsma: Repartition eqiad LVS service IPs [operations/dns] - https://gerrit.wikimedia.org/r/92343
[10:47:03] (Abandoned) Mark Bergsma: Repartition esams LVS service IPs [operations/dns] - https://gerrit.wikimedia.org/r/92344 (owner: Mark Bergsma)
[10:48:02] PROBLEM - Host analytics1014 is DOWN: PING CRITICAL - Packet loss = 100%
[10:48:32] RECOVERY - Host analytics1014 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms
[10:48:44] apergos: is this you?
[10:57:31] (PS5) Mark Bergsma: Repartition eqiad LVS service IPs [operations/dns] - https://gerrit.wikimedia.org/r/92343
[10:58:02] PROBLEM - Host analytics1013 is DOWN: PING CRITICAL - Packet loss = 100%
[10:59:12] RECOVERY - Host analytics1013 is UP: PING OK - Packet loss = 0%, RTA = 4.56 ms
[11:05:59] PROBLEM - Host analytics1013 is DOWN: PING CRITICAL - Packet loss = 100%
[11:09:09] RECOVERY - Host analytics1013 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms
[11:18:51] what, analytics10**?
nope
[11:19:24] I only looked at 1012 because it was down for 5 hours according to icinga
[11:20:15] (PS6) Mark Bergsma: Repartition eqiad LVS service IPs [operations/dns] - https://gerrit.wikimedia.org/r/92343
[11:21:59] (PS1) Mark Bergsma: Change IP of osm-lb.eqiad [operations/dns] - https://gerrit.wikimedia.org/r/94337
[11:22:26] (CR) Mark Bergsma: [C: 2] Change IP of osm-lb.eqiad [operations/dns] - https://gerrit.wikimedia.org/r/94337 (owner: Mark Bergsma)
[11:24:24] (PS1) Mark Bergsma: Change osm-lb.eqiad IP address [operations/puppet] - https://gerrit.wikimedia.org/r/94339
[11:26:01] (CR) Mark Bergsma: [C: 2] Change osm-lb.eqiad IP address [operations/puppet] - https://gerrit.wikimedia.org/r/94339 (owner: Mark Bergsma)
[11:32:42] (PS1) Mark Bergsma: Add reverse DNS for new upload-lb.eqiad IPs [operations/dns] - https://gerrit.wikimedia.org/r/94341
[11:34:07] (CR) Mark Bergsma: [C: 2] Add reverse DNS for new upload-lb.eqiad IPs [operations/dns] - https://gerrit.wikimedia.org/r/94341 (owner: Mark Bergsma)
[11:39:14] (PS1) Mark Bergsma: Move parsoid-lb IPv6 address into the right new range [operations/dns] - https://gerrit.wikimedia.org/r/94342
[11:40:00] (CR) Mark Bergsma: [C: 2] Move parsoid-lb IPv6 address into the right new range [operations/dns] - https://gerrit.wikimedia.org/r/94342 (owner: Mark Bergsma)
[11:41:49] PROBLEM - Host analytics1013 is DOWN: PING CRITICAL - Packet loss = 100%
[11:42:20] well these reboots are no good...
9 reboots since early this morning on analytics1014, 4 on analytics1011, 2 on analytics1013
[11:42:25] guess tht might be 3 now
[11:43:09] RECOVERY - Host analytics1013 is UP: PING OK - Packet loss = 0%, RTA = 0.57 ms
[11:43:35] yup
[11:45:21] (PS1) Mark Bergsma: Add new upload-lb.eqiad IP addresses according to the new Zero scheme [operations/puppet] - https://gerrit.wikimedia.org/r/94343
[11:46:24] (CR) Mark Bergsma: [C: 2] Add new upload-lb.eqiad IP addresses according to the new Zero scheme [operations/puppet] - https://gerrit.wikimedia.org/r/94343 (owner: Mark Bergsma)
[11:48:55] (PS1) Mark Bergsma: Add the new upload-lb.eqiad IP addresses to the protoproxies [operations/puppet] - https://gerrit.wikimedia.org/r/94344
[11:49:59] PROBLEM - Host analytics1011 is DOWN: PING CRITICAL - Packet loss = 100%
[11:50:30] (CR) Mark Bergsma: [C: 2] Add the new upload-lb.eqiad IP addresses to the protoproxies [operations/puppet] - https://gerrit.wikimedia.org/r/94344 (owner: Mark Bergsma)
[11:51:19] RECOVERY - Host analytics1011 is UP: PING OK - Packet loss = 0%, RTA = 0.54 ms
[12:07:40] RECOVERY - DPKG on palladium is OK: All packages OK
[12:08:00] RECOVERY - Puppetmaster HTTPS on palladium is OK: HTTP OK: Status line output matched 400 - 336 bytes in 1.074 second response time
[12:19:21] PROBLEM - Host analytics1011 is DOWN: PING CRITICAL - Packet loss = 100%
[12:20:30] RECOVERY - Host analytics1011 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms
[12:27:10] PROBLEM - Backend Squid HTTP on sq80 is CRITICAL: Connection refused
[12:37:20] PROBLEM - Host analytics1011 is DOWN: PING CRITICAL - Packet loss = 100%
[12:39:20] RECOVERY - Host analytics1011 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms
[12:40:15] (PS1) Akosiaris: Disallow commit, merge, rebase on backends private [operations/puppet] - https://gerrit.wikimedia.org/r/94348
[12:47:50] PROBLEM - Host analytics1013 is DOWN: PING CRITICAL - Packet loss = 100%
[12:48:21] (CR)
Akosiaris: [C: 2] Disallow commit, merge, rebase on backends private [operations/puppet] - https://gerrit.wikimedia.org/r/94348 (owner: Akosiaris)
[12:50:10] RECOVERY - Host analytics1013 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms
[12:50:28] so... I am thinking about temporarily sending all uslfo machines to palladium for puppet as test. Any objections ?
[12:51:57] would you want to send one and run puppetd --test on it? or you have already done this?
[12:56:14] (PS1) Mark Bergsma: Create a backend_random director, and use it for login requests [operations/puppet] - https://gerrit.wikimedia.org/r/94350
[12:57:08] (CR) jenkins-bot: [V: -1] Create a backend_random director, and use it for login requests [operations/puppet] - https://gerrit.wikimedia.org/r/94350 (owner: Mark Bergsma)
[12:58:34] (PS2) Mark Bergsma: Create a backend_random director, and use it for login requests [operations/puppet] - https://gerrit.wikimedia.org/r/94350
[12:58:41] (CR) jenkins-bot: [V: -1] Create a backend_random director, and use it for login requests [operations/puppet] - https://gerrit.wikimedia.org/r/94350 (owner: Mark Bergsma)
[13:12:44] apergos: yes i have already done that
[13:12:55] nice
[13:12:57] mangled cp4001 /etc/hosts
[13:13:13] hmm I have no objections, if you are around to check on them in an hour just in case
[13:13:22] it would be nice for them to "just work"
[13:13:27] I 'll be around
[13:13:36] yeah write... plug and pray ?
[13:13:40] right*
[13:13:42] :-D
[13:13:48] man... what did I just write ?
[13:13:56] right :-D [13:14:49] so yeah go to town, you can kick me in an hour and I'll run my 'check on puppet runs' script too, which is not guaranteed but will at least give an overview if something went badly awry [13:15:07] though [13:15:25] s/though// (taking it back) [13:29:23] RECOVERY - check_job_queue on fenari is OK: JOBQUEUE OK - all job queues below 10,000 [13:29:52] PROBLEM - Host analytics1014 is DOWN: PING CRITICAL - Packet loss = 100% [13:30:27] well no console messages at the time of reboot, batches of cpu power messages from time to time earlier [13:30:48] no messages in hadoop log right at time of boot (prev msg was a few minutes earlier) [13:31:22] RECOVERY - Host analytics1014 is UP: PING OK - Packet loss = 0%, RTA = 0.85 ms [13:32:32] PROBLEM - check_job_queue on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:38:41] (PS1) Akosiaris: Puppetmaster backends optimizations [operations/puppet] - https://gerrit.wikimedia.org/r/94352 [13:39:51] (CR) Akosiaris: [C: 2] Puppetmaster backends optimizations [operations/puppet] - https://gerrit.wikimedia.org/r/94352 (owner: Akosiaris) [13:48:42] PROBLEM - Host analytics1011 is DOWN: PING CRITICAL - Packet loss = 100% [13:50:22] RECOVERY - Host analytics1011 is UP: PING OK - Packet loss = 0%, RTA = 0.76 ms [13:59:08] (PS1) Arav93: Renamed $wmfConfigDir to $wmgConfigDir in mediawiki-config [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/94354 [13:59:47] (CR) jenkins-bot: [V: -1] Renamed $wmfConfigDir to $wmgConfigDir in mediawiki-config [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/94354 (owner: Arav93) [14:21:37] PROBLEM - Host analytics1014 is DOWN: PING CRITICAL - Packet loss = 100% [14:22:37] RECOVERY - Host analytics1014 is UP: PING OK - Packet loss = 0%, RTA = 0.71 ms [14:22:37] PROBLEM - Host analytics1011 is DOWN: PING CRITICAL - Packet loss = 100% [14:22:44] (PS1) Akosiaris: fix mpt-statusd mess 
[operations/puppet] - https://gerrit.wikimedia.org/r/94356 [14:24:27] RECOVERY - Host analytics1011 is UP: PING OK - Packet loss = 0%, RTA = 1.88 ms [14:25:53] (CR) Akosiaris: [C: 2] fix mpt-statusd mess [operations/puppet] - https://gerrit.wikimedia.org/r/94356 (owner: Akosiaris) [14:26:11] akosiaris: can you please file a Debian bug about this? [14:26:25] I'd argue in the same bug that it should be possible to just install mpt-status without a silly daemon [14:26:29] http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=620186 [14:26:35] and that perhaps even the majority of users would want to do that [14:26:44] because they have icinga or some similar monitoring solution [14:26:46] someone already did and no answer since 2011 [14:27:01] ping it [14:27:02] it never hurts :) [14:42:07] PROBLEM - Host analytics1011 is DOWN: PING CRITICAL - Packet loss = 100% [14:42:27] PROBLEM - Host analytics1014 is DOWN: PING CRITICAL - Packet loss = 100% [14:43:27] RECOVERY - Host analytics1011 is UP: PING OK - Packet loss = 0%, RTA = 0.54 ms [14:43:37] RECOVERY - Host analytics1014 is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms [14:45:29] hey apergos, do you know what's going on what the analytics nodes? [14:45:50] not really. which is to say, I looked into what was going on on nodes 1011-1014 [14:46:07] 1012 has an open ticket, it refused to come back up, see th log for that one, there's a ticket [14:46:30] power mgmt firmware initialization error, [14:46:47] now the rest of them have been rebooting all day, and here's what I know [14:47:04] I looked at atop, ganglia, etc for them, nothing really outstanding [14:47:31] I do see cpu power limit notifications from time to time on them all [14:47:45] (do we have cstate turned off on all nodes, any idea?) 
[14:47:56] don't know, ottoman might know [14:48:33] nothing useful in syslog, I camped on the console and waited for one host to go down, nothing on mgmt console except these power limit notifications from time to time, and not 2 seconds before the reboot either [14:48:37] s/ottoman/ottomata/ [14:49:08] I ran atop with 30 second updates to see if it saw anything (we log at 10 minute intervals)... nothing unusual there, which jives with that ganglia shows anyways [14:49:27] so not memory, there is maaaybe cpu wio [14:49:31] (says ganglia) [14:49:40] that's the only indicator I could find [14:50:00] I would see about the cstate setting just to rule that out, beyond that *no idea* [14:50:11] and I have to ask if you've seen this behavior on these nodes before [14:51:06] note that I was watching hadoop (well java) on that node too and it was using miniscule cpu too so... [14:51:55] these reboot patterns are quite odd, sometimes it's an hour or two, sometimes 5 mins later (!) [14:52:41] these are auto reboots? [14:53:14] we're not doing it, if that's the question [14:53:22] they're doing it all by themselves [14:53:47] PROBLEM - Host analytics1013 is DOWN: PING CRITICAL - Packet loss = 100% [14:54:02] I look for reboots from day before yesterday and I see none so that's interesting [14:54:39] and it's only those 3 hosts (don't know about 1012, since it's down) [14:55:09] anything that has changed in their setup in the last 1-2 days? maybe start there [14:55:17] RECOVERY - Host analytics1013 is UP: PING OK - Packet loss = 0%, RTA = 0.79 ms [14:56:20] if you want I can get you the time stamp of the first of these [14:56:31] ottomata we need you! 
[14:56:56] Nov 8 00:37:21 analytics1011 (well probably 1-3 minutes before that) [14:57:00] that's the start [14:57:09] UTC [14:57:25] hellooo [14:57:42] see scrollback 'what is happening on analytics nodes' and I will shut up now [14:58:00] i don't have scrollback [14:58:15] boooooo [14:59:17] http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-operations/20131108.txt [14:59:38] start with [14:59:39] [14:42:07] PROBLEM - Host analytics1011 is DOWN: PING CRITICAL - Packet loss = 100% [14:59:41] and read on [14:59:57] PROBLEM - MySQL Recent Restart on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:00:47] RECOVERY - MySQL Recent Restart on db1047 is OK: OK 2590633 seconds since restart [15:00:52] (PS1) Akosiaris: Fixing ownerships, permissions in various places [operations/puppet] - https://gerrit.wikimedia.org/r/94365 [15:01:45] (CR) jenkins-bot: [V: -1] Fixing ownerships, permissions in various places [operations/puppet] - https://gerrit.wikimedia.org/r/94365 (owner: Akosiaris) [15:02:18] PROBLEM - Host analytics1014 is DOWN: PING CRITICAL - Packet loss = 100% [15:04:37] RECOVERY - Host analytics1014 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [15:04:38] (PS2) Akosiaris: Fixing ownerships, permissions in various places [operations/puppet] - https://gerrit.wikimedia.org/r/94365 [15:05:03] one thing that happened recently-ish, is that the java classes were cleaned up, but unless you weren't already using penjdk on analytics nodes... and then, why only those three (maybe 4) have the issue? [15:05:20] rebooting because of java changes ? [15:05:28] that be a sight to see [15:05:58] nothing is impossible of course, but [15:05:59] rebooting if the new java happened to tickle some kernel bug ... but why only those nodes? 
so [15:06:03] of puppet changes you mean, java didn't change, it was purely shuffling around puppet code [15:06:13] I don't really think so [15:07:33] however I have no other ideas either [15:08:08] are they on the same rack? [15:08:18] no idea [15:08:22] check? [15:08:34] I'm looking, sheesh [15:08:41] I can't check and type here at the same time :-P [15:08:55] well you said "no idea", not "checking" :) [15:09:00] using windows 8 ? [15:09:05] full screen apps ? [15:09:09] :P [15:09:12] I can't type in two windows at once [15:09:17] I can type in one place at once. [15:09:36] maybe you have four hands, or two very independent hands :-P [15:10:45] so those three (4) are in the same rack, but so are... 1015 - 1027 which have no issue [15:12:43] could be a PDU issue ? [15:13:16] no experience with that; how would we tell? [15:13:33] good question [15:22:19] (sorry, in standup, will respond shortly) [15:22:43] standup > outage? [15:23:24] haha [15:23:27] i guess so ? [15:23:32] o [15:23:32] ok [15:23:33] so [15:23:46] a11 − an27 are in the same rack [15:23:49] dunno, I'll try using it next time the site's down :P [15:23:59] an11-an13 are hadoop datanodes, and are also hadoop journalnodes [15:24:20] oh but an14 is going down too? 
[15:24:23] hm [15:24:51] (CR) Andrew Bogott: [C: 2] Backup dynamicproxy-api's data.db [operations/puppet] - https://gerrit.wikimedia.org/r/94149 (owner: Andrew Bogott) [15:24:59] yes, 14 is the one I monitored closely and got nothing useful from :-/ [15:27:12] ok, then that probably rules that out, i thought maybe the fact that those three were journalnodes would make them different [15:27:22] but if an14 is going down then that's not it [15:35:21] (PS1) Jgreen: add lutetium [operations/dns] - https://gerrit.wikimedia.org/r/94371 [15:35:22] (CR) jenkins-bot: [V: -1] add lutetium [operations/dns] - https://gerrit.wikimedia.org/r/94371 (owner: Jgreen) [15:35:47] hm [15:35:50] gerrit's broken then [15:36:01] er, [15:36:04] jenkins I meant [15:37:54] PROBLEM - Host analytics1014 is DOWN: PING CRITICAL - Packet loss = 100% [15:39:14] RECOVERY - Host analytics1014 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [15:39:53] (CR) Jgreen: [C: 2 V: 1] "Overriding Jenkins manually on advice from security channel" [operations/dns] - https://gerrit.wikimedia.org/r/94371 (owner: Jgreen) [15:40:05] V+2 [15:40:51] uh hm [15:41:05] ./templates/wmnet:lutetium 1H IN A 10.65.6.13 it's already in there... [15:41:35] it has a public ip too? [15:42:22] sorry 10.64.40.111 for the non mgmt ip [15:42:28] one of those lines anyways :-/ [15:50:20] apergos: huh analytics1012 has been down for 15 hours? [15:50:25] i'm goign to try powercycle? [15:51:26] 1) read the log [15:51:29] 2) read the ticket [15:51:34] ottomata: [15:52:04] 3) tl;dr powercycled, it won't come up, ticket for chris to go phyisically intervene [15:52:08] *physically [15:52:42] !log authdns update to add A/PTR for lutetium public ip [15:52:44] PROBLEM - Host analytics1011 is DOWN: PING CRITICAL - Packet loss = 100% [15:52:53] whaa weird [15:53:01] Logged the message, Master [15:53:05] this is likely related to the reboots, right? 
[15:53:17] I can only hope so [15:53:23] if there's nothing in logs, possibly some some power issue? [15:53:27] because that means it's something physical [15:53:43] apergos, not sure how to check cstate setting [15:53:44] there is *nothing* in the logs. seriously. except the batchs of cpu power whine from time to time [15:53:44] googling [15:54:02] ah it's in the bios, if you haven't done it then you won't know if some machines have it disabled and some not [15:54:14] RECOVERY - Host analytics1011 is UP: PING OK - Packet loss = 0%, RTA = 0.49 ms [15:54:17] I thought you might have done the setup and install? [15:55:04] but the other possibility is that analytics1012 was rebooting for the same reasons as the other 3 and now won't come back because of some additional error (these things happen)... we'll find out soon I guess [15:55:48] yeh [15:56:48] (CR) Ottomata: [C: 2 V: 2] Turn off geowiki monitoring [operations/puppet] - https://gerrit.wikimedia.org/r/94290 (owner: QChris) [15:57:40] (CR) Ottomata: [C: 2 V: 2] Move geowiki's name for research MySQL config into separate variable [operations/puppet] - https://gerrit.wikimedia.org/r/94291 (owner: QChris) [15:58:27] (CR) Ottomata: [C: 2 V: 2] Move geowiki's name for globaldev MySQL config into separate variable [operations/puppet] - https://gerrit.wikimedia.org/r/94292 (owner: QChris) [15:59:21] (CR) Ottomata: [C: 2 V: 2] Split geowiki paths in base path and scripts [operations/puppet] - https://gerrit.wikimedia.org/r/94293 (owner: QChris) [16:00:17] (CR) Ottomata: [C: 2 V: 2] Move geowiki scripts into geowiki's scripts subdirectory [operations/puppet] - https://gerrit.wikimedia.org/r/94294 (owner: QChris) [16:00:47] (CR) Ottomata: [C: 2 V: 2] Rename geowiki backups to logs, as they are only logs [operations/puppet] - https://gerrit.wikimedia.org/r/94295 (owner: QChris) [16:01:13] (CR) Ottomata: [C: 2 V: 2] Move geowiki data checkout 
into geowiki's base directory [operations/puppet] - https://gerrit.wikimedia.org/r/94296 (owner: QChris) [16:02:30] (CR) Ottomata: [C: 2 V: 2] Split geowiki data repository into private and public parts [operations/puppet] - https://gerrit.wikimedia.org/r/94297 (owner: QChris) [16:03:26] (CR) Ottomata: [C: 2 V: 2] Turn on generating geowiki's limn files [operations/puppet] - https://gerrit.wikimedia.org/r/94298 (owner: QChris) [16:06:27] (CR) Jgreen: [V: 2] "overriding jenkins-bot" [operations/dns] - https://gerrit.wikimedia.org/r/94371 (owner: Jgreen) [16:07:21] (CR) Ottomata: "(1 comment)" [operations/puppet] - https://gerrit.wikimedia.org/r/94299 (owner: QChris) [16:08:03] (CR) ArielGlenn: [C: 2 V: 2] add lutetium [operations/dns] - https://gerrit.wikimedia.org/r/94371 (owner: Jgreen) [16:08:22] (PS1) Jgreen: starting over with lutetium DNS change because gerrit hates me [operations/dns] - https://gerrit.wikimedia.org/r/94376 [16:10:14] (CR) Ottomata: "(1 comment)" [operations/puppet] - https://gerrit.wikimedia.org/r/94299 (owner: QChris) [16:11:22] (CR) Ottomata: "(1 comment)" [operations/puppet] - https://gerrit.wikimedia.org/r/94300 (owner: QChris) [16:12:35] (Abandoned) Jgreen: starting over with lutetium DNS change because gerrit hates me [operations/dns] - https://gerrit.wikimedia.org/r/94376 (owner: Jgreen) [16:27:57] PROBLEM - Host analytics1011 is DOWN: PING CRITICAL - Packet loss = 100% [16:28:22] paravoid paravoooiiiiid [16:28:33] what to do about varnishkafka.log? 
[16:29:17] RECOVERY - Host analytics1011 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [16:32:33] apergos: it's a pdu issue [16:33:00] one phase is down...could be a fuse...hoping not cuz i don't know how we're going to replace it [16:33:17] !log restarting ps1-c7-eqiad [16:33:35] Logged the message, Master [16:33:53] cmjohnson1: fuse swap is easy =] [16:34:03] wow you were able to tell [16:34:04] easier on the 1wide models [16:34:11] how, btw? so we know for the next time [16:34:16] how do you figure...it's on the inside of the pdu [16:34:21] server side [16:35:47] PROBLEM - Host analytics1013 is DOWN: PING CRITICAL - Packet loss = 100% [16:37:17] RECOVERY - Host analytics1013 is UP: PING OK - Packet loss = 0%, RTA = 0.51 ms [16:39:15] robh: y phase is at a flashing 0...not getting a FE though...still thinking it's fuse. Any thoughts? [16:41:07] PROBLEM - Host analytics1013 is DOWN: PING CRITICAL - Packet loss = 100% [16:43:17] RECOVERY - Host analytics1013 is UP: PING OK - Packet loss = 0%, RTA = 0.40 ms [16:57:59] if only one of the three phases has 0 [16:58:01] then yea its fuse [16:58:13] cmjohnson1: ^ sorry was afk cleaning up breakfast stuff [16:58:26] huzzah for redundant power! [16:59:11] having a bitch of time trying to figure it out though [16:59:17] so which pdu is this? [16:59:25] just curious, cuz some are much easier to get fuse out than others [16:59:34] (the 2 tower wide ones are a pain) [16:59:42] the 1 tower wide ones you dont even need any tools [16:59:53] the fuse box just pivots out of the pdu. [16:59:53] ps1-c7-eqiad [16:59:57] ahh, 2 wide [16:59:58] suck [17:00:04] those you have to remove the cover [17:00:11] and be careful to not shock the shit out of yourself. 
[17:00:15] (the plastic cover) [17:00:41] i suggest insulated needle nose pliars (easier than fitting fingers in there) [17:00:46] yeah but they're on the side of the pdu....i can remove from the pins that hold it on the rack but not much room and I still have to figure out which one it is [17:00:54] yep [17:00:59] its a pita [17:01:14] have to lift out of bracket and twist the entire tower to access [17:01:19] i do not envy you right now. [17:01:31] so there is an upside to us switching servertech models [17:01:42] thx...yes there is but that is only 6 cabs [17:01:43] this is much easier in the one tower wide model ;] [17:01:53] heh [17:02:06] i dont think we have any spare fuses either [17:02:14] you may have to pull and go to lowes [17:02:22] (i know i got replacement at lowes once) [17:02:36] though i dont recall if it was in tampa or there, i just purchased 2 of them [17:02:43] and i used both up over time, one in each site. [17:03:39] it was tampa..i did that once there as well [17:05:21] thanks for looking into this guys! [17:09:34] cmjohnson1: prolly, i know i purchased two or three of them [17:09:42] but i think i took the spare with me when i moved to ashburn [17:09:45] and used it there [17:10:06] one imagines any lowes would have it though (though we tend to be unlucky in electrical needs in ashburn lowes/homedepot ;) [17:10:33] dont use the open source fuse though! (thats when you just jam a length of wire in there ;) [17:11:20] robh: iirc if the fuse is blown it would be black..correct? [17:11:41] usually has some kind of scorch mark, and the middle metal thin wire will be gone [17:12:30] meh, fuses. [17:12:41] should be breakers! [17:13:47] There are a number of arguments on fuses being more precise than breakers, and having less power loss in transfer [17:14:24] but i have not seen any actual empirical evidence to support that. (though it somewhat makes sense i suppose on paper) [17:14:33] i just hate replacing fuses. 
[17:28:57] !log removing power to B Side on ps1-c7-eqiad [17:29:12] Logged the message, Master [17:32:54] PROBLEM - Host analytics1013 is DOWN: PING CRITICAL - Packet loss = 100% [17:33:55] hrmm [17:34:06] cmjohnson1: ^ coincidence? [17:34:59] that's been up and down [17:35:14] RECOVERY - Host analytics1013 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [17:36:12] 1011-1014 [17:36:35] there' plenty about these in the scrollback... [17:36:54] more than anyone wants, probably [17:39:24] robh B XY and B YZ are still jacked ..B XZ is fine [17:39:48] hrmm, two of them are messed. [17:39:50] B XZ has always been fine [17:39:55] thats odd. [17:40:08] just B side Y....the fuses I removed did not looked bad [17:40:13] its normal to see say XY borked and other two fine [17:40:17] but not for two [17:40:19] usually there is an indicator it blew [17:40:26] You may need to call servertech support =[ [17:40:45] so i have not seen an entire phase go bad [17:40:50] (like now, all Y is borked) [17:40:51] yippie...the fist thing they'll want is a fw update...which I've yet to figure out how to do [17:41:07] i've only experienced a single fuse messing up a circiut group (xy, yz, xz) [17:41:28] heh, if you8 cannot get it workin i can help for that [17:41:36] we'll get an ftp server running on yer laptop [17:41:50] * RobH checks if os x does that natively anyhow [17:41:52] nope! [17:42:14] no but you can do it http://osxdaily.com/2011/09/29/start-an-ftp-or-sftp-server-in-mac-os-x-lion/ [17:42:19] so you can install macports (complex) or just spin up a linux machine [17:42:25] or that [17:42:28] wow [17:42:36] uhh, yes [17:42:40] do that. [17:42:45] you dont want sftp [17:42:49] haha, RobH just said installing macports was more complex than seting up a linux machine. 
[17:42:49] servertechs cannot handle that [17:42:55] greg-g: it is damn it [17:43:00] heh [17:43:02] I've never touched macports [17:43:09] its great [17:43:13] but complex [17:43:15] but if you have an issue, its a bit complex. [17:43:21] indeed [17:43:26] where a linux vm is simple. [17:43:33] :) [17:43:45] says the dude who uses os x desktop [17:43:51] i know i am a walking contridiction. [17:44:07] and now a bunch of us have a song in our head [17:44:36] https://www.youtube.com/watch?v=I5zEP4kvfnc for everyone else [17:44:38] You are welcome. [17:44:56] its the bay area, isnt everyone required to like green day? [17:45:04] I.. haven't heard that [17:45:23] I think it's more "you must like skinny jeans and some band no one else has heard of" [17:45:38] 'course, I'm the one who moved to Petaluma, so don't take my opinion for much ;) [17:45:42] i dunno, dont they have a couple restaraunts here? [17:45:46] like insanely popular ones? [17:45:57] that i have not gone to cuz i dislike waitign in lines [17:46:04] PROBLEM - Host ps1-c7-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [17:48:14] brew > macports [17:48:24] RECOVERY - Host ps1-c7-eqiad is UP: PING OK - Packet loss = 0%, RTA = 3.69 ms [17:48:24] PROBLEM - Host analytics1013 is DOWN: PING CRITICAL - Packet loss = 100% [17:49:20] <^d> bd808: I didn't know anyone still used macports. [17:49:44] RECOVERY - Host analytics1013 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [17:50:08] meh, still works for me. [17:50:16] but im not a dev, i dont use a LOT of things in it. [17:50:27] and i have my own linux server in a datacenter for when i wanna do linux stuffs [17:50:48] <^d> I still have nightmares about recompiling gcc all over again [17:50:50] <^d> And again [17:50:51] <^d> And again [17:51:12] <^d> f'ing macports. [17:51:21] robh: what type of cdu's are these? [17:51:33] do you know off hand/ [17:53:08] ^d: former gentoo user? 
:-P [17:53:29] cmjohnson: i do, not offhand [17:53:30] lemme check a ticket [17:53:48] CS-84VDY-L2130 [17:53:55] cmjohnson: ^ [17:54:03] those are the white two tower models [17:54:08] awesome thx ... [17:56:36] ottomata: did anything happen leading up to the power failure? [17:57:37] <^d> apergos: Heh, no [18:09:11] PROBLEM - Host analytics1014 is DOWN: PING CRITICAL - Packet loss = 100% [18:10:41] RECOVERY - Host analytics1014 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [18:20:24] page blow up:https://en.wikinews.org/wiki/Wikinews:Water_cooler/technical [18:20:29] Reedy: this you? ^ [18:20:39] urgh. I get a 503 on en.wikipedia.org [18:21:01] PROBLEM - SSH on db1050 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:21:01] PROBLEM - Apache HTTP on mw1102 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:21:01] PROBLEM - Apache HTTP on mw1096 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:21:11] PROBLEM - MySQL Slave Delay on db1050 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:21:11] PROBLEM - Full LVS Snapshot on db1050 is CRITICAL: Timeout while attempting connection [18:21:11] PROBLEM - MySQL Idle Transactions on db1050 is CRITICAL: Timeout while attempting connection [18:21:11] PROBLEM - mysqld processes on db1050 is CRITICAL: Timeout while attempting connection [18:21:11] PROBLEM - Apache HTTP on mw1076 is CRITICAL: Connection timed out [18:21:12] PROBLEM - Apache HTTP on mw1087 is CRITICAL: Connection timed out [18:21:12] PROBLEM - MySQL disk space on db1050 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:21:13] PROBLEM - Disk space on db1050 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:21:14] .... [18:21:15] dangittt [18:21:20] well, thats not good [18:21:21] PROBLEM - MySQL Replication Heartbeat on db1050 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:21:21] PROBLEM - MySQL InnoDB on db1050 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[18:21:22] PROBLEM - MySQL Slave Running on db1050 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:21:22] PROBLEM - MySQL Processlist on db1050 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:21:22] PROBLEM - Apache HTTP on mw1211 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:21:25] oh joy [18:21:31] PROBLEM - MySQL Recent Restart on db1050 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:21:37] So is this deployment related? [18:21:41] PROBLEM - DPKG on db1050 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:21:41] PROBLEM - Apache HTTP on mw1030 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:21:41] PROBLEM - Apache HTTP on mw1078 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:21:41] PROBLEM - Apache HTTP on mw1026 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:21:41] PROBLEM - Apache HTTP on mw1171 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:21:42] PROBLEM - Apache HTTP on mw1072 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:21:42] PROBLEM - Apache HTTP on mw1083 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:21:43] PROBLEM - Apache HTTP on mw1113 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:21:43] PROBLEM - Apache HTTP on mw1209 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:21:44] PROBLEM - Apache HTTP on mw1094 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:21:44] PROBLEM - Apache HTTP on mw1067 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:21:45] PROBLEM - Apache HTTP on mw1042 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:21:45] PROBLEM - RAID on db1050 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[18:21:46] PROBLEM - Apache HTTP on mw1177 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:21:46] PROBLEM - Apache HTTP on mw1214 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:21:47] PROBLEM - Apache HTTP on mw1048 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:21:47] PROBLEM - Apache HTTP on mw1103 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:21:48] PROBLEM - Apache HTTP on mw1020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:21:48] PROBLEM - Apache HTTP on mw1109 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:21:50] greg-g: no deployments on Fridays? :P [18:21:52] PROBLEM - Apache HTTP on mw1046 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:21:53] PROBLEM - Apache HTTP on mw1168 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:21:53] PROBLEM - Apache HTTP on mw1112 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:21:53] PROBLEM - Apache HTTP on mw1104 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:21:53] PROBLEM - Apache HTTP on mw1058 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:21:53] PROBLEM - Apache HTTP on mw1061 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:21:54] PROBLEM - Apache HTTP on mw1035 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:21:54] PROBLEM - Apache HTTP on mw1183 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:21:55] PROBLEM - Apache HTTP on mw1219 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:21:55] PROBLEM - Apache HTTP on mw1084 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:21:56] PROBLEM - Apache HTTP on mw1077 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:21:56] PROBLEM - Apache HTTP on mw1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:21:57] PROBLEM - Apache HTTP on mw1216 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:21:57] PROBLEM - Apache HTTP on mw1180 is CRITICAL: CRITICAL - 
Socket timeout after 10 seconds [18:21:58] PROBLEM - Apache HTTP on mw1040 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:21:58] PROBLEM - Apache HTTP on mw1099 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:22:01] PROBLEM - Apache HTTP on mw1063 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:22:01] PROBLEM - Apache HTTP on mw1110 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:22:01] PROBLEM - Apache HTTP on mw1050 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:22:01] PROBLEM - Apache HTTP on mw1212 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:22:01] PROBLEM - Apache HTTP on mw1064 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:22:02] PROBLEM - Apache HTTP on mw1086 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:22:02] PROBLEM - Apache HTTP on mw1079 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:22:06] RobH: I have no idea, I just bother reedy first for everything [18:22:10] wtf [18:22:12] PROBLEM - Apache HTTP on mw1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:22:12] PROBLEM - puppet disabled on db1050 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[18:22:12] PROBLEM - Apache HTTP on mw1088 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:22:12] PROBLEM - Apache HTTP on mw1172 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:22:12] PROBLEM - Apache HTTP on mw1047 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:22:12] PROBLEM - Apache HTTP on mw1025 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:22:12] PROBLEM - Apache HTTP on mw1166 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:22:13] PROBLEM - Apache HTTP on mw1051 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:22:13] PROBLEM - Apache HTTP on mw1062 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:22:14] PROBLEM - Apache HTTP on mw1123 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:22:14] PROBLEM - Apache HTTP on mw1185 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:22:15] PROBLEM - Apache HTTP on mw1162 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:22:15] PROBLEM - Apache HTTP on mw1027 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:22:16] PROBLEM - Apache HTTP on mw1019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:22:16] PROBLEM - Apache HTTP on mw1092 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:22:17] PROBLEM - Apache HTTP on mw1193 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:22:17] PROBLEM - Apache HTTP on mw1091 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:22:18] PROBLEM - Apache HTTP on mw1038 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:22:19] Ditto greg-g [18:22:21] PROBLEM - Apache HTTP on mw1052 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:22:21] PROBLEM - Apache HTTP on mw1164 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:22:21] PROBLEM - Apache HTTP on mw1163 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:22:21] PROBLEM - Apache HTTP on mw1188 is CRITICAL: CRITICAL - Socket timeout after 10 
seconds [18:22:21] PROBLEM - Apache HTTP on mw1080 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:22:24] yay [18:23:03] first response, pull up deployment cal.. oh wait thats on cluster. [18:23:09] https://wikitech.wikimedia.org/wiki/Server_admin_log [18:23:28] PROBLEM - Apache HTTP on mw1126 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:23:28] PROBLEM - Apache HTTP on mw1117 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:23:28] PROBLEM - Apache HTTP on mw1133 is CRITICAL: Connection timed out [18:23:29] Request: POST http://it.wikipedia.org/wiki/Speciale:Blocca/109.52.157.239, from 208.80.152.16 via sq75.wikimedia.org (squid/2.7.STABLE9) to 208.80.152.81 (208.80.152.81) [18:23:29] Error: ERR_READ_TIMEOUT, errno [No Error] at Fri, 08 Nov 2013 18:23:13 GMT [18:23:31] on it.wiki [18:23:34] last thing is cutting power in eqiad, sure you pulled the right cable cmjohnson ? :) [18:23:41] PROBLEM - Apache HTTP on mw1144 is CRITICAL: Connection timed out [18:23:47] but that was 55min ago :) [18:23:51] PROBLEM - Host analytics1014 is DOWN: PING CRITICAL - Packet loss = 100% [18:23:54] its an isolated rack [18:23:57] didn't pull any cable [18:24:03] cmjohnson's power work shouldnt touch this stuff anyhow [18:24:10] those app servers are still reachable and have apache running [18:24:12] just checking... mistakes happen [18:24:15] unless the analytics rack is now part of central site structure ;] (yep) [18:24:16] at least the ones i checked [18:24:19] aren't all mw1xxx in eqiad? [18:24:23] <^d> Yes [18:24:24] yes. 
[18:24:26] site is served in eqiad [18:24:31] so any site outages are in eqiad ;] [18:24:31] RECOVERY - Apache HTTP on mw1207 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.047 second response time [18:24:41] RECOVERY - Apache HTTP on mw1194 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.955 second response time [18:24:41] RECOVERY - Host analytics1014 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [18:24:43] ori-l yes but the failures are across several racks [18:24:44] hahaha [18:24:47] fix yourself cluster! [18:24:50] keep 'em coming.. [18:24:51] RECOVERY - Apache HTTP on mw1134 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.725 second response time [18:24:51] RECOVERY - Apache HTTP on mw1128 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.544 second response time [18:24:51] PROBLEM - Apache HTTP on mw1189 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:24:51] RECOVERY - Apache HTTP on mw1142 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.316 second response time [18:24:55] IT LIVES [18:25:01] * greg-g waits for all the spam [18:25:01] RECOVERY - Apache HTTP on mw1136 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.060 second response time [18:25:01] I don't see anything funny [18:25:02] RECOVERY - Apache HTTP on mw1140 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.062 second response time [18:25:02] RECOVERY - Apache HTTP on mw1123 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.062 second response time [18:25:02] RECOVERY - Apache HTTP on mw1116 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.094 second response time [18:25:02] RECOVERY - Apache HTTP on mw1132 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.468 second response time [18:25:02] PROBLEM - Apache HTTP on mw1196 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:25:02] PROBLEM - Apache HTTP on mw1204 is CRITICAL: 
CRITICAL - Socket timeout after 10 seconds [18:25:04] YAY [18:25:05] zombie cluster [18:25:06] more more more [18:25:10] seriously though [18:25:12] RECOVERY - Apache HTTP on mw1126 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.056 second response time [18:25:12] RECOVERY - Apache HTTP on mw1121 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.059 second response time [18:25:12] PROBLEM - Apache HTTP on mw1191 is CRITICAL: Connection timed out [18:25:12] PROBLEM - Apache HTTP on mw1201 is CRITICAL: Connection timed out [18:25:12] RECOVERY - Apache HTTP on mw1133 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.059 second response time [18:25:12] <^d> RobH: That new self-healing software is awesome. [18:25:14] need to know what caused that. [18:25:18] <^d> We should've bought this years ago. [18:25:21] PROBLEM - Apache HTTP on mw1152 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:25:24] can people please get the fun chit chat out of this channel? 
[18:25:31] PROBLEM - Apache HTTP on mw1143 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:25:31] PROBLEM - Apache HTTP on mw1198 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:25:41] PROBLEM - Apache HTTP on mw1135 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:25:41] PROBLEM - Apache HTTP on mw1146 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:25:41] PROBLEM - LVS HTTP IPv4 on api.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:25:51] PROBLEM - Apache HTTP on mw1124 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:25:51] PROBLEM - Apache HTTP on mw1147 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:25:51] PROBLEM - Apache HTTP on mw1131 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:25:51] PROBLEM - Apache HTTP on mw1138 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:25:51] PROBLEM - Apache HTTP on mw1145 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:26:09] https://ganglia.wikimedia.org/latest/?c=API%20application%20servers%20eqiad&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2 [18:26:11] RECOVERY - Apache HTTP on mw1152 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.450 second response time [18:26:30] so those are all api [18:26:31] PROBLEM - Apache HTTP on mw1151 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:26:32] PROBLEM - LVS HTTP IPv6 on wikidata-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:26:33] that fell over [18:26:44] 1196-1202 [18:27:11] PROBLEM - Apache HTTP on mw1205 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:27:12] PROBLEM - Apache HTTP on mw1203 is CRITICAL: Connection timed out [18:27:21] PROBLEM - Apache HTTP on mw1208 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:27:21] PROBLEM - Apache HTTP on mw1195 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:27:21] PROBLEM - Apache HTTP on mw1199 is 
CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:27:21] PROBLEM - LVS HTTPS IPv4 on wikidata-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:27:25] RECOVERY - Apache HTTP on mw1139 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.920 second response time [18:27:25] PROBLEM - Apache HTTP on mw1206 is CRITICAL: Connection timed out [18:27:25] <^d> RobH: It's not just APIs. I'm getting 503s from normal page loads [18:27:31] PROBLEM - LVS HTTPS IPv6 on wikidata-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:27:32] fwiw, site works for me right now [18:27:33] yea [18:27:34] Just got a 504 [18:27:36] RobH: 1208 too [18:27:41] PROBLEM - Apache HTTP on mw1207 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:27:41] PROBLEM - Apache HTTP on mw1194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:27:44] it's both api and appservers [18:27:51] PROBLEM - Apache HTTP on mw1134 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:27:51] PROBLEM - Apache HTTP on mw1129 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:27:58] <^d> LVS is complaining too. Wondering if the problem's there and the apaches are just fine. 
[18:28:01] started at 18:17 [18:28:01] PROBLEM - Apache HTTP on mw1142 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:28:01] PROBLEM - Apache HTTP on mw1128 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:28:01] PROBLEM - Apache HTTP on mw1116 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:28:11] PROBLEM - LVS HTTP IPv6 on wikivoyage-lb.esams.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.762 second response time [18:28:11] PROBLEM - Apache HTTP on mw1136 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:28:11] PROBLEM - Apache HTTP on mw1140 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:28:11] PROBLEM - Apache HTTP on mw1132 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:28:11] PROBLEM - Apache HTTP on mw1123 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:28:12] lvs complains when apaches fail and it cannot depool more of them usually [18:28:18] <^d> (Every apache I've tried sshing to looks sane) [18:28:21] RECOVERY - Apache HTTP on mw1151 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.941 second response time [18:28:21] PROBLEM - Apache HTTP on mw1130 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:28:21] PROBLEM - Apache HTTP on mw1121 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:28:21] PROBLEM - Apache HTTP on mw1126 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:28:21] PROBLEM - Apache HTTP on mw1127 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:28:22] PROBLEM - Apache HTTP on mw1133 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:28:28] same. 
[18:28:31] PROBLEM - Apache HTTP on mw1125 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:28:31] PROBLEM - Apache HTTP on mw1137 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:28:31] PROBLEM - Apache HTTP on mw1120 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:28:31] PROBLEM - Apache HTTP on mw1141 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:28:31] PROBLEM - Apache HTTP on mw1118 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:28:32] PROBLEM - Apache HTTP on mw1119 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:28:32] PROBLEM - Apache HTTP on mw1149 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:28:33] PROBLEM - Apache HTTP on mw1122 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:28:33] the appservers themselves look quite normal though, ack ^D [18:28:33] it'd be either database or memcached [18:29:09] I'd go with memcached [18:29:09] 2013-11-08 18:28:57 mw65 enwiki: Memcached error for key "enwiki:messages:en:status" on server "127.0.0.1:11211": SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY [18:29:09] 2013-11-08 18:28:57 mw65 enwiki: Memcached error for key "enwiki:lag_times:db1056" on server "127.0.0.1:11211": SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY [18:29:09] 2013-11-08 18:28:57 mw65 enwiki: Memcached error for key "enwiki:lag_times:db1056:lock" on server "127.0.0.1:11211": SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY [18:29:09] 2013-11-08 18:28:57 mw65 enwiki: Memcached error for key "enwiki:lag_times:db1056" on server "127.0.0.1:11211": SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY [18:29:10] 2013-11-08 18:28:57 mw65 enwiki: Memcached error for key "enwiki:page-lastedit:8ef90d6fe2da950ea6e365958d8d26a7" on server "127.0.0.1:11211": SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY [18:29:11] RECOVERY - LVS HTTP IPv6 on wikivoyage-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 45363 bytes in 0.424 second 
response time [18:29:20] that happens [18:29:21] PROBLEM - LVS HTTP IPv6 on wikiversity-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:29:22] PROBLEM - Apache HTTP on mw1152 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:29:28] Oh [18:29:28] No [18:29:29] Fri Nov 8 18:29:21 UTC 2013 mw1019 enwiki Error connecting to 10.64.16.145: :real_connect(): (HY000/2003): Can't connect to MySQL server on '10.64.16.145' (110) [18:29:29] supporting infrastructure for memcached is fine [18:29:30] Fri Nov 8 18:29:21 UTC 2013 mw1218 enwiki Error connecting to 10.64.16.32: :real_connect(): (08004/1040): Too many connections [18:29:32] Fri Nov 8 18:29:21 UTC 2013 mw1011 enwiki Error connecting to 10.64.16.32: :real_connect(): (08004/1040): Too many connections [18:29:34] Fri Nov 8 18:29:21 UTC 2013 mw1011 enwiki Error connecting to 10.64.16.144: :real_connect(): (08004/1040): Too many connections [18:29:35] dbtree doesn't show obvious problems [18:29:35] db1050 is down [18:29:36] Fri Nov 8 18:29:21 UTC 2013 mw1138 enwiki Error connecting to 10.64.16.32: :real_connect(): (08004/1040): Too many connections [18:29:39] ahh [18:29:40] 10.64.16.145 [18:29:43] db1021 is unhappy [18:29:55] oh, it does, black, i looked for red [18:29:58] a few others unhappy too [18:30:01] PROBLEM - Varnish HTTP text-frontend on amssq50 is CRITICAL: Connection timed out [18:30:09] need me to try to track down anyone in particular to help? 
[18:30:13] RECOVERY - Apache HTTP on mw1152 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.862 second response time [18:30:14] springle-afk: [18:30:22] PROBLEM - Apache HTTP on mw1114 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:30:25] one down slave db [18:30:31] RECOVERY - Apache HTTP on mw1149 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.317 second response time [18:30:31] PROBLEM - Apache HTTP on mw1115 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:30:31] PROBLEM - LVS HTTP IPv6 on wikinews-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:30:32] shouldnt really cause all this though... [18:30:36] Eloquence: I'll tell you in a sec, thanks [18:31:00] I have a paid cell so if someone needs to make a long distance call to sean i can. [18:31:11] RECOVERY - LVS HTTP IPv6 on wikiversity-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 64540 bytes in 3.052 second response time [18:31:16] Reedy: memcached has been wonked for months [18:31:27] AaronSchulz: Lots of spam in the srs log then [18:31:31] RECOVERY - Apache HTTP on mw1120 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.935 second response time [18:31:32] PROBLEM - LVS HTTP IPv6 on foundation-lb.esams.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 3309 bytes in 0.207 second response time [18:31:32] RECOVERY - LVS HTTP IPv4 on api.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 2951 bytes in 3.209 second response time [18:31:39] srs? 
[18:31:41] RECOVERY - Apache HTTP on mw1135 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.821 second response time [18:31:41] RECOVERY - Apache HTTP on mw1146 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.130 second response time [18:31:41] RECOVERY - Apache HTTP on mw1145 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.058 second response time [18:31:41] RECOVERY - Apache HTTP on mw1131 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.036 second response time [18:31:41] RECOVERY - Apache HTTP on mw1129 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.056 second response time [18:31:45] memcached-serious, he means [18:31:46] databases are recovering [18:31:51] RECOVERY - Apache HTTP on mw1124 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.031 second response time [18:31:51] db1050 still down [18:31:51] RECOVERY - Apache HTTP on mw1138 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.321 second response time [18:31:51] RECOVERY - Apache HTTP on mw1128 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.063 second response time [18:31:51] RECOVERY - Apache HTTP on mw1116 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.057 second response time [18:32:01] RECOVERY - Apache HTTP on mw1142 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.024 second response time [18:32:01] PROBLEM - Varnish HTTP text-frontend on amssq52 is CRITICAL: Connection timed out [18:32:01] PROBLEM - Varnish HTTP text-frontend on amssq57 is CRITICAL: Connection timed out [18:32:01] RECOVERY - Apache HTTP on mw1140 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.055 second response time [18:32:01] RECOVERY - Apache HTTP on mw1132 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.061 second response time [18:32:02] bah, when it self recovers it makes it hard(er) to see what caused it [18:32:02] RECOVERY - Apache 
HTTP on mw1136 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.077 second response time [18:32:02] RECOVERY - Apache HTTP on mw1123 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.055 second response time [18:32:03] RECOVERY - Apache HTTP on mw1148 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.063 second response time [18:32:11] RECOVERY - Apache HTTP on mw1130 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.063 second response time [18:32:11] RECOVERY - Apache HTTP on mw1121 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.059 second response time [18:32:11] RECOVERY - Apache HTTP on mw1126 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.054 second response time [18:32:11] RECOVERY - Apache HTTP on mw1117 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.383 second response time [18:32:16] Did anyone do anything? [18:32:18] paravoid: i think i should powercycle it , it's frozen on mgmt [18:32:21] hmm, lots of DB connection errors [18:32:21] RECOVERY - Apache HTTP on mw1125 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.057 second response time [18:32:21] RECOVERY - Apache HTTP on mw1143 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.088 second response time [18:32:21] RECOVERY - Apache HTTP on mw1141 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.085 second response time [18:32:21] RECOVERY - Apache HTTP on mw1118 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.449 second response time [18:32:21] RECOVERY - Apache HTTP on mw1122 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.085 second response time [18:32:22] RECOVERY - Apache HTTP on mw1115 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.639 second response time [18:32:26] mutante: I was just in [18:32:27] mutante: please do [18:32:31] RECOVERY - Apache HTTP on mw1119 is OK: HTTP OK: HTTP/1.1 301 Moved 
Permanently - 808 bytes in 8.639 second response time [18:32:31] PROBLEM - Varnish HTTP text-frontend on amssq56 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:32:31] PROBLEM - Frontend Squid HTTP on cp1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:32:31] RECOVERY - LVS HTTP IPv6 on foundation-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 64540 bytes in 1.778 second response time [18:32:31] RECOVERY - Apache HTTP on mw1144 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.652 second response time [18:32:41] RECOVERY - Apache HTTP on mw1134 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.061 second response time [18:32:42] AaronSchulz: Yup, we'd already got that ;) [18:32:51] RECOVERY - Varnish HTTP text-frontend on amssq57 is OK: HTTP OK: HTTP/1.1 200 OK - 198 bytes in 3.743 second response time [18:32:55] !log powercycling frozen db1050 [18:33:15] Logged the message, Master [18:33:21] PROBLEM - Apache HTTP on mw1152 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:33:21] RECOVERY - Frontend Squid HTTP on cp1001 is OK: HTTP OK: HTTP/1.0 200 OK - 1293 bytes in 0.002 second response time [18:33:22] RECOVERY - Apache HTTP on mw1190 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.076 second response time [18:33:31] RECOVERY - Apache HTTP on mw1198 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.049 second response time [18:33:32] PROBLEM - Apache HTTP on mw1149 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:33:32] RECOVERY - Apache HTTP on mw1192 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.784 second response time [18:33:41] RECOVERY - Apache HTTP on mw1194 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.695 second response time [18:33:41] RECOVERY - Apache HTTP on mw1202 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.230 second response time [18:33:41] RECOVERY - Apache HTTP on mw1200 
is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.051 second response time [18:33:52] RECOVERY - Apache HTTP on mw1197 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.059 second response time [18:33:52] RECOVERY - Apache HTTP on mw1189 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.434 second response time [18:33:52] RECOVERY - Apache HTTP on mw1204 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.050 second response time [18:33:52] RECOVERY - Varnish HTTP text-frontend on amssq52 is OK: HTTP OK: HTTP/1.1 200 OK - 198 bytes in 1.430 second response time [18:33:53] PROBLEM - LVS HTTP IPv4 on wikidata-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:33:57] PROBLEM - LVS HTTP IPv6 on wikidata-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:34:01] PROBLEM - LVS HTTP IPv6 on wikidata-lb.esams.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 7.211 second response time [18:34:01] RECOVERY - Varnish HTTP text-frontend on amssq50 is OK: HTTP OK: HTTP/1.1 200 OK - 198 bytes in 9.665 second response time [18:34:01] PROBLEM - LVS HTTP IPv6 on wikiquote-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:34:11] RECOVERY - Apache HTTP on mw1193 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.068 second response time [18:34:11] RECOVERY - Apache HTTP on mw1205 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.061 second response time [18:34:12] RECOVERY - Apache HTTP on mw1208 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.049 second response time [18:34:12] RECOVERY - Apache HTTP on mw1191 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.056 second response time [18:34:12] RECOVERY - Apache HTTP on mw1203 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.049 second response time [18:34:12] RECOVERY 
- Apache HTTP on mw1195 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.053 second response time [18:34:12] RECOVERY - Apache HTTP on mw1199 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.057 second response time [18:34:13] RECOVERY - Apache HTTP on mw1201 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.047 second response time [18:34:13] PROBLEM - NTP on db1050 is CRITICAL: NTP CRITICAL: No response from NTP server [18:34:21] RECOVERY - Apache HTTP on mw1152 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.750 second response time [18:34:21] RECOVERY - Apache HTTP on mw1137 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.061 second response time [18:34:21] RECOVERY - Varnish HTTP text-frontend on amssq56 is OK: HTTP OK: HTTP/1.1 200 OK - 199 bytes in 0.205 second response time [18:34:21] RECOVERY - Apache HTTP on mw1206 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.900 second response time [18:34:21] RECOVERY - LVS HTTP IPv6 on wikinews-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 64404 bytes in 0.895 second response time [18:34:22] RECOVERY - LVS HTTPS IPv6 on wikidata-lb.pmtpa.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 1013 bytes in 0.289 second response time [18:34:27] RECOVERY - Apache HTTP on mw1114 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.054 second response time [18:34:27] RECOVERY - Apache HTTP on mw1127 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.062 second response time [18:34:29] paravoid: it seems dead, no output after cycle [18:34:31] RECOVERY - Apache HTTP on mw1149 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.721 second response time [18:34:31] RECOVERY - Apache HTTP on mw1133 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.064 second response time [18:34:32] oh wait [18:34:39] hold on, it's booting now [18:34:41] RECOVERY - Apache 
HTTP on mw1207 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.443 second response time [18:34:41] RECOVERY - Apache HTTP on mw1067 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.354 second response time [18:34:41] RECOVERY - Apache HTTP on mw1109 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.619 second response time [18:34:41] RECOVERY - Apache HTTP on mw1213 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.940 second response time [18:34:41] RECOVERY - Apache HTTP on mw1094 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.639 second response time [18:34:42] RECOVERY - LVS HTTP IPv4 on wikidata-lb.pmtpa.wikimedia.org is OK: HTTP OK: HTTP/1.0 301 Moved Permanently - 592 bytes in 0.078 second response time [18:34:45] RECOVERY - Apache HTTP on mw1084 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.056 second response time [18:34:46] RECOVERY - Apache HTTP on mw1147 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.057 second response time [18:34:46] RECOVERY - Apache HTTP on mw1112 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.086 second response time [18:34:46] RECOVERY - Apache HTTP on mw1104 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.066 second response time [18:34:46] RECOVERY - LVS HTTP IPv6 on wikidata-lb.pmtpa.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 1018 bytes in 0.072 second response time [18:34:47] well, now ganglia is getting info from the apaches it wasnt. 
[18:34:50] RECOVERY - Apache HTTP on mw1219 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.791 second response time [18:34:51] RECOVERY - Apache HTTP on mw1216 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.511 second response time [18:34:51] RECOVERY - Apache HTTP on mw1095 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.592 second response time [18:34:51] RECOVERY - Apache HTTP on mw1077 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.622 second response time [18:34:51] RECOVERY - Apache HTTP on mw1099 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.667 second response time [18:34:51] PROBLEM - Apache HTTP on mw1131 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:34:52] RECOVERY - LVS HTTP IPv6 on wikidata-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 969 bytes in 0.372 second response time [18:34:55] Configuring memory.Please wait ... [18:35:01] RECOVERY - Apache HTTP on mw1212 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.048 second response time [18:35:01] RECOVERY - Apache HTTP on mw1102 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.056 second response time [18:35:01] RECOVERY - Apache HTTP on mw1096 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.056 second response time [18:35:01] RECOVERY - Apache HTTP on mw1079 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.077 second response time [18:35:01] RECOVERY - LVS HTTP IPv6 on wikiquote-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 64404 bytes in 0.915 second response time [18:35:11] RECOVERY - Apache HTTP on mw1088 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.057 second response time [18:35:11] RECOVERY - Apache HTTP on mw1092 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.054 second response time [18:35:12] RECOVERY - Apache HTTP on mw1087 is OK: HTTP OK: HTTP/1.1 301 
Moved Permanently - 808 bytes in 0.060 second response time [18:35:12] RECOVERY - Apache HTTP on mw1080 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.064 second response time [18:35:12] RECOVERY - Apache HTTP on mw1069 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.056 second response time [18:35:12] RECOVERY - Apache HTTP on mw1075 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.088 second response time [18:35:12] RECOVERY - LVS HTTPS IPv4 on wikidata-lb.pmtpa.wikimedia.org is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 1021 bytes in 0.188 second response time [18:35:16] RECOVERY - Apache HTTP on mw1091 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.226 second response time [18:35:16] RECOVERY - Apache HTTP on mw1076 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.689 second response time [18:35:16] RECOVERY - Apache HTTP on mw1220 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.452 second response time [18:35:16] RECOVERY - Apache HTTP on mw1218 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.294 second response time [18:35:21] RECOVERY - Apache HTTP on mw1101 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.320 second response time [18:35:21] RECOVERY - Apache HTTP on mw1211 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.405 second response time [18:35:21] RECOVERY - Apache HTTP on mw1184 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.861 second response time [18:35:21] RECOVERY - Apache HTTP on mw1173 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.584 second response time [18:35:21] RECOVERY - Apache HTTP on mw1068 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.053 second response time [18:35:22] RECOVERY - Apache HTTP on mw1073 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.066 second response time [18:35:22] RECOVERY - Apache HTTP on mw1093 is 
OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.058 second response time [18:35:23] RECOVERY - Apache HTTP on mw1098 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.080 second response time [18:35:23] RECOVERY - Apache HTTP on mw1065 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.050 second response time [18:35:24] RECOVERY - Apache HTTP on mw1082 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.069 second response time [18:35:24] RECOVERY - Apache HTTP on mw1210 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.051 second response time [18:35:25] RECOVERY - Apache HTTP on mw1106 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.068 second response time [18:35:25] RECOVERY - Apache HTTP on mw1074 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.073 second response time [18:35:26] RECOVERY - Apache HTTP on mw1089 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.069 second response time [18:35:26] RECOVERY - Apache HTTP on mw1105 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.061 second response time [18:35:27] RECOVERY - Apache HTTP on mw1107 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.051 second response time [18:35:27] RECOVERY - Apache HTTP on mw1111 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.066 second response time [18:35:31] RECOVERY - LVS HTTP IPv6 on wikidata-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 593 bytes in 4.244 second response time [18:35:32] ehm... so unrelated to db1050 ? 
[18:35:35] RECOVERY - Apache HTTP on mw1108 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.080 second response time [18:35:35] PROBLEM - Backend Squid HTTP on cp1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:35:35] RECOVERY - Apache HTTP on mw1078 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.051 second response time [18:35:35] RECOVERY - Apache HTTP on mw1090 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.056 second response time [18:35:35] RECOVERY - Apache HTTP on mw1209 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.059 second response time [18:35:36] RECOVERY - Apache HTTP on mw1100 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.045 second response time [18:35:36] RECOVERY - Apache HTTP on mw1083 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.065 second response time [18:35:37] RECOVERY - Apache HTTP on mw1113 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.067 second response time [18:35:37] RECOVERY - Apache HTTP on mw1103 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.055 second response time [18:35:40] or it was pushed over due to them [18:35:41] RECOVERY - Apache HTTP on mw1081 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.653 second response time [18:35:41] RECOVERY - Apache HTTP on mw1131 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.057 second response time [18:35:50] but seems odd that it could cause as its a replication slave. [18:35:51] RECOVERY - Apache HTTP on mw1196 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.051 second response time [18:35:56] no, the timestamp correlates exactly [18:36:01] RECOVERY - Apache HTTP on mw1086 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.653 second response time [18:36:10] Is 1050 in a special group maybe? 
[18:36:12] PROBLEM - Host db1050 is DOWN: PING CRITICAL - Packet loss = 100% [18:36:12] RECOVERY - Apache HTTP on mw1071 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.064 second response time [18:36:12] RECOVERY - Apache HTTP on mw1217 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.735 second response time [18:36:14] the other slaves in the same shard had a huge load spike at the exact same time, though [18:36:15] but we see recoveries before it's even back up [18:36:19] Like one of those ones that has special indexes for RC queries or whatever? [18:36:22] PROBLEM - Frontend Squid HTTP on cp1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:36:22] PROBLEM - Backend Squid HTTP on cp1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:36:22] RECOVERY - Apache HTTP on mw1066 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.057 second response time [18:36:22] RECOVERY - Apache HTTP on mw1097 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.062 second response time [18:36:22] RECOVERY - Apache HTTP on mw1215 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.051 second response time [18:36:23] RECOVERY - Backend Squid HTTP on cp1006 is OK: HTTP OK: HTTP/1.0 200 OK - 1249 bytes in 0.002 second response time [18:36:24] paravoid: i'd think that it would fall over as a symptom rather than a cause, but dunno [18:36:31] PROBLEM - Apache HTTP on mw1190 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:36:31] RECOVERY - Apache HTTP on mw1214 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.046 second response time [18:36:36] it's weighted at 100 (compared to the rest), [18:36:37] i bet sean would know [18:36:40] and it's not in a special group no [18:36:41] RECOVERY - Apache HTTP on mw1072 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.218 second response time [18:36:45] or apergos [18:36:46] heh [18:37:04] what's the scene? 
[18:37:05] db snaps, that's it. [18:37:08] back up? [18:37:12] should be, yes. [18:37:13] RECOVERY - Frontend Squid HTTP on cp1002 is OK: HTTP OK: HTTP/1.0 200 OK - 1283 bytes in 0.004 second response time [18:37:24] paravoid: Did you do anything to the dbs? [18:37:28] no [18:37:29] any chance someone got a show full processlist? [18:37:41] RECOVERY - Apache HTTP on mw1042 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.306 second response time [18:37:41] PROBLEM - Apache HTTP on mw1192 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:37:45] apergos, it didn't connect for me [18:37:45] I wasnt on them, was checkign links for saturation in memcached rack, heh [18:37:51] RECOVERY - Apache HTTP on mw1035 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.297 second response time [18:37:51] RECOVERY - Apache HTTP on mw1058 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.018 second response time [18:37:51] RECOVERY - Apache HTTP on mw1110 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.054 second response time [18:38:01] PROBLEM - Apache HTTP on mw1197 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:38:01] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&c=MySQL+eqiad&h=db1052.eqiad.wmnet&tab=m&vn=&mc=2&z=medium&metric_group=ALLGROUPS [18:38:12] RECOVERY - Backend Squid HTTP on cp1007 is OK: HTTP OK: HTTP/1.0 200 OK - 1249 bytes in 0.002 second response time [18:38:12] RECOVERY - Apache HTTP on mw1052 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.053 second response time [18:38:12] RECOVERY - Apache HTTP on mw1062 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.010 second response time [18:38:12] RECOVERY - Apache HTTP on mw1186 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.041 second response time [18:38:12] RECOVERY - Apache HTTP on mw1187 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.049 second response time 
[18:38:12] RECOVERY - Apache HTTP on mw1185 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.948 second response time [18:38:12] RECOVERY - Apache HTTP on mw1181 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.959 second response time [18:38:18] /dev/sda1: clean, 90571/2444624 files, 864301/9765625 blocks [18:38:19] The disk drive for /a is not ready yet or not present. [18:38:19] Continue to wait, or Press S to skip mounting or M for manual recovery [18:38:24] So something slammed s1, toppling one of its slaves [18:38:39] nice spike [18:38:41] (im asking?) [18:38:44] not stating fact. [18:38:46] Quick question: are we actually recovering or are we cycling up/down? [18:38:53] (impacts the messaging) [18:38:59] continue to wait mutante [18:39:08] need /a or we're dead in the water [18:39:31] PROBLEM - Host analytics1011 is DOWN: PING CRITICAL - Packet loss = 100% [18:39:34] waits [18:40:04] =/ [18:40:45] so we look recovered but we don't know what caused it. correct? [18:40:54] (03PS1) 10Faidon Liambotis: Depool db1050, down, disk failed [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94398 [18:41:21] RECOVERY - Host analytics1011 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [18:41:21] RECOVERY - Host db1050 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [18:41:22] manybubbles: thus far. [18:42:04] (03PS2) 10QChris: Rsync geowiki's bare data-private repository to statistics' webservers [operations/puppet] - 10https://gerrit.wikimedia.org/r/94299 [18:42:05] (03PS2) 10QChris: Checkout geowiki's data-private repo also on statistics' webservers [operations/puppet] - 10https://gerrit.wikimedia.org/r/94300 [18:42:11] RECOVERY - NTP on db1050 is OK: NTP OK: Offset -0.0009763240814 secs [18:42:24] paravoid: if you didnt merge that yet [18:42:31] oh, that sjust ping [18:42:31] nm [18:42:37] oh, and ntp.... [18:43:03] won't hurt anything to depool it, capacity wise [18:43:03] meh, doesnt matter though [18:43:06] its up.. 
[18:43:06] indeed [18:43:10] its not fully up [18:43:18] what does fully mean? [18:43:23] its ssh isnt and mutante is in serial console, so its not up [18:43:27] its not responsive to ssh [18:43:28] ok [18:43:31] so its not up [18:43:46] i spoke to soon when i pinged faidon. [18:43:54] im just working with the static pages right now, editing stuff [18:43:58] and yet you keep pinging me [18:44:00] not working on the dev side currently [18:44:02] (03CR) 10QChris: "(1 comment)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/94300 (owner: 10QChris) [18:44:16] i said yer name not irc tag [18:44:24] not my fault ya ping on everything ;] [18:44:38] (03CR) 10Faidon Liambotis: [C: 032] Depool db1050, down, disk failed [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94398 (owner: 10Faidon Liambotis) [18:44:39] http requests wrking fine [18:45:00] !log faidon updated /a/common to {{Gerrit|Ic66d0e783}}: Depool db1050, down, disk failed [18:45:10] oh, that's new [18:45:14] yes [18:45:16] Logged the message, Master [18:45:40] !log faidon synchronized wmf-config/db-eqiad.php 'depool db1050' [18:45:51] Logged the message, Master [18:45:58] (03CR) 10QChris: "(2 comments)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/94299 (owner: 10QChris) [18:47:05] I have pt-kills running on all s1 shards, nothing so far [18:47:31] PROBLEM - Host analytics1013 is DOWN: PING CRITICAL - Packet loss = 100% [18:47:35] er, not all, a few of them [18:48:21] RECOVERY - Host analytics1013 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [18:48:44] db1050 still sits at fsck [18:48:57] tell me if you want me to just skip mounting [18:49:03] yeah, please do [18:49:16] done, it's at login now [18:49:30] yep, can ssh to it [18:49:31] RECOVERY - MySQL Recent Restart on db1050 is OK: OK seconds since restart [18:49:31] RECOVERY - DPKG on db1050 is OK: All packages OK [18:49:31] RECOVERY - RAID on db1050 is OK: OK: optimal, 1 logical, 2 physical [18:49:37] optimal? 
[18:49:51] RECOVERY - SSH on db1050 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [18:50:11] RECOVERY - puppet disabled on db1050 is OK: OK [18:50:11] RECOVERY - MySQL Slave Delay on db1050 is OK: OK replication delay 0 seconds [18:50:11] RECOVERY - MySQL Slave Running on db1050 is OK: OK replication [18:50:12] RECOVERY - MySQL Idle Transactions on db1050 is OK: OK longest blocking idle transaction sleeps for seconds [18:50:12] RECOVERY - Full LVS Snapshot on db1050 is OK: OK no full LVM snapshot volumes [18:50:12] RECOVERY - Disk space on db1050 is OK: DISK OK [18:50:12] RECOVERY - MySQL disk space on db1050 is OK: DISK OK [18:50:12] RECOVERY - MySQL Replication Heartbeat on db1050 is OK: OK replication delay 0 seconds [18:50:16] !log db1050 back up after skipping mount of failed /a [18:50:30] Logged the message, Master [18:51:02] eh, yea, sub-optimal [18:51:09] the check [18:53:24] root@db1050:/a# mount [18:53:24] /dev/sda1 on / type ext3 (rw,errors=remount-ro) [18:53:31] it's mounted ro [18:53:41] no it's not [18:54:43] Nov 8 18:49:08 db1050 kernel: [ 65.435387] ACPI Error: No handler for Region [IPMI] (ffff880ffbc55240) [IPMI] (20110623/evregion-373) [18:54:46] nope [18:54:50] something's fucked with lv [18:54:51] lvm [18:54:55] oh [18:55:21] Nov 8 18:49:08 db1050 kernel: [ 65.721927] device-mapper: table: 252:3: snapshot: Snapshot cow pairing for exception table handover failed [18:55:31] Nov 8 18:49:08 db1050 kernel: [ 65.731665] device-mapper: ioctl: error adding target to table [18:55:41] yeah all that [18:56:04] besides those error messages [18:56:11] start elsewhere [18:56:14] so /a couldn't be mounted [18:56:17] so cat /etc/fstab [18:56:28] /dev/mapper/tank-data is the disk for /a [18:56:36] fdisk -l that doesn't respond to anything [18:56:50] lvdisplay/vgdisplay/pvdisplay etc. 
then [18:57:21] PROBLEM - Host analytics1011 is DOWN: PING CRITICAL - Packet loss = 100% [18:59:01] RECOVERY - Host analytics1011 is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms [19:01:39] robh: talked to eq about the new cabs...the 2 that came from Tampa did not have doors and we need a side panel [19:02:29] !log B side on ps1-c7-eqiad will be unplugged to be tested [19:02:48] Logged the message, Master [19:03:27] things have been a bit wacky since sept 30th: http://paste.debian.net/64713/ [19:05:22] http://paste.debian.net/64712/ [19:06:56] I've confirmed that losing db1050 was an effect, not a cause [19:07:35] (03CR) 10Dr0ptp4kt: "My manual walkthrough of the code seems to suggest that 'wiki' should work. Admittedly, some of it isn't 100% clear absent actual implemen" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94250 (owner: 10Dr0ptp4kt) [19:07:39] uh-oh [19:07:42] yeah, the increase in db errors doesn't cite db1050 especially before today [19:07:45] db1021 is in trouble [19:08:11] PROBLEM - Host analytics1013 is DOWN: PING CRITICAL - Packet loss = 100% [19:09:20] RECOVERY - Host analytics1013 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [19:09:27] s5? [19:09:38] marktraceur: what does the allcampaigns api method do? [19:10:10] ori-l, lists UploadWizard campaigns [19:10:37] YuviPanda, ^^^ [19:10:37] MaxSem: can you do an explain on the query it generates and do a quick sanity check? [19:11:47] ori-l, what's up with it? 
[19:12:26] It has a number of unique uploaders query which was potentially sketchy if memory serves [19:12:38] https://graphite.wikimedia.org/render/?title=Top%2010%20API%20Methods%20by%20Max%2090%25%20Time%20(ms)%20log(2)%20-1day&from=-1day&width=1024&height=500&until=now&areaMode=none&hideLegend=false&logBase=2&lineWidth=1&lineMode=connected&target=cactiStyle(substr(highestMax(maximumAbove(API.*.tp90,1),10),0,2)) [19:13:16] 536066597 wikiadmin 10.64.0.37:39471 wikidatawiki Query 1 Sending data SELECT /* CirrusSearchUpdater::countLinksToTitle */ COUNT(*) FROM `pagelinks` WHERE pl_namespace = '120' AND pl_title = 'P143' LIMIT 1 0.000 [19:13:16] ewww, I can tell anything in that graph [19:13:22] no data [19:13:22] lots of that [19:13:33] *can't [19:13:40] It also does a lot of wikitext parsing which was slow [19:13:40] ^ manybubbles [19:14:26] <^d> Bah. [19:14:31] <^d> Why the hell are those still running? [19:14:34] <^d> On it. [19:14:43] bawolff, parsing issue should've been resolved [19:14:53] not necessarily connected to the outage [19:15:03] <^d> I'll kill em anyway. [19:15:06] <^d> Should've ended ages ago [19:15:44] 20% of db1021's time spent there [19:15:48] 42% of the queries [19:15:55] so, yeah, kill them please :) [19:16:33] <^d> Hmm, I can't find the script running anymore :\ [19:17:07] mw1002, mw1010 etc. [19:17:21] jobrunners [19:17:25] <^d> Dur, those are in the jobqueue. [19:17:40] <^d> I was thinking it was the one-off script running still. [19:17:57] <^d> Are they all wikidatawiki? [19:19:18] dunno [19:19:44] looks like it [19:19:45] bawolff, heh - slow queries aren't disabled in prod. YuviPanda, let's disable? [19:20:01] <^d> paravoid: Ok, assumed so. I'm going to turn off Cirrus on wikidatawiki for the time being. [19:20:08] <^d> Obviously this doesn't scale for them yet. 
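The "42% of the queries" figure above comes from grouping running queries by the caller comment that MediaWiki prepends to its SQL (for example `/* CirrusSearchUpdater::countLinksToTitle */`). A minimal sketch of that grouping follows, assuming the query texts have already been pulled out of a processlist dump into a list of strings. The names `caller_of` and `caller_shares` are hypothetical, and the sketch counts queries only; it does not reproduce the time-weighted numbers quoted in the channel.

```python
import re
from collections import Counter

# Matches a MediaWiki-style caller comment such as
# "/* CirrusSearchUpdater::countLinksToTitle */" and captures "Class::method".
CALLER_RE = re.compile(r"/\*\s*(\w+::\w+)")


def caller_of(sql: str) -> str:
    """Return the Class::method named in the query comment, if any."""
    match = CALLER_RE.search(sql)
    return match.group(1) if match else "(unknown)"


def caller_shares(queries):
    """Map each caller to its fraction of the given queries (by count)."""
    counts = Counter(caller_of(q) for q in queries)
    total = sum(counts.values())
    return {caller: n / total for caller, n in counts.items()}


if __name__ == "__main__":
    # Hypothetical sample; only the first query text is taken from the log.
    sample = [
        "SELECT /* CirrusSearchUpdater::countLinksToTitle */ COUNT(*) FROM `pagelinks` "
        "WHERE pl_namespace = '120' AND pl_title = 'P143' LIMIT 1",
        "SELECT /* CirrusSearchUpdater::countLinksToTitle */ COUNT(*) FROM `pagelinks` LIMIT 1",
        "/* SpecialAllpages::showToplevel */ select page_title from `page` limit 1",
        "SELECT 1",
    ]
    for caller, share in sorted(caller_shares(sample).items()):
        print(f"{caller}: {share:.0%}")
```

The same regex also picks up comments that carry extra detail after the method name, such as a pager subclass or client IP, because it stops matching at the first whitespace.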
[19:20:18] yes, it's just wikidatawiki, at least for the past 24 hours [19:21:04] s1 top query seems to be [19:21:08] /* SpecialAllpages::showToplevel */ select page_title from `page` where page_namespace = ? and page_is_redirect = ? and (page_title >= ?) and (page_title >= ?) order by page_title limit ? [19:21:30] !log demon synchronized wmf-config/InitialiseSettings.php 'No more Cirrus for wikidatawiki' [19:21:37] ^d: thanks. [19:21:40] MaxSem: the query is supposed to be cached and only used on small categories (in theory) [19:21:42] <^d> np. [19:22:03] Logged the message, Master [19:23:23] <^d> !log jobqueue: dropped Cirrus-related jobs from wikidatawiki's queue [19:23:37] Logged the message, Master [19:26:11] <^d> paravoid: Ok, wikidatawiki shouldn't cause any more problems here. Let me know if that's not the case and I'll whack with a bigger hammer. [19:27:45] and for completeness: db1050: remount it -o ro,norecovery = can't read superblock | xfs_check: /dev/mapper/tank-data is invalid | xfs_repair: find verify superblock .. superblock read failed. fatal error -- invalid argument. .. dunno better now. -> RT #6244 [19:28:10] mutante: /dev/task/data is inaccessible anyway, even a simple block read doesn't work (try fdisk/dd) [19:28:17] no need for mounts or xfs repairs or such [19:29:03] bawolff, frequency is rare indeed, however the performance looks scary: https://graphite.wikimedia.org/dashboard/temporary-35 [19:29:20] ^d: I just saw this - looks like our nasty query is being nasty? [19:29:27] paravoid: alright [19:29:32] <^d> manybubbles: Yes, likely. [19:29:35] YuviPanda, yt? [19:29:48] Maxsem: I don't have permission to view that :p [19:29:50] "our nasty query" / "whack with a bigger hammer" -- can we be specific? [19:29:50] <^d> manybubbles: I turned things off on wikidatawiki for the time being and flushed the affected jobs from the queue. [19:29:57] ^d: this is probably related to the bug I filed this morning. 
[19:29:59] yeah [19:30:15] <^d> ori-l: "whack with a bigger hammer" - turn off more Cirrus wikis if wikidatawiki wasn't enough. [19:30:37] "bug filed this morning" == ? [19:30:37] from the logs I saw the time was mostly being spent on parsing the pages [19:30:42] no other wikis showed cirrussearch queries [19:30:47] bawolff, http://i40.tinypic.com/2ezs5jq.png [19:30:56] the effect of your change would be: http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&h=db1021.eqiad.wmnet&m=cpu_report&s=by+name&mc=2&g=cpu_report&c=MySQL+eqiad [19:31:00] ori-l: https://bugzilla.wikimedia.org/show_bug.cgi?id=56783 [19:31:03] unrelated to the outage an hour ago, though. [19:31:32] I just went digging into databases and it cropped up [19:32:02] makes sense. we never liked that query [19:32:25] does anyone know what the "/* SpecialAllpages::showToplevel */ select page_title from `page` where page_namespace = ? and page_is_redirect = ? and (page_title >= ?) and (page_title >= ?) order by page_title limit ?" query is about? [19:32:25] ^d: sounds like we'll have to replace it entirely [19:32:44] paravoid: it might be uploadwizard, not sure yet [19:33:26] it's 30-65% of time in s1 slaves [19:33:52] and very recent too [19:34:02] (source: ishmael) [19:34:06] <^d> manybubbles: Yeah. [19:36:07] <^d> ori-l: I can't find any callers to showToplevel() outside of core. [19:36:46] <^d> (And nothing in UploadWizard relating to SpecialAllpages) [19:37:00] (03PS1) 10MaxSem: Disable slow UW queries [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94404 [19:37:07] ori-l, ^^^ [19:37:31] it's definitely too rare to cause this outage, but unpleasantly slow anyway:) [19:38:41] well, it'd be good to know definitively [19:38:44] ori-l: do you indications UW was the culprit of today's 18:17 UTC outage? [19:38:50] do you have* [19:39:12] no, nothing really firm. [19:39:32] what would these queries look like? 
[19:39:51] sec [19:40:21] AaronSchulz: ping [19:41:35] ^d: https://bugzilla.wikimedia.org/show_bug.cgi?id=56798 [19:41:46] MaxSem: the one you'd deactivate wld be $result = $dbr->select( array( 'categorylinks', 'page', 'image' ), array( 'count' => 'COUNT(DISTINCT img_user)' ), array( 'cl_to' => $this->getTrackingCategory()->getDBKey(), 'cl_type' => 'file' ), __METHOD__, array( 'USE INDEX' => array( 'categorylinks' => 'cl_timestamp' )), array( 'page' => array( 'INNER JOIN', 'cl_from=page_id' ), 'image' => array( 'INNER JOIN', 'page_title=img_name' [19:41:46] ))); yes? [19:41:51] I'll have a look at it when I wrap up with what I'm working on now [19:42:31] ori-l, yep - getTotalContributorsCount() [19:42:38] (03PS1) 10Dzahn: move IRC stuff to module - WIP [operations/puppet] - 10https://gerrit.wikimedia.org/r/94407 [19:42:39] (03PS1) 10Dzahn: download server module and cleanup - WIP [operations/puppet] - 10https://gerrit.wikimedia.org/r/94408 [19:42:43] that doesn't fit the pattern above [19:43:34] still, that query I'm disabling takes seconds [19:43:44] (03CR) 10jenkins-bot: [V: 04-1] move IRC stuff to module - WIP [operations/puppet] - 10https://gerrit.wikimedia.org/r/94407 (owner: 10Dzahn) [19:44:05] yeah, but deploying it now will make it harder to know definitively what is going on [19:44:09] (03CR) 10jenkins-bot: [V: 04-1] download server module and cleanup - WIP [operations/puppet] - 10https://gerrit.wikimedia.org/r/94408 (owner: 10Dzahn) [19:47:43] probably unrelated, but should be checked: [19:47:47] [08-Nov-2013 19:07:13] Fatal error: Call to a member function isKnown() on a non-object at /usr/local/apache/common-local/php-1.23wmf2/extensions/WikimediaIncubator/WikimediaIncubator.class.php on line 856 [19:48:06] 8 of those all around 19:07-19:08 [19:50:20] (03PS1) 10Dzahn: turn wikistats into module - WIP [operations/puppet] - 10https://gerrit.wikimedia.org/r/94409 [19:50:23] <^d> ori-l: Filed https://bugzilla.wikimedia.org/show_bug.cgi?id=56800 to track 
[19:51:29] (03PS1) 10Chad: Disable Cirrus on wikidatawiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94410 [19:51:53] (03CR) 10Chad: [C: 032 V: 032] "Committing my livehack. Already deployed on cluster." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94410 (owner: 10Chad) [19:55:02] (03CR) 10Ottomata: [C: 032 V: 032] Checkout geowiki's data-private repo also on statistics' webservers [operations/puppet] - 10https://gerrit.wikimedia.org/r/94300 (owner: 10QChris) [19:56:07] paravoid, piiiinnnnnggggggggg about varnishkafka.log :) [19:56:32] ottomata: there was a pretty serious outage [19:56:42] oh [19:56:43] sorry [19:56:53] was at lunch / getting my repaired computer back [19:57:13] also a lot of wikidata exceptions of this variety: [19:57:15] 2013-11-08 18:01:20 mw1136 wikidatawiki: [f309f1f9] /w/api.php?action=translationaids&format=json&title=Translations%3AWikidata%3AProperty+proposal%2F10%2Fde Exception from line 179 of /usr/local/apache/common-local/php-1.23wmf2/extensions/Wikibase/lib/includes/EntityFactory.php: failed to deserialize [19:57:31] aude: does this look familiar ^ ? [19:57:37] once you go digging... [19:57:41] depressing [19:58:03] not wikidata specifically, just the number of issues we're finding [19:59:46] db1021 starts failing earlier [19:59:59] db1021 was being overloaded by cirrussearch [20:00:19] unless you're referring to something that happened after the deploy above [20:01:47] /* IndexPager::buildQueryInfo */ select log_id, log_type, log_action, log_timestamp, log_user, log_user_text, log_namespace, log_title, log_comment, log_params, log_deleted, user_id, user_name, user_editcount, ts_tags from `logging` left join `user` on ((log_user=user_id)) left join `tag_summary` on ((ts_log_id=log_id)) where (log_type not in(?+)) and log_user = ? and ((log_deleted & ?) = ?) and (log_type != ?) and (log_type != ?) 
order by [20:02:13] 9% of queries, 90% of time [20:02:16] fun [20:03:48] (03PS2) 10Vogone: Various changes to wikidatawiki's user rights configuration [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/92070 [20:04:08] cmjohnson1: asw-c-eqiad just mailed, power supply failed [20:04:15] <^d> paravoid: That IndexPager query is Cirrus? [20:04:48] no idea [20:05:03] <^d> Doesn't look like anything we'd do. [20:05:12] <^d> But I'll double check :) [20:05:55] (03PS2) 10Ottomata: Turn off geowiki monitoring [operations/puppet] - 10https://gerrit.wikimedia.org/r/94290 (owner: 10QChris) [20:06:02] (03CR) 10Ottomata: [C: 032 V: 032] Turn off geowiki monitoring [operations/puppet] - 10https://gerrit.wikimedia.org/r/94290 (owner: 10QChris) [20:06:11] (03PS2) 10Ottomata: Move geowiki's name for research MySQL config into separate variable [operations/puppet] - 10https://gerrit.wikimedia.org/r/94291 (owner: 10QChris) [20:06:15] (03CR) 10Ottomata: [C: 032 V: 032] Move geowiki's name for research MySQL config into separate variable [operations/puppet] - 10https://gerrit.wikimedia.org/r/94291 (owner: 10QChris) [20:06:23] (03PS2) 10Ottomata: Move geowiki's name for globaldev MySQL config into separate variable [operations/puppet] - 10https://gerrit.wikimedia.org/r/94292 (owner: 10QChris) [20:06:28] (03CR) 10Ottomata: [C: 032 V: 032] Move geowiki's name for globaldev MySQL config into separate variable [operations/puppet] - 10https://gerrit.wikimedia.org/r/94292 (owner: 10QChris) [20:06:37] (03PS2) 10Ottomata: Split geowiki paths in base path and scripts [operations/puppet] - 10https://gerrit.wikimedia.org/r/94293 (owner: 10QChris) [20:06:47] (03CR) 10Ottomata: [C: 032 V: 032] Split geowiki paths in base path and scripts [operations/puppet] - 10https://gerrit.wikimedia.org/r/94293 (owner: 10QChris) [20:06:54] (03PS2) 10Ottomata: Move geowiki scripts into geowiki's scripts subdirectory [operations/puppet] - 10https://gerrit.wikimedia.org/r/94294 (owner: 10QChris) 
[20:06:59] (03CR) 10Ottomata: [C: 032 V: 032] Move geowiki scripts into geowiki's scripts subdirectory [operations/puppet] - 10https://gerrit.wikimedia.org/r/94294 (owner: 10QChris) [20:07:08] huh [20:07:10] (03PS2) 10Ottomata: Rename geowiki backups to logs, as they are only logs [operations/puppet] - 10https://gerrit.wikimedia.org/r/94295 (owner: 10QChris) [20:07:12] second time I'm seeing this msnbot IP [20:07:14] ^d: I think it's just normal log pages [20:07:17] in a different shard [20:07:42] (03CR) 10Ottomata: [C: 032 V: 032] Rename geowiki backups to logs, as they are only logs [operations/puppet] - 10https://gerrit.wikimedia.org/r/94295 (owner: 10QChris) [20:07:49] <^d> AaronSchulz: That's what I thought. [20:07:53] https://en.wikipedia.org/wiki/Special:Log/Aaron_Schulz [20:08:01] (03PS2) 10Ottomata: Move geowiki data checkout into geowiki's base directory [operations/puppet] - 10https://gerrit.wikimedia.org/r/94296 (owner: 10QChris) [20:08:06] (03CR) 10Ottomata: [C: 032 V: 032] Move geowiki data checkout into geowiki's base directory [operations/puppet] - 10https://gerrit.wikimedia.org/r/94296 (owner: 10QChris) [20:08:12] heh, well that's still faster than it was last month ;) [20:08:14] (03PS2) 10Ottomata: Split geowiki data repository into private and public parts [operations/puppet] - 10https://gerrit.wikimedia.org/r/94297 (owner: 10QChris) [20:08:18] (03CR) 10Ottomata: [C: 032 V: 032] Split geowiki data repository into private and public parts [operations/puppet] - 10https://gerrit.wikimedia.org/r/94297 (owner: 10QChris) [20:08:26] (03PS2) 10Ottomata: Turn on generating geowiki's limn files [operations/puppet] - 10https://gerrit.wikimedia.org/r/94298 (owner: 10QChris) [20:08:33] (03CR) 10Ottomata: [C: 032 V: 032] Turn on generating geowiki's limn files [operations/puppet] - 10https://gerrit.wikimedia.org/r/94298 (owner: 10QChris) [20:08:41] (03PS3) 10Ottomata: Rsync geowiki's bare data-private repository to statistics' webservers [operations/puppet] - 
10https://gerrit.wikimedia.org/r/94299 (owner: 10QChris) [20:08:46] (03CR) 10Ottomata: [C: 032 V: 032] Rsync geowiki's bare data-private repository to statistics' webservers [operations/puppet] - 10https://gerrit.wikimedia.org/r/94299 (owner: 10QChris) [20:08:53] (03PS3) 10Ottomata: Checkout geowiki's data-private repo also on statistics' webservers [operations/puppet] - 10https://gerrit.wikimedia.org/r/94300 (owner: 10QChris) [20:08:58] (03CR) 10Ottomata: [C: 032 V: 032] Checkout geowiki's data-private repo also on statistics' webservers [operations/puppet] - 10https://gerrit.wikimedia.org/r/94300 (owner: 10QChris) [20:09:14] gwicke: it's better to briefly mention what it is [20:11:17] paravoid: the msg for asw-c is due to the lack of redundancy on asw-c7 atm [20:12:09] (03CR) 10John F. Lewis: [C: 031] "Per re closure of the RfC and update of the commit, all is in order." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/92070 (owner: 10Vogone) [20:17:16] AaronSchulz: the http requests from our jobs don't seem to reach the varnishes any more [20:17:20] (03PS1) 10Odder: (bug 56760) Update logo for Korean Wikibooks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94413 [20:17:30] since about 24 hours ago [20:18:32] they all seem to finish successfully within a very short time in the runJobs.php log [20:19:06] I suspected a use of $wgParsoidSkipRatio, but as far as I can tell that is still zero in production [20:20:19] (03PS1) 10QChris: Initialize geowiki_base_path for geowiki's private data repo [operations/puppet] - 10https://gerrit.wikimedia.org/r/94415 [20:20:24] PROBLEM - Host mw72 is DOWN: PING CRITICAL - Packet loss = 100% [20:20:59] (03CR) 10Ottomata: [C: 032 V: 032] Initialize geowiki_base_path for geowiki's private data repo [operations/puppet] - 10https://gerrit.wikimedia.org/r/94415 (owner: 10QChris) [20:21:04] ori-l: https://ishmael.wikimedia.org/more.php?hours=24&host=db1052&checksum=2279768281867546317 [20:21:24] RECOVERY 
- Host mw72 is UP: PING OK - Packet loss = 0%, RTA = 35.39 ms [20:21:35] the query I was talking about before [20:21:43] (03PS2) 10Dzahn: move IRC stuff to module - WIP [operations/puppet] - 10https://gerrit.wikimedia.org/r/94407 [20:22:26] 85% (was 90% before) of db1052's time [20:24:13] it looks like includes/api/ApiQueryLogEvents.php ? [20:25:17] (03PS1) 10Odder: (bug 56761) Add shortcut for NS_PROJECT for kowiktionary [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94417 [20:27:23] gwicke: do you actually check the http status for anything in the job? [20:28:15] * AaronSchulz sees no profiling either [20:28:17] (03PS2) 10Dzahn: download server module and cleanup - WIP [operations/puppet] - 10https://gerrit.wikimedia.org/r/94408 [20:28:22] ori-l: I think it's more likely includes/ChangeTags.php due to the `LEFT JOIN `tag_summary` ON ((ts_log_id=log_id))` clause. [20:28:52] 536184595 wikiuser 10.64.32.55:52792 dewiki Query 208 Sending data SELECT /* IndexPager::buildQueryInfo (LogPager) 157.55.32.209 */ log_id,log_type,log_action,log_timestamp,log_user,log_user_text,log_namespace,log_title,log_comment,log_params,log_deleted,user_id,user_name,user_editcount,ts_tags FROM `logging` LEFT JOIN `user` ON ((log_user=user_id)) LEFT JOIN `tag_summary` ON ((ts_log_id=log_id)) WHERE (log_t [20:28:59] is one of the ones I captured before [20:29:07] the IP is msnbot's, don't worry about leaking it [20:30:13] (03PS1) 10QChris: Rsync geowiki's data-private bare repo as $geowiki_user [operations/puppet] - 10https://gerrit.wikimedia.org/r/94418 [20:30:22] eh, msnbot crawls special pages? [20:30:43] shouldn't it be prohibited by robots.txt? [20:30:43] ori-l: The query with IndexPage::buildQueryInfo in the comment comes from LogPager, with the bit bd808 mentioned coming from a call to ChangeTags. The API would have ApiQueryLogEvents::execute in the comment instead. 
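The robots.txt question above can be checked mechanically rather than by eye. The sketch below uses Python's standard `urllib.robotparser` to ask whether a given user agent may fetch a given URL under a set of rules. The rules in `SAMPLE_ROBOTS` are a made-up example, not the file that was deployed at the time, and `is_allowed` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; not the production robots.txt.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /wiki/Special:
Disallow: /w/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "https://en.wikipedia.org/wiki/Special:Log/Aaron_Schulz",
        "https://en.wikipedia.org/wiki/Main_Page",
    ):
        print(url, is_allowed(SAMPLE_ROBOTS, "msnbot", url))
```

Feeding the real file's contents in place of `SAMPLE_ROBOTS` would show whether log pages are actually covered by a Disallow rule for that crawler; a well-behaved bot honoring the rules is still an assumption.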
[20:30:45] AaronSchulz: since the requests never seem to arrive I am suspecting some network or curl library issue [20:31:27] (03PS1) 10Jforrester: Make VisualEditor namespaces extend, not replace, default [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94420 [20:32:06] AaronSchulz: was there any change in that regard on the job runners ~24 hours ago? [20:32:15] (03PS4) 10Jforrester: Enable VisualEditor on board, collab, office and wikimaniateam wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93780 [20:32:37] not that I know of [20:32:47] (03CR) 10Ottomata: [C: 032 V: 032] Rsync geowiki's data-private bare repo as $geowiki_user [operations/puppet] - 10https://gerrit.wikimedia.org/r/94418 (owner: 10QChris) [20:32:51] AaronSchulz: k [20:33:20] would be nice to have decent error logging in the job though [20:34:52] MaxSem [20:35:02] yep? [20:35:03] i think i saw some change to robots.txt stuff yesterday. [20:35:09] looking [20:35:36] 19:26 logmsgbot: reedy synchronized w/robots.php [20:35:40] hmm. maybe just thtat [20:37:01] https://gerrit.wikimedia.org/r/#/c/92566/2/w/robots.php [20:37:12] AaronSchulz: yeah, might look into that if I don't find a better solution [20:37:19] $robotsfile = "/apache/common/robots.txt"; -> $robotsfile = '/usr/local/apache/common/robots.txt'; [20:37:30] and/or if we end up using this with storage too [20:37:44] AaronSchulz: is there a list of job runner servers somewhere? [20:38:00] would be interesting to do some test requests to the parsoid cache from there [20:38:45] (03CR) 10Chad: "(1 comment)" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/92566 (owner: 10Reedy) [20:40:29] MaxSem: well, looked like something to check but nevermind, i looked on a random server and those are identical. 
though none of them disallow msnbot [20:44:15] AaronSchulz: tried ab with dsh -g job-runners, that seems to work [20:49:21] servers are mw1001-1016, which should be in that group [20:55:19] AaronSchulz: thanks [20:56:25] (03PS1) 10QChris: Clone geowiki repos for geowiki's group [operations/puppet] - 10https://gerrit.wikimedia.org/r/94422 [20:57:47] (03PS3) 10coren: Various changes to wikidatawiki's user rights configuration [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/92070 (owner: 10Vogone) [20:59:02] (03CR) 10coren: [C: 032] "LGTM, deploying." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/92070 (owner: 10Vogone) [21:00:40] !log marc synchronized wmf-config/InitialiseSettings.php 'Bug: 56203 - Various changes to wikidatawiki's user rights configuration' [21:00:41] (03CR) 10Chad: "(1 comment)" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/92070 (owner: 10Vogone) [21:00:48] * ^d sighs [21:00:57] Logged the message, Master [21:02:57] (03PS1) 10Dzahn: remove db61 [operations/dns] - 10https://gerrit.wikimedia.org/r/94426 [21:05:02] (03PS1) 10Dzahn: remove db61 from DHCP [operations/puppet] - 10https://gerrit.wikimedia.org/r/94427 [21:06:28] PROBLEM - Backend Squid HTTP on sq48 is CRITICAL: Connection refused [21:10:29] (03PS1) 10Ori.livneh: logmsg-git-hook: fix commit determination logic & run on new MW branch creation [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94429 [21:12:15] ^d: all lies! it's gerrit's fault! [21:15:55] AaronSchulz: found the issue, was my buggy port of your curl perf fix [21:15:56] (03CR) 10Ottomata: [C: 032 V: 032] Clone geowiki repos for geowiki's group [operations/puppet] - 10https://gerrit.wikimedia.org/r/94422 (owner: 10QChris) [21:22:20] !log marc synchronized wmf-config/InitialiseSettings.php 'Bug: 56203 - Various changes to wikidatawiki's user rights configuration (for real this time)' [21:22:43] Logged the message, Master [21:23:15] "But that trick never works! 
-- This time for real!" [21:23:26] (a.k.a.: pull before sync-file) [21:23:56] (03PS1) 10Ottomata: Adding trailing / to geowiki private data bare paths in rsync command [operations/puppet] - 10https://gerrit.wikimedia.org/r/94432 [21:24:13] (03PS2) 10Ottomata: Adding trailing / to geowiki private data bare paths in rsync command [operations/puppet] - 10https://gerrit.wikimedia.org/r/94432 [21:24:19] <^d> Coren: Been there. Any time you see me do that it's because I forget to merge after fetching :p [21:24:28] (03CR) 10Ottomata: [C: 032 V: 032] Adding trailing / to geowiki private data bare paths in rsync command [operations/puppet] - 10https://gerrit.wikimedia.org/r/94432 (owner: 10Ottomata) [21:31:50] (03PS1) 10Odder: (bug 56412) Make all sidebar phrases on Planet translatable [operations/puppet] - 10https://gerrit.wikimedia.org/r/94433 [21:32:24] (03CR) 10jenkins-bot: [V: 04-1] (bug 56412) Make all sidebar phrases on Planet translatable [operations/puppet] - 10https://gerrit.wikimedia.org/r/94433 (owner: 10Odder) [21:33:47] (03CR) 10Ori.livneh: [C: 032] logmsg-git-hook: fix commit determination logic & run on new MW branch creation [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94429 (owner: 10Ori.livneh) [21:37:41] another self merge [21:38:17] !log ori updated /a/common to {{Gerrit|I30f4e5975}}: logmsg-git-hook: fix commit determination logic & run on new MW branch creation [21:38:36] Logged the message, Master [21:38:44] greg-g: are you keeping score? [21:39:00] ori-l: go write your name on the board [21:39:27] (03PS2) 10Odder: (bug 56412) Make all sidebar phrases on Planet translatable [operations/puppet] - 10https://gerrit.wikimedia.org/r/94433 [21:39:59] (03CR) 10jenkins-bot: [V: 04-1] (bug 56412) Make all sidebar phrases on Planet translatable [operations/puppet] - 10https://gerrit.wikimedia.org/r/94433 (owner: 10Odder) [21:41:46] <^d> ori-l, greg-g: If we're counting, doesn't everything from SVN days count as a self-merge? 
;-) [21:42:54] (PS3) Odder: (bug 56412) Make all sidebar phrases on Planet translatable [operations/puppet] - https://gerrit.wikimedia.org/r/94433 [21:43:22] ^d: counter was reset with git, luckily [21:43:37] <^d> Nofair. [21:45:56] (PS4) Odder: (bug 56412) Make all sidebar phrases on Planet translatable [operations/puppet] - https://gerrit.wikimedia.org/r/94433 [22:00:26] (PS1) Odder: (bug 56807) Localize logo for Welsh Wikisource [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/94438 [22:00:58] PROBLEM - Puppet freshness on sq80 is CRITICAL: No successful Puppet run in the last 10 hours [22:09:03] osm-web1-4 can really be wiped, right? they are existent but shutdown and there is osm-web100x, at least in DNS [22:10:02] mutante, bblack should know [22:10:26] Eloquence: kk, thanks [22:10:31] bblack: ^ [22:12:48] mutante: in eqiad? [22:13:49] I think osm-web[1234] are in pmtpa in DNS, and yeah I donno what those are but I'm not using them [22:13:54] bblack: nah, it's about wiping Tampa [22:13:58] so we have dupes of them in pmtpa [22:14:11] I'm just using the 4-numbered ones in eqiad [22:14:15] they are shutdown but mgmt is still there, just making sure chris can de-rack [22:14:16] they prolly dont need wipe, cuz they prolly never got an OS [22:14:19] and not care about data [22:14:24] they should unrack and ship to new datacenter [22:14:29] that^ [22:14:33] but stay slated for OSM use [22:14:37] ok, thanks [22:14:42] cmjohnson1: ^ [22:14:46] (so their names will change to osm XXXX whatever we set new range [22:14:47] ) [22:15:01] RT 6256 [22:15:14] where XXXX is consecutive 4-digit sequences from the digits of pi [22:15:24] :) [22:15:33] bblack: our standards arent that clear. [22:16:12] prolly 2k range [22:16:16] since we reserved it for new site. [22:16:24] cmjohnson1: sounds like just put them in a box [22:16:29] https://wikitech.wikimedia.org/wiki/Server_naming_conventions [22:16:47] so sez robh namer of servers!
[22:17:18] while we are at it, you guys fight over what the new misc server standard will be in new dc. [22:17:29] just keep in mind it cannot be named after living folks. [22:18:08] and should scale to 150 or so names at minimum [22:18:37] classic but good, astronomy https://en.wikipedia.org/wiki/List_of_proper_names_of_stars [22:19:26] im cool with that as long as we have a list i can work off of in a specific order [22:19:30] so if other opsen dont bitch [22:19:32] its gonna be that [22:19:40] (better suggest something else if you hate it folks) [22:19:51] order by messier number [22:20:03] https://en.wikipedia.org/wiki/List_of_Messier_objects [22:20:20] but those dont all have common names [22:20:21] arg, if they have common names :P [22:20:27] and if we are going to go with # then asset tag! [22:20:44] i am cool iwth asset tag, but you damned humans and your need for 'names' [22:20:51] a serial # is a good name. [22:21:54] <^d> How about we name them after famous encyclopedists. [22:22:01] ^d: already taken [22:22:05] we did [22:22:20] i think we should go with binary names [22:22:21] 0110001101101000011100100110100101110011 [22:22:30] <^d> mutante: irc needs sarcasm tags ;-) [22:22:34] is that ascii? [22:22:39] ^d: fuck you. [22:22:44] =] [22:22:55] see my sarcasm tag? ;] [22:23:02] <^d> Yep, I saw it. [22:23:05] wow, is that a MatthewARoth on IRC? [22:23:08] * jeremyb blinks [22:23:12] who didnt love eiximenis? [22:23:15] i loved that server! [22:23:18] sanger [22:23:20] it was awesome [22:23:21] <^d> I loved eiximenis. [22:23:26] mutante: blame mark [22:23:28] And erzurumi (sp?) too [22:23:30] he named the mail servers [22:23:37] RoanKattouw: you spelled it right, see its not hard. [22:23:51] <^d> I will forever know how to spell eiximenis now. 
[22:23:59] I think I can even spell praseodymium now [22:24:03] im gonna name my next laptop eiximenis [22:24:14] * bd808 is used to servers named r01-473 and r02-517 [22:24:16] root@box-we-bought-on-2013-11-23 [22:24:26] bd808: hey i wanted to name them after our unique wmf asset tag [22:24:26] ^d: isn't it spelled "etherpad"? [22:24:32] so everythign would be WMFXXXX [22:24:36] but nooo [22:24:54] RobH: i wonder how many dr who alien species exist.... [22:25:01] p858snake|l: veto! [22:25:05] i dont like dr who. [22:25:10] <^d> Planets in Star Wars? [22:25:13] * RobH prepares for the coming fight [22:25:34] heh, i actually dont care ;] [22:25:37] <^d> Pokemon? There's > 700 now [22:25:41] if someone provided me a list of them, it works. [22:25:49] eh, pokemon is not good [22:25:56] i dont like going near anyone lawsuit happy for names [22:26:05] pretty sure bbc would just think its neat. [22:26:08] RobH: we have this thing called a wiki, which has a list >.> [22:26:20] yes, but im not willing to go generate it ;] [22:26:26] <^d> [[List of Pokémon]] :) [22:26:30] you could use wikidata for that too! [22:26:31] i dont want to hear complaints like for eiximenis! [22:26:44] MaxSem: in the airport [22:26:49] The rNN-XXX scheme was {rack number}-{vlan}{instance} or something like that. I had a lot of aliases in my ssh-config :) [22:26:50] before we do pokemons i'll go with IRC acronyms: lol, rofl and lmfao [22:27:03] YuviPanda, ah - fly safely then:) [22:27:05] someone crashed our bastion, omgwtfbbq! [22:27:15] ping roflcopter [22:27:29] outages are so much funnier now. [22:27:35] :p [22:27:39] kthxbai [22:27:45] heh [22:27:47] :) [22:27:52] MaxSem: did UW cause the outage? 
[22:27:53] <^d> lolololol [22:28:00] <^d> ^ That's a server, not my response [22:28:09] oh man, that cluster of lol lolol and lololol are confusing [22:28:11] http://db.debian.org/machines.cgi [22:28:17] there's an asdfasdf [22:28:17] YuviPanda, no - but I discoverd that upload counting is brutally slow, it will be disabled [22:28:28] do we really want .wmnet ? /me hides [22:28:29] jeremyb: not a d.o machine [22:28:29] jeremyb: thats my password! [22:28:43] no one login to officewiki as me now ;] [22:28:43] paravoid: it's on the list... [22:28:43] we could start a rfc on mw wiki >.> *runs [22:28:46] I had an admin who used Go terms for naming at another gig. Aji, Atari, Kakari, Seki, ... [22:28:48] MaxSem: hmm, okay! I did build it with a kill switch... [22:28:55] jeremyb: look closer [22:29:11] paravoid: i see it's .net. but still it's on the list [22:29:17] MaxSem: wait, *upload* counting shouldn't be slow. *Uploaders* counting would be slow [22:29:30] MaxSem: i saw the gerrit patchset, that'll disable just uploaders counting [22:29:30] err, yes [22:29:34] ok :) [22:29:34] RobH: go VillagePump :) [22:29:46] jeremyb: the point is, debian.org machines *do* follow a convention, but this isn't one of them [22:30:00] out of scope of village pump [22:30:02] at best, wikitech. [22:30:12] but, i dont wanna take every person on the internets viewpoint [22:30:13] jeremyb: I know! I thought the same thing. [22:30:18] only the witty ones offered in here. [22:30:18] paravoid: what's the convention? [22:30:38] classical music composers [22:30:45] huh [22:30:57] with an attempt to hint the purpose from the name [22:31:14] so the lists machine used to be liszt [22:31:23] one of the MXs is mailly [22:31:30] hah [22:31:40] arm boxes start from "a" [22:31:58] amd also starts with a? 
[22:31:58] arnold, arne, argento, arcadelt [22:32:01] i saw one that does [22:32:05] "ar" [22:32:24] agricola is not ar [22:32:29] I know [22:32:33] these isn't a hard rule [22:32:37] (the composers is, though) [22:32:55] well, even that is a lie, the bytemark blades don't follow that [22:33:28] (bm-blN) [22:33:42] do ganeti guests have a scheme? [22:33:45] no [22:34:10] UCC name some of after FIsh starting with the letter "M" http://wiki.ucc.asn.au/Nomenclature (They also have a internet connected coke machine, which must count somewhere...) [22:34:13] <^d> RobH: Crayola colors? :p [22:34:25] sounds good to me [22:34:31] take classic 200 count box of crayons. [22:34:44] ^d: "stop acting like a dick on the black box" [22:34:46] increment by RGB [22:34:47] heh [22:34:50] <^d> https://en.wikipedia.org/wiki/List_of_Crayola_crayon_colors [22:34:51] might not sound the best, etc [22:35:03] we can make any of it sound bad [22:35:06] thats not even hard ;] [22:35:25] ^d: im adding this to the page as your suggestion [22:35:26] ^d: did they ever finish the edit war on that page, over the colour values? [22:35:47] ^d: a picture for outage report would be like this http://www.collegegloss.com/2012/03/wall-art-diy_15.html [22:35:54] <^d> p858snake|l: Hell if I know :p [22:36:10] are any edit wars notable? [22:36:49] picks "Banana Mania" first [22:37:02] https://wikitech.wikimedia.org/wiki/Talk:Server_naming_conventions [22:37:23] <^d> I call Purple Mountain's Majesty [22:37:41] mutante: because that shit is banana? [22:37:51] hrmm [22:37:59] i wanna say 'adventure time characters' [22:37:59] Macaroni and Cheese is #FFBD88 [22:38:02] but there arent enough [22:38:29] wait [22:38:30] http://adventuretime.wikia.com/wiki/Category:Characters [22:38:32] yes. [22:38:33] <^d> We can't use Pokémon but we can use Adventure Time? :p [22:38:41] adventure time characters willb e next standard! [22:38:50] mathmatical! [22:38:59] (no one watches but me, im ok with that.) 
[22:39:06] Monsters from MMPR? [22:39:23] RobH: I watched once, but i'm still confused... [22:39:40] you should watch from first episode [22:39:47] then its only slightly less confusing. [22:39:57] (slightly) [22:40:16] is there enough dinosaurs yet? [22:40:57] risky [22:41:10] since they eliminate those regularly as they found out they are different ages of same one [22:41:12] http://www.hp-lexicon.org/wizworld/beans.html [22:41:14] <^d> RobH: Nobody watches but you? *A-HEM* [22:41:31] <^d> We could name servers after buzzwords. [22:41:37] cloud is down. [22:41:40] the flavors would be "sardine, black pepper, grass, horseradish, vomit, booger, earwax, dirt, earthworm, spaghetti, spinach, soap, sausage, pickle, bacon, and rotten egg" [22:41:46] ok, time for break, bbl [22:41:49] <^d> "agile" "cloud" "web-two-point-oh" [22:41:54] <^d> :) [22:41:58] !log Shut down synergy [22:42:14] http://www.youtube.com/watch?v=Tx1XIm6q4r4 [22:42:17] <^d> <^d> !log agile is being slow [22:42:18] I see it's Friday afternoon [22:42:24] I want a Big Data server. A really Big one. [22:42:40] <^d> We should've called the netapp big-data :p [22:42:47] !log Created account for Erik on bigger-data [22:43:01] ^d: db -> webscale [22:43:16] <^d> The fun part is when you or I forget to put our nick in front and accidentally log the joke. [22:43:20] <^d> RoanKattouw: ^ [22:43:33] yup [22:46:32] * greg-g still likes https://wikitech.wikimedia.org/w/index.php?title=Server_admin_log%2FArchive_22&diff=74450&oldid=74448 [22:49:16] hehe [22:57:55] movie references (from a pool of any movies), eg: burnbook for a log collector... 
[23:00:26] (CR) Dr0ptp4kt: "Another manual walkthrough suggests that if the key was set to 'wikipedia', the array merge operation wouldn't pull up anything because of" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/94250 (owner: Dr0ptp4kt) [23:03:10] I wonder if we have a [[List of cancers]] yet..., although that might be a tad out there for server names [23:08:31] heh, the status box on the right is a bit outdated (along with the page, probably): https://wikitech.wikimedia.org/wiki/Swift [23:11:27] greg-g: most of our releases are just compatibility tweaks, without new features [23:11:51] gwicke: yeah, parsoid is a bit odd, there :) [23:12:12] so you'd get an empty summary most of the time [23:12:28] that's not fun [23:12:39] the shortlog is more interesting [23:13:00] k, shortlog + human english summary (I might not know how to translate it myself) [23:13:10] (just a summary of the important bits, obviously) [23:14:18] greg-g: the other problem is that a week in advance that log will still be empty [23:14:29] and we won't know yet what we'll get done in the week ahead [23:14:33] well, right, so, human english summary of what you plan to have done [23:14:50] I assume you have an idea of what you're going to work on, generally :) [23:14:58] yeah, typically bugs [23:15:10] even better [23:15:16] bug #s are great [23:15:19] 'bug fixes' ;) [23:15:25] ;) [23:15:53] which get done is often hard to say in advance [23:17:07] well, right, best laid plans and all [23:18:53] greg-g: maybe we should link to the hash from the deploy page in addition to the server admin log [23:18:59] or link to the server admin log in general [23:19:20] well, those are expo facto things :/ [23:19:25] I consider the latter more reliable information on what actually happened rather than what is planned [23:19:51] so, the thing is I'm looking for what is planned.
I assume Monday morning isn't a complete black box to you of what you'll work on [23:20:33] greg-g: we have mostly areas we are working on [23:20:48] the exact bugs are often discovered while investigating etc [23:21:23] we can try, but I can't promise you that it will have much to do with the actual deploy [23:22:01] gwicke: let's try and iterate, best we can do, really [23:22:16] gwicke: so, what's going to happen, you think, on Tuesday (in 4 days)? :) [23:22:33] greg-g: is there a deploy window? [23:22:35] er, Wed in 5 days [23:23:19] yeah, see last 5 lines of email (I know, I left the most juicy bits for the end) [23:24:06] greg-g: very likely several bug fixes [23:24:45] marc might have done some work on full-stack testing by then [23:24:57] not relevant for the deploy, but still.. [23:25:32] I like knowing that stuff, but yeah, not germane [23:25:54] anyway, we have a meeting each Wednesday where we can try to be more informative than 'bug fixes, performance improvements and clean-up' [23:26:03] greg-g, gwicke we are also now moving towards having wed meetings which sould help some [23:26:07] oh, gwicke said the same :) [23:26:28] oh, interesting [23:26:33] are there notes from that? [23:26:44] greg-g: it is public in #mediawiki-parsoid [23:26:47] you are welcome to attend [23:26:51] no summary/notes? [23:26:56] I can't attend every team's meetings [23:27:13] the log is usually fairly condensed already [23:27:19] it is only 30 minutes [23:27:26] Wed at 10 [23:27:33] gwicke: so, -parsoid is active, I can't reliably read scrollback [23:27:38] help me out here [23:27:38] gwicke, greg-g, i meant to suggest that we could use the meetings to update what we might deploy. [23:27:46] PROBLEM - MySQL Recent Restart on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[23:28:01] yup, that's what I said too [23:28:14] I don't scale that well [23:28:19] I can't attend every team's meetings [23:28:20] greg-g, do you want a summary of what we discussed at the meetings? or more about using that to inform the deploy plan? [23:28:26] i assumed th elatter [23:28:31] yeah, later mostly [23:28:37] ok, so, we have that covered then. [23:28:41] sweet [23:28:46] RECOVERY - MySQL Recent Restart on db1047 is OK: OK 2621101 seconds since restart [23:49:20] andrewbogott, around? [23:50:06] (PS1) Ori.livneh: updateBitsBranchPointers: get rid of 'static-stable' branch link [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/94447 [23:50:47] Eloquence: Briefly -- what's up? [23:51:21] andrewbogott, can you add mwalker and myself to the pediapress project in labs just so we can poke at if needed for the pdf sprint? [23:51:36] yep, one moment... [23:52:40] Eloquence: need projectadmin bit so you can create instances and such? [23:52:53] can't hurt, thanks [23:53:16] ok -- should be all set. [23:53:19] thanks :) [23:53:33] don't know if I trust him with projectadmin... [23:54:04] heh [23:54:18] greg-g: I'll keep a close eye on him :)