[00:04:02] !log krinkle synchronized php-1.23wmf3/resources 'I8704a6620ece44d' [00:04:22] Logged the message, Master [00:04:54] Krinkle: ? [00:05:18] Krinkle: You're in my deploy window ... [00:05:48] :( [00:05:58] bad timo [00:06:10] yo dawg, i heard you like deploys, so i put a deploy in your deploy window so you can deploy while i deploy. [00:06:36] gods, that's not even funny. i should go sleep. [00:06:47] but that joke always SEEMS to make sense [00:06:58] Sorry, forgot to look at the calendar. I intended to do this a few hours back but had it open still. [00:07:33] Krinkle: ping me next time ;) [00:08:43] !log catrope synchronized php-1.23wmf3/extensions/VisualEditor 'add more EventLogging events (https://gerrit.wikimedia.org/r/#/c/94092/ )' [00:09:00] Logged the message, Master [00:09:15] wow, that's a lot [00:09:15] (03PS2) 10Dr0ptp4kt: Load ZeroRatedMobileAccess only where currently supported. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94250 [00:09:38] It's deceptive because every flavor of saveError is a different event [00:09:39] ^MaxSem [00:09:45] er ^^ [00:09:52] Half of that list is just breaking down why exactly the user's save failed [00:10:27] dr0ptp4kt, still 'wikipedia' instead of 'wiki' [00:10:28] (deploy done) [00:10:31] RoanKattouw: I like [00:10:39] MaxSem, ah, ok [00:10:43] MaxSem: But it *is* 'wikipedia' instead of 'wiki' , isn't it? 
[00:10:43] hang on [00:11:00] That's what $site is set to at least [00:11:15] I just read through the relevant code in CommonSettings.php with dr0ptp4kt looking over my shoulder about an hour ago [00:12:49] (PS1) QChris: Turn off geowiki monitoring [operations/puppet] - https://gerrit.wikimedia.org/r/94290 [00:12:50] (PS1) QChris: Move geowiki's name for research MySQL config into separate variable [operations/puppet] - https://gerrit.wikimedia.org/r/94291 [00:12:51] (PS1) QChris: Move geowiki's name for globaldev MySQL config into separate variable [operations/puppet] - https://gerrit.wikimedia.org/r/94292 [00:12:52] (PS1) QChris: Split geowiki paths in base path and scripts [operations/puppet] - https://gerrit.wikimedia.org/r/94293 [00:12:53] (PS1) QChris: Move geowiki scripts into geowiki's scripts subdirectory [operations/puppet] - https://gerrit.wikimedia.org/r/94294 [00:12:54] (PS1) QChris: Rename geowiki backups to logs, as they are only logs [operations/puppet] - https://gerrit.wikimedia.org/r/94295 [00:12:55] (PS1) QChris: Move geowiki data checkout into geowiki's base directory [operations/puppet] - https://gerrit.wikimedia.org/r/94296 [00:12:56] (PS1) QChris: Split geowiki data repository into private and public parts [operations/puppet] - https://gerrit.wikimedia.org/r/94297 [00:12:57] (PS1) QChris: Turn on generating geowiki's limn files [operations/puppet] - https://gerrit.wikimedia.org/r/94298 [00:12:58] (PS1) QChris: Rsync geowiki's bare data-private repository to statistics' webservers [operations/puppet] - https://gerrit.wikimedia.org/r/94299 [00:12:59] (PS1) QChris: Checkout geowiki's data-private repo also on statistics' webservers [operations/puppet] - https://gerrit.wikimedia.org/r/94300 [00:15:23] (PS1) Chad: Move autosetuprebase to where it will actually do something useful [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/94301 [00:15:32] <^d> Reedy: Lol ^
[00:17:48] (03PS1) 10Chad: Use descriptive heredoc [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94303 [00:21:43] !log csteipp synchronized php-1.23wmf3/includes 'bug 55332' [00:21:55] Logged the message, Master [00:21:58] !log Reloading zuul to enable Iaceb016cf7df20 [00:22:11] Logged the message, Master [00:27:42] PROBLEM - Host analytics1012 is DOWN: PING CRITICAL - Packet loss = 100% [00:29:03] PROBLEM - MySQL Slave Running on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:30:52] RECOVERY - MySQL Slave Running on db1047 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [00:34:19] Hm.. Ganglia's dashboards seem to be broken [00:34:19] https://ganglia.wikimedia.org/latest/tasseo.php?view_name=Navigation+Timing# [00:34:23] Uncaught TypeError: Cannot read property 'length' of null [00:35:12] [00:36:02] PROBLEM - Host analytics1011 is DOWN: PING CRITICAL - Packet loss = 100% [00:38:02] RECOVERY - Host analytics1011 is UP: PING OK - Packet loss = 0%, RTA = 0.60 ms [00:39:23] PROBLEM - Host analytics1014 is DOWN: PING CRITICAL - Packet loss = 100% [00:41:22] RECOVERY - Host analytics1014 is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms [00:44:42] PROBLEM - Host analytics1014 is DOWN: PING CRITICAL - Packet loss = 100% [00:45:22] RECOVERY - Host analytics1014 is UP: PING OK - Packet loss = 0%, RTA = 1.41 ms [00:48:56] (03PS1) 10Dzahn: fix pdf servers in dsh groups [operations/puppet] - 10https://gerrit.wikimedia.org/r/94307 [00:50:26] ottomata: upgrading analytics? [00:55:45] What does the "sd" in sdtpa stand for? 
[00:56:52] switch&data [00:57:06] vendor-airport [00:57:34] k [00:57:46] I figured that from esams and equid, but couldn't find out which one it stood for [00:57:49] Krinkle: https://en.wikipedia.org/wiki/Switch_and_Data [00:57:54] acquired by Equinix [00:58:03] which is the eq in eqiad [00:58:17] yeah [00:59:42] PROBLEM - MySQL InnoDB on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:59:42] PROBLEM - MySQL Recent Restart on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:00:20] mutante: Hm.. So what does "pm" stand for? [01:00:24] <^d> power medium [01:00:30] :D [01:00:39] important facts are important [01:00:46] street name? [01:00:52] company name [01:01:02] PROBLEM - MySQL Idle Transactions on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:01:06] [00:57:03] vendor-airport [01:01:07] put it in wikidata :p [01:01:18] ulsfo [01:01:21] united layer-sfo [01:01:27] knams [01:01:28] Reedy: Wait... pmtpa != sdtpa? [01:01:33] <^d> Same building. [01:01:35] kennisnet - esams [01:01:37] different floor [01:01:38] <^d> Different providers. [01:01:38] Different company [01:01:41] different floor [01:01:43] heh [01:01:56] same fibre that can be cut by a lawnmower [01:02:01] haha [01:02:02] PROBLEM - MySQL Slave Running on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:02:28] <^d> yaseo -> crappy colo in seoul [01:02:33] RECOVERY - MySQL InnoDB on db1047 is OK: OK longest blocking idle transaction sleeps for 0 seconds [01:02:52] RECOVERY - MySQL Idle Transactions on db1047 is OK: OK longest blocking idle transaction sleeps for 0 seconds [01:02:52] RECOVERY - MySQL Slave Running on db1047 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [01:03:42] <^d> Reedy: Can I haz merge? [01:04:16] seo is not airport ? 
[01:04:30] ICN [01:04:32] RECOVERY - MySQL Recent Restart on db1047 is OK: OK 2540455 seconds since restart [01:04:50] Krinkle: This is all documented on Meta-Wiki, BTW. [01:04:53] <^d> Yeah probably should've been icn in hindsight :) [01:05:01] Elsie: Not on wikitech, that's where I was looking [01:05:12] at least switch&data and power medium were nowhere mentioned. [01:05:16] https://meta.wikimedia.org/wiki/Wikimedia_servers#Hosting [01:05:28] <^d> We should document the servers on wikitech. [01:05:43] mutante: https://en.wikipedia.org/wiki/S%C3%A9gu%C3%A9la_Airport is not particularly close to Seoul [01:05:53] ^d: That's been tried. It failed. [01:06:10] PROBLEM - MySQL Slave Running on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:07:01] PROBLEM - MySQL Recent Restart on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:07:10] PROBLEM - Host analytics1011 is DOWN: PING CRITICAL - Packet loss = 100% [01:07:26] Elsie: ^d : wait. https://wikitech.wikimedia.org/w/index.php?limit=50&tagfilter=&title=Special%3AContributions&contribs=user&target=Dzahn&namespace=&tagfilter=&year=2013&month=10 [01:07:50] RECOVERY - MySQL Recent Restart on db1047 is OK: OK 2540653 seconds since restart [01:07:59] RoanKattouw: yea, indeed. didn't follow the convention [01:08:23] https://wikitech.wikimedia.org/wiki/Special:Log/delete/RobH [01:08:35] Though I thought there were more deletions... [01:09:04] (03PS3) 10Dr0ptp4kt: Load ZeroRatedMobileAccess only where currently supported. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94250 [01:09:40] RECOVERY - Host analytics1011 is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms [01:09:44] Elsie: it failed because we had no good system for doing so [01:10:00] RECOVERY - MySQL Slave Running on db1047 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [01:10:04] I thought there was some other inventory system that was being used. 
[01:10:08] PowerRack or something? [01:10:13] we're using racktables and we all hate t [01:10:15] *it [01:10:19] Ah, that was it. [01:10:21] PROBLEM - Host analytics1013 is DOWN: PING CRITICAL - Packet loss = 100% [01:10:28] So back to the wiki? :-) [01:10:43] I'd move it there, but I don't have time to set up a reasonable SMW system for it [01:10:59] I think others are looking at some other systems [01:11:14] i agree we shouldn't delete pages [01:11:23] they should be moved to something like archive [01:11:26] if there is the need to [01:11:38] we have an archive namespace, I think [01:11:42] but there are concerns that they show up in searches [01:11:55] which i think is a good thing [01:11:57] and that namespace is left out of the default search [01:12:10] as long as you can tell it's just history vs. current info [01:12:10] RECOVERY - Host analytics1013 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [01:12:12] (03CR) 10Dr0ptp4kt: [C: 04-2] "Let's discuss whether 'wikipedia' or 'wiki' is more correct." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94250 (owner: 10Dr0ptp4kt) [01:12:18] Ryan_Lane: sounds good, yea [01:14:17] greg-g: for the code deploy dashboard... [01:14:40] trebuchet writes its deployment info into redis [01:14:58] if we change its schema some we can track each deployment separately by tag [01:15:19] as well as the deployment message that went along with it [01:15:26] then a dashboard could just read from redis [01:16:02] currently each deployment for a repo overwrites the data from the last, to make things simpler [01:16:30] hm. or does it? did I change that [01:16:35] I need to document the schema being used [01:17:05] (03CR) 10Chad: "It's wiki, not wikipedia." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94250 (owner: 10Dr0ptp4kt) [01:17:23] if we are not going to use wikitech, and we don't want google docs, and we dont want bugzilla and RT isn't enough and we cant agree on using dsh groups.. 
i'm at a loss here how to track this [01:17:27] bbl [01:17:51] hm, I should also put the schema/attribute mapping in a pillar so that it can be changed without needing to modify the code everywhere [01:18:07] mutante: how to track what? [01:18:10] servers? [01:19:02] yes, and specifically the ones left in tampa and which can be really decom [01:19:09] ah [01:19:14] salt grains? :) [01:19:26] you can add/remove them via a command [01:20:02] then you can also target them using salt [01:20:07] it wouldnt cover pdf and hardy and [01:20:09] (03CR) 10Ori.livneh: [C: 032] Update static-current symlinks to 1.23wmf3 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94277 (owner: 10Ori.livneh) [01:20:12] ugh [01:20:13] right [01:20:29] I forgot we still have hardy systems [01:20:51] well, this is not that outdated now https://wikitech.wikimedia.org/wiki/Tampa_cluster#Misc._Services_Pending_Migration [01:20:58] but i need to get food [01:22:17] <^d> Nobody ever e-mailed ops about formey decom. [01:22:28] <^d> I remember someone saying that the other day. 
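A minimal sketch of the per-tag deployment tracking idea discussed above, assuming a hypothetical schema: the log does not show trebuchet's actual redis keys, so the class name, field names, and tags here are illustrative only, and a plain in-memory dict stands in for redis. The point is just that keying each record by (repo, tag) keeps every deployment instead of overwriting the previous one, so a dashboard could read the full history.

```python
import json


class DeploymentLog:
    """In-memory stand-in for the redis store (hypothetical schema)."""

    def __init__(self):
        self._records = {}  # (repo, tag) -> JSON-encoded deployment record
        self._tags = {}     # repo -> list of tags in deployment order

    def record(self, repo, tag, user, message):
        """Store one deployment under its own (repo, tag) key."""
        self._records[(repo, tag)] = json.dumps(
            {"repo": repo, "tag": tag, "user": user, "message": message}
        )
        tags = self._tags.setdefault(repo, [])
        if tag not in tags:
            tags.append(tag)

    def history(self, repo):
        """Return every recorded deployment for a repo, oldest first."""
        return [
            json.loads(self._records[(repo, tag)])
            for tag in self._tags.get(repo, [])
        ]


log = DeploymentLog()
# Tag names below are made up for the example.
log.record("mediawiki-config", "sync-0001", "catrope", "fix logo path")
log.record("mediawiki-config", "sync-0002", "ori", "update static-current symlinks")
history = log.history("mediawiki-config")
```

Keeping the key layout in one place, as the pillar suggestion implies, would let the mapping change without touching every caller.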
[01:22:33] <^d> *ops list [01:26:00] PROBLEM - Host analytics1014 is DOWN: PING CRITICAL - Packet loss = 100% [01:26:42] (03CR) 10Catrope: [C: 032] Move autosetuprebase to where it will actually do something useful [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94301 (owner: 10Chad) [01:26:53] (03Merged) 10jenkins-bot: Move autosetuprebase to where it will actually do something useful [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94301 (owner: 10Chad) [01:27:20] RECOVERY - Host analytics1014 is UP: PING OK - Packet loss = 0%, RTA = 1.19 ms [01:33:01] !log catrope synchronized wmf-config/InitialiseSettings.php 'fix logo path for wikimania2014wiki' [01:33:01] !log adding temporary index to S5 wikidatawiki.wb_terms for slow queries [01:33:20] Logged the message, Master [01:33:36] Logged the message, Master [01:33:56] !log catrope synchronized docroot/bits/static-current/ 'update static-current symlinks to wmf3' [01:34:15] Logged the message, Master [01:34:36] RoanKattouw: oops, thanks [01:35:26] ori-l: Not your fault, +2ers are responsible for insta-deploying in mw-config [01:35:49] ori-l: Wait, you were the +2er. 
Sorry, it was your fault ;) [01:46:05] ori updated common to I1e6c5a3b8 [01:46:27] (^ testing) [01:53:15] <^d> !log tin: set git config branch.*.rebase true for all deployed branches [01:53:24] <^d> RoanKattouw_away: So not just fixed in future, fixed now ^ [01:53:31] Logged the message, Master [01:54:00] !log ori updated common to I1e6c5a3b8: Fix how 'current' branch is determined in updateBitsBranchPointers [01:54:19] Logged the message, Master [01:54:31] (^ also a test; I'll remove from the SAL) [02:03:13] (PS1) Ori.livneh: Revert "Update RC2UDP config to use $wgRCFeeds" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/94312 [02:03:45] (Abandoned) Chad: Revert "Update RC2UDP config to use $wgRCFeeds" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/94312 (owner: Ori.livneh) [02:04:56] (PS1) Chad: Don't enable RC to UDP feeds for labs [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/94313 [02:06:33] (PS2) Chad: Don't enable RC to UDP feeds for labs [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/94313 [02:06:45] (CR) Chad: [C: 2 V: 2] Don't enable RC to UDP feeds for labs [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/94313 (owner: Chad) [02:09:47] !log LocalisationUpdate completed (1.23wmf2) at Fri Nov 8 02:09:47 UTC 2013 [02:10:03] Logged the message, Master [02:14:41] Oh [02:14:47] Does labs set something else to false? [02:17:42] PROBLEM - Host analytics1013 is DOWN: PING CRITICAL - Packet loss = 100% [02:19:02] RECOVERY - Host analytics1013 is UP: PING OK - Packet loss = 0%, RTA = 0.41 ms [02:23:40] !log deployed Parsoid 67fca5bdc7 [02:23:58] Logged the message, Master [02:25:34] <^d> Reedy: It left $wgRC2UDPPrefix at the default of false. [02:26:15] <^d> You changed the trigger to be a new global $wmgUseRC2UDP, and didn't set it to false for labs.
[02:29:00] !log LocalisationUpdate completed (1.23wmf3) at Fri Nov 8 02:28:59 UTC 2013 [02:29:16] Logged the message, Master [02:31:39] ^d: shouldn't beta labs be incapable of pushing to production udp in the first place? [02:31:48] <^d> One would think. [02:33:26] There's still that bug about beta labs sending emails to the new projects list [02:33:50] https://bugzilla.wikimedia.org/show_bug.cgi?id=48786 [02:33:58] is the sender address different ? [02:34:19] the list admin could just block that sender [02:34:23] in recipient filters [02:34:57] ah, already has that comment on it [02:35:52] "it is just about adapting the notifyNewProjects to [02:35:53] have it using a different email." [02:36:52] PROBLEM - Host analytics1014 is DOWN: PING CRITICAL - Packet loss = 100% [02:38:22] RECOVERY - Host analytics1014 is UP: PING OK - Packet loss = 0%, RTA = 3.50 ms [02:49:02] PROBLEM - Host analytics1011 is DOWN: CRITICAL - Plugin timed out after 15 seconds [02:50:02] RECOVERY - Host analytics1011 is UP: PING OK - Packet loss = 0%, RTA = 0.39 ms [02:57:42] PROBLEM - Host analytics1014 is DOWN: PING CRITICAL - Packet loss = 100% [02:59:22] RECOVERY - Host analytics1014 is UP: PING OK - Packet loss = 0%, RTA = 0.47 ms [03:02:12] PROBLEM - Host analytics1014 is DOWN: PING CRITICAL - Packet loss = 100% [03:03:22] RECOVERY - Host analytics1014 is UP: PING OK - Packet loss = 0%, RTA = 1.42 ms [03:06:03] RECOVERY - check_job_queue on fenari is OK: JOBQUEUE OK - all job queues below 10,000 [03:09:13] PROBLEM - check_job_queue on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[03:17:21] !log LocalisationUpdate ResourceLoader cache refresh completed at Fri Nov 8 03:17:20 UTC 2013 [03:17:37] Logged the message, Master [03:30:14] PROBLEM - Host analytics1013 is DOWN: PING CRITICAL - Packet loss = 100% [03:32:13] RECOVERY - Host analytics1013 is UP: PING OK - Packet loss = 0%, RTA = 0.46 ms [03:32:13] PROBLEM - Host analytics1011 is DOWN: PING CRITICAL - Packet loss = 100% [03:34:03] RECOVERY - Host analytics1011 is UP: PING OK - Packet loss = 0%, RTA = 0.75 ms [03:41:54] PROBLEM - Host analytics1014 is DOWN: PING CRITICAL - Packet loss = 100% [03:43:23] RECOVERY - Host analytics1014 is UP: PING OK - Packet loss = 0%, RTA = 0.75 ms [04:06:41] (03CR) 10Chad: "Meh, don't have to symlink. But with all the cleanups it'll be way easier to just drop these in as-is." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93622 (owner: 10Chad) [04:08:38] AaronSchulz: ping [04:25:38] (03CR) 10Yurik: [C: 031] Add an extra header for cache variance of W0 banners for proxies. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/88261 (owner: 10Dr0ptp4kt) [04:34:54] (03PS1) 10Chad: nostalgiawiki gets Cirrus [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94316 [04:36:35] (03CR) 10Chad: [C: 032] nostalgiawiki gets Cirrus [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94316 (owner: 10Chad) [04:39:16] (03Merged) 10jenkins-bot: nostalgiawiki gets Cirrus [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94316 (owner: 10Chad) [04:40:36] !log demon synchronized cirrus.dblist [04:40:51] Logged the message, Master [04:42:46] PROBLEM - Host analytics1013 is DOWN: PING CRITICAL - Packet loss = 100% [04:45:16] RECOVERY - Host analytics1013 is UP: PING OK - Packet loss = 0%, RTA = 0.39 ms [04:45:26] PROBLEM - Host analytics1014 is DOWN: PING CRITICAL - Packet loss = 100% [04:46:26] RECOVERY - Host analytics1014 is UP: PING OK - Packet loss = 0%, RTA = 0.48 ms [04:56:36] PROBLEM - SSH on lvs1001 is CRITICAL: Server answer: [04:57:36] RECOVERY - SSH on lvs1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [05:03:06] PROBLEM - Host analytics1014 is DOWN: PING CRITICAL - Packet loss = 100% [05:04:26] RECOVERY - Host analytics1014 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [05:14:39] PROBLEM - Host analytics1011 is DOWN: PING CRITICAL - Packet loss = 100% [05:16:09] RECOVERY - Host analytics1011 is UP: PING OK - Packet loss = 0%, RTA = 0.64 ms [05:59:19] PROBLEM - SSH on lvs1001 is CRITICAL: Server answer: [06:03:19] RECOVERY - SSH on lvs1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [06:10:00] PROBLEM - Host analytics1014 is DOWN: PING CRITICAL - Packet loss = 100% [06:10:30] RECOVERY - Host analytics1014 is UP: PING OK - Packet loss = 0%, RTA = 0.40 ms [06:24:28] !log analytics1012 down, power mgmt firmware initialization error, opened ticket #6238 [06:24:50] Logged the message, Master [06:32:10] PROBLEM - Host analytics1014 is DOWN: PING CRITICAL - Packet 
loss = 100% [06:32:30] RECOVERY - Host analytics1014 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [06:40:22] (PS1) Ori.livneh: Add a githook for logging repo modification on tin [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/94319 [06:43:36] (CR) Ori.livneh: [C: 2] Add a githook for logging repo modification on tin [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/94319 (owner: Ori.livneh) [06:45:02] !log ori updated /a/common to I70534e64e: nostalgiawiki gets Cirrus [06:45:18] Logged the message, Master [06:45:43] hmmm. not quite right. [06:51:39] (PS1) Ori.livneh: Remove '--first-parent' arg from rev-list invocation in logmsg-git-hook [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/94320 [06:51:40] PROBLEM - Host analytics1011 is DOWN: PING CRITICAL - Packet loss = 100% [06:51:58] (CR) Ori.livneh: [C: 2] Remove '--first-parent' arg from rev-list invocation in logmsg-git-hook [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/94320 (owner: Ori.livneh) [06:52:39] !log ori updated /a/common to I3691bbf3a: Remove '--first-parent' arg from rev-list invocation in logmsg-git-hook [06:52:59] Logged the message, Master [06:53:10] RECOVERY - Host analytics1011 is UP: PING OK - Packet loss = 0%, RTA = 0.64 ms [06:54:48] !log ori synchronized logmsg-git-hook 'logmsg-git-hook is meant only for tin, but syncing it for consistency' [06:55:04] Logged the message, Master [06:55:49] ori-l: is it reasonable to wrap the sha with {{Gerrit|sha}} for easy clicking on the SAL? [06:56:23] I thought about that, but remember the message goes to IRC & Twitter, too [06:56:34] and on Wikitech we have a Lua template that does the trick [06:56:35] yeah... [06:56:46] oh?
[06:57:12] the one I botched by mucking up adminbot's hash regexp a while ago [06:57:42] oh right, so it's known that SAL doesn't show your last tests as linkified [06:58:32] i thought it would, tbh [06:58:39] let me look at the Lua template, sec [07:01:34] brr, I misremembered [07:01:44] you're right, the ones that are linkified come with {{Gerrit}} [07:01:54] i guess we should follow the convention of annotating for wikitext [07:01:59] so I'll take the suggestion [07:02:26] wait, [07:03:17] * greg-g waits [07:05:11] PROBLEM - Host analytics1011 is DOWN: PING CRITICAL - Packet loss = 100% [07:05:57] (PS1) Ori.livneh: logmsg-git-hook: wrap Change-Ids in {{Gerrit|...}} [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/94321 [07:06:16] (CR) Ori.livneh: [C: 2] logmsg-git-hook: wrap Change-Ids in {{Gerrit|...}} [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/94321 (owner: Ori.livneh) [07:06:16] RECOVERY - Host analytics1011 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [07:06:16] PROBLEM - puppet disabled on analytics1011 is CRITICAL: Connection refused by host [07:06:45] : [07:06:46] :) [07:07:16] oohhhhh [07:07:16] RECOVERY - puppet disabled on analytics1011 is OK: OK [07:07:27] I missed a '}' [07:07:34] god damn it [07:07:45] obviously shouldn't be self merging [07:07:49] :P [07:08:29] (PS1) Ori.livneh: Fix typo in logmsg-git-hook [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/94322 [07:08:43] (CR) Ori.livneh: [C: 2] Fix typo in logmsg-git-hook [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/94322 (owner: Ori.livneh) [07:09:02] !log ori updated /a/common to {{Gerrit|I027abe363}}: Fix typo in logmsg-git-hook [07:09:06] wee [07:09:16] Logged the message, Master [07:09:38] (PS1) ArielGlenn: add back thm1/2 to dhcp [operations/puppet] - https://gerrit.wikimedia.org/r/94323 [07:10:05] that should come in handy, I hope [07:11:24] incremental improvements, always good
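For illustration, the Change-Id wrapping discussed above could look roughly like the sketch below. This is not the actual logmsg-git-hook, whose source is not shown in the log; the regex and the function name `wrap_change_ids` are assumptions. It treats an abbreviated Change-Id as a capital `I` followed by 7 to 40 hex digits, and skips anything already preceded by `{` or `|` so that running it twice does not double-wrap.

```python
import re

# Assumed shape of an abbreviated Gerrit Change-Id: 'I' plus 7-40 hex digits.
# The lookbehind skips IDs that sit inside a word or are already wrapped.
CHANGE_ID_RE = re.compile(r"(?<![\w{|])(I[0-9a-f]{7,40})\b")


def wrap_change_ids(message):
    """Wrap bare Change-Ids in the {{Gerrit|...}} wikitext template."""
    return CHANGE_ID_RE.sub(r"{{Gerrit|\1}}", message)


raw = "ori updated /a/common to I027abe363: Fix typo in logmsg-git-hook"
wrapped = wrap_change_ids(raw)
```

Generating both closing braces from a single replacement string is one way to avoid the missing '}' slip mentioned in the log.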
[07:11:51] yes, commit + N typo fixes :) [07:11:52] (03CR) 10ArielGlenn: [C: 032] add back thm1/2 to dhcp [operations/puppet] - 10https://gerrit.wikimedia.org/r/94323 (owner: 10ArielGlenn) [07:11:58] very incremental [07:12:09] hey, some of us take smaller steps than others [07:12:13] its ok [07:25:06] PROBLEM - Host analytics1014 is DOWN: PING CRITICAL - Packet loss = 100% [07:25:26] RECOVERY - Host analytics1014 is UP: PING OK - Packet loss = 0%, RTA = 7.74 ms [08:15:02] PROBLEM - Host analytics1014 is DOWN: PING CRITICAL - Packet loss = 100% [08:16:32] RECOVERY - Host analytics1014 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [08:21:40] morning [08:22:02] PROBLEM - Host analytics1011 is DOWN: PING CRITICAL - Packet loss = 100% [08:22:32] morning! [08:23:04] what's up with analytics? [08:23:22] RECOVERY - Host analytics1011 is UP: PING OK - Packet loss = 0%, RTA = 0.73 ms [09:12:01] (03PS5) 10Faidon Liambotis: Add all Asian countries in the list [operations/dns] - 10https://gerrit.wikimedia.org/r/80974 [09:12:02] (03PS1) 10Faidon Liambotis: Switch traffic back to ulsfo [operations/dns] - 10https://gerrit.wikimedia.org/r/94325 [09:12:44] (03CR) 10Faidon Liambotis: [C: 032] Switch traffic back to ulsfo [operations/dns] - 10https://gerrit.wikimedia.org/r/94325 (owner: 10Faidon Liambotis) [09:15:36] (03CR) 10Faidon Liambotis: [C: 032] Add all Asian countries in the list [operations/dns] - 10https://gerrit.wikimedia.org/r/80974 (owner: 10Faidon Liambotis) [09:39:07] PROBLEM - Host analytics1014 is DOWN: PING CRITICAL - Packet loss = 100% [09:42:27] RECOVERY - Host analytics1014 is UP: PING OK - Packet loss = 0%, RTA = 0.58 ms [09:43:37] PROBLEM - search indices - check lucene status page on search1003 is CRITICAL: Connection timed out [09:44:27] RECOVERY - search indices - check lucene status page on search1003 is OK: HTTP OK: HTTP/1.1 200 OK - 269 bytes in 0.001 second response time [09:44:27] PROBLEM - Host analytics1014 is DOWN: PING CRITICAL - Packet loss = 
100% [09:46:27] RECOVERY - Host analytics1014 is UP: PING OK - Packet loss = 0%, RTA = 2.63 ms [09:49:57] PROBLEM - Host analytics1011 is DOWN: PING CRITICAL - Packet loss = 100% [09:51:27] RECOVERY - Host analytics1011 is UP: PING OK - Packet loss = 0%, RTA = 0.66 ms [09:54:57] PROBLEM - Host analytics1014 is DOWN: PING CRITICAL - Packet loss = 100% [09:59:27] RECOVERY - Host analytics1014 is UP: PING OK - Packet loss = 0%, RTA = 2.30 ms [10:27:14] PROBLEM - Host analytics1014 is DOWN: PING CRITICAL - Packet loss = 100% [10:28:32] RECOVERY - Host analytics1014 is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms [10:43:57] (03PS4) 10Mark Bergsma: Repartition eqiad LVS service IPs [operations/dns] - 10https://gerrit.wikimedia.org/r/92343 [10:47:03] (03Abandoned) 10Mark Bergsma: Repartition esams LVS service IPs [operations/dns] - 10https://gerrit.wikimedia.org/r/92344 (owner: 10Mark Bergsma) [10:48:02] PROBLEM - Host analytics1014 is DOWN: PING CRITICAL - Packet loss = 100% [10:48:32] RECOVERY - Host analytics1014 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms [10:48:44] apergos: is this you? [10:57:31] (03PS5) 10Mark Bergsma: Repartition eqiad LVS service IPs [operations/dns] - 10https://gerrit.wikimedia.org/r/92343 [10:58:02] PROBLEM - Host analytics1013 is DOWN: PING CRITICAL - Packet loss = 100% [10:59:12] RECOVERY - Host analytics1013 is UP: PING OK - Packet loss = 0%, RTA = 4.56 ms [11:05:59] PROBLEM - Host analytics1013 is DOWN: PING CRITICAL - Packet loss = 100% [11:09:09] RECOVERY - Host analytics1013 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [11:18:51] what, analytics10**? 
nope [11:19:24] I only looked at 1012 because it was down for 5 hours according to icinga [11:20:15] (PS6) Mark Bergsma: Repartition eqiad LVS service IPs [operations/dns] - https://gerrit.wikimedia.org/r/92343 [11:21:59] (PS1) Mark Bergsma: Change IP of osm-lb.eqiad [operations/dns] - https://gerrit.wikimedia.org/r/94337 [11:22:26] (CR) Mark Bergsma: [C: 2] Change IP of osm-lb.eqiad [operations/dns] - https://gerrit.wikimedia.org/r/94337 (owner: Mark Bergsma) [11:24:24] (PS1) Mark Bergsma: Change osm-lb.eqiad IP address [operations/puppet] - https://gerrit.wikimedia.org/r/94339 [11:26:01] (CR) Mark Bergsma: [C: 2] Change osm-lb.eqiad IP address [operations/puppet] - https://gerrit.wikimedia.org/r/94339 (owner: Mark Bergsma) [11:32:42] (PS1) Mark Bergsma: Add reverse DNS for new upload-lb.eqiad IPs [operations/dns] - https://gerrit.wikimedia.org/r/94341 [11:34:07] (CR) Mark Bergsma: [C: 2] Add reverse DNS for new upload-lb.eqiad IPs [operations/dns] - https://gerrit.wikimedia.org/r/94341 (owner: Mark Bergsma) [11:39:14] (PS1) Mark Bergsma: Move parsoid-lb IPv6 address into the right new range [operations/dns] - https://gerrit.wikimedia.org/r/94342 [11:40:00] (CR) Mark Bergsma: [C: 2] Move parsoid-lb IPv6 address into the right new range [operations/dns] - https://gerrit.wikimedia.org/r/94342 (owner: Mark Bergsma) [11:41:49] PROBLEM - Host analytics1013 is DOWN: PING CRITICAL - Packet loss = 100% [11:42:20] well these reboots are no good...
9 reboots since early this morning on analytics1014, 4 on analytics1011, 2 on analytics1013 [11:42:25] guess that might be 3 now [11:43:09] RECOVERY - Host analytics1013 is UP: PING OK - Packet loss = 0%, RTA = 0.57 ms [11:43:35] yup [11:45:21] (PS1) Mark Bergsma: Add new upload-lb.eqiad IP addresses according to the new Zero scheme [operations/puppet] - https://gerrit.wikimedia.org/r/94343 [11:46:24] (CR) Mark Bergsma: [C: 2] Add new upload-lb.eqiad IP addresses according to the new Zero scheme [operations/puppet] - https://gerrit.wikimedia.org/r/94343 (owner: Mark Bergsma) [11:48:55] (PS1) Mark Bergsma: Add the new upload-lb.eqiad IP addresses to the protoproxies [operations/puppet] - https://gerrit.wikimedia.org/r/94344 [11:49:59] PROBLEM - Host analytics1011 is DOWN: PING CRITICAL - Packet loss = 100% [11:50:30] (CR) Mark Bergsma: [C: 2] Add the new upload-lb.eqiad IP addresses to the protoproxies [operations/puppet] - https://gerrit.wikimedia.org/r/94344 (owner: Mark Bergsma) [11:51:19] RECOVERY - Host analytics1011 is UP: PING OK - Packet loss = 0%, RTA = 0.54 ms [12:07:40] RECOVERY - DPKG on palladium is OK: All packages OK [12:08:00] RECOVERY - Puppetmaster HTTPS on palladium is OK: HTTP OK: Status line output matched 400 - 336 bytes in 1.074 second response time [12:19:21] PROBLEM - Host analytics1011 is DOWN: PING CRITICAL - Packet loss = 100% [12:20:30] RECOVERY - Host analytics1011 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms [12:27:10] PROBLEM - Backend Squid HTTP on sq80 is CRITICAL: Connection refused [12:37:20] PROBLEM - Host analytics1011 is DOWN: PING CRITICAL - Packet loss = 100% [12:39:20] RECOVERY - Host analytics1011 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms [12:40:15] (PS1) Akosiaris: Disallow commit, merge, rebase on backends private [operations/puppet] - https://gerrit.wikimedia.org/r/94348 [12:47:50] PROBLEM - Host analytics1013 is DOWN: PING CRITICAL - Packet loss = 100% [12:48:21] (CR) 
Akosiaris: [C: 2] Disallow commit, merge, rebase on backends private [operations/puppet] - https://gerrit.wikimedia.org/r/94348 (owner: Akosiaris) [12:50:10] RECOVERY - Host analytics1013 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms [12:50:28] so... I am thinking about temporarily sending all ulsfo machines to palladium for puppet as test. Any objections ? [12:51:57] would you want to send one and run puppetd --test on it? or you have already done this? [12:56:14] (PS1) Mark Bergsma: Create a backend_random director, and use it for login requests [operations/puppet] - https://gerrit.wikimedia.org/r/94350 [12:57:08] (CR) jenkins-bot: [V: -1] Create a backend_random director, and use it for login requests [operations/puppet] - https://gerrit.wikimedia.org/r/94350 (owner: Mark Bergsma) [12:58:34] (PS2) Mark Bergsma: Create a backend_random director, and use it for login requests [operations/puppet] - https://gerrit.wikimedia.org/r/94350 [12:58:41] (CR) jenkins-bot: [V: -1] Create a backend_random director, and use it for login requests [operations/puppet] - https://gerrit.wikimedia.org/r/94350 (owner: Mark Bergsma) [13:12:44] apergos: yes i have already done that [13:12:55] nice [13:12:57] mangled cp4001 /etc/hosts [13:13:13] hmm I have no objections, if you are around to check on them in an hour just in case [13:13:22] it would be nice for them to "just work" [13:13:27] I'll be around [13:13:36] yeah write... plug and pray ? [13:13:40] right* [13:13:42] :-D [13:13:48] man... what did I just write ?
[13:13:56] right :-D [13:14:49] so yeah go to town, you can kick me in an hour and I'll run my 'check on puppet runs' script too, which is not guaranteed but will at least give an overview if something went badly awry [13:15:07] though [13:15:25] s/though// (taking it back) [13:29:23] RECOVERY - check_job_queue on fenari is OK: JOBQUEUE OK - all job queues below 10,000 [13:29:52] PROBLEM - Host analytics1014 is DOWN: PING CRITICAL - Packet loss = 100% [13:30:27] well no console messages at the time of reboot, batches of cpu power messages from time to time earlier [13:30:48] no messages in hadoop log right at time of boot (prev msg was a few minutes earlier) [13:31:22] RECOVERY - Host analytics1014 is UP: PING OK - Packet loss = 0%, RTA = 0.85 ms [13:32:32] PROBLEM - check_job_queue on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:38:41] (03PS1) 10Akosiaris: Puppetmaster backends optimizations [operations/puppet] - 10https://gerrit.wikimedia.org/r/94352 [13:39:51] (03CR) 10Akosiaris: [C: 032] Puppetmaster backends optimizations [operations/puppet] - 10https://gerrit.wikimedia.org/r/94352 (owner: 10Akosiaris) [13:48:42] PROBLEM - Host analytics1011 is DOWN: PING CRITICAL - Packet loss = 100% [13:50:22] RECOVERY - Host analytics1011 is UP: PING OK - Packet loss = 0%, RTA = 0.76 ms [13:59:08] (03PS1) 10Arav93: Renamed $wmfConfigDir to $wmgConfigDir in mediawiki-config [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94354 [13:59:47]
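As a rough cross-check on the reboot counts quoted earlier, one could tally the Icinga "Host ... is DOWN" alerts per host from a saved copy of this log. The sketch below assumes the alert wording seen in this channel; the function name `count_down_events` is made up for the example. A DOWN alert is only a proxy for a reboot (a host can miss pings without rebooting), so treat the totals as an upper-bound hint rather than a measurement.

```python
import re
from collections import Counter

# Matches the host-down alerts as they appear in this channel's log.
DOWN_RE = re.compile(r"PROBLEM - Host (\S+) is DOWN")


def count_down_events(log_text):
    """Count 'Host X is DOWN' alerts per host name."""
    return Counter(DOWN_RE.findall(log_text))


sample = (
    "[13:29:52] PROBLEM - Host analytics1014 is DOWN: PING CRITICAL - Packet loss = 100% "
    "[13:31:22] RECOVERY - Host analytics1014 is UP: PING OK - Packet loss = 0%, RTA = 0.85 ms "
    "[13:48:42] PROBLEM - Host analytics1011 is DOWN: PING CRITICAL - Packet loss = 100% "
    "[13:32:32] PROBLEM - check_job_queue on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds."
)
counts = count_down_events(sample)
```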