[00:00:10] apergos, mutante ^^ should protect us from some of this until we end up ripping out the ancient compat mode that led to the "can't find scap" part of the issue. [00:00:16] awesome [00:02:01] I have to leave so that I don't leave my partner at the bus station far away on a Friday night. Do you all need anything else from me? [00:02:24] no_justification: can you be around to test/take up where thcipriani left off, in case it's needed? [00:02:54] 67% [00:02:54] (note it's 2 am here, so... just sayin :-P) [00:02:55] Operations, Ops-Access-Requests, Patch-For-Review, User-Addshore: Requesting access to analytics-privatedata-users group for Jonas Kress - https://phabricator.wikimedia.org/T182908#3842053 (RobH) Everything on this looks good, just pending the 3 day wait for objections. This ends on Tuesday, 20... [00:03:18] I can be available after 45 minutes of driving, FWIW. [00:03:26] 70.7% (338/478) success ratio [00:03:38] thcipriani: go do what you gotta do [00:03:42] yea, run! [00:03:48] thanks :) [00:04:23] I'm around [00:04:36] closing laptop, I'll check back in when I can [00:04:40] ok. probably want you to at least test, once we think the right versions are in place [00:04:47] k, see ya thcip [00:05:24] mutante: use disable-puppet reason, not puppet --disable ;) [00:05:33] volans|off: :-P :-P [00:05:36] see wikitech cumin for an example [00:05:57] it's gonna be enabled again in 10 minutes with any luck [00:06:50] yes but you'll also enable it on hosts that might have it disabled for other reasons [00:06:59] and should not be re-enabled [00:07:07] does enable check for the message? [00:07:11] volans|off: !ok [00:07:20] ie can you somehow specify to... [00:07:27] oh, I see what you would have in mind. [00:07:35] well, live and learn [00:07:43] * apergos thinks about it [00:08:21] the scripts are in our puppet, easier to read them than explain them at 1am ;) [00:08:50] cumin magic! and pretty cool magic at that [00:09:02] * volans|off note to self improve puppet scripts doc [00:09:37] uh huh [00:09:51] at 2 am I admit I'm even less in shape for it [00:10:40] !log no more scap 3.7.4-2 found across 'R:Package = scap' (T183046) [00:10:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:10:53] T183046: scap 3.7.4-2 is broken - https://phabricator.wikimedia.org/T183046 [00:11:48] mutante: do you need to verify also if there is any scap installed not by puppet? [00:16:18] checks scap version on * [00:16:43] you can add "and not" the previous query ;) [00:16:44] aborted.. [00:17:41] We should make sure /srv/deployment/scap* is gone from everywhere too. It *should* be, but good to know for sure [00:17:56] Er, it's not [00:17:58] Figures [00:18:11] cumin -x '* and not R:Package = scap' 'dpkg -l | grep scap' [00:19:05] confirmed there is no 3.7.4-2 on * [00:19:53] ok I guess I can go to sleep then :) [00:20:03] nighty night [00:21:18] you too, see ya [00:22:28] so I should try deploying now?
:D [00:23:51] mutante, no_justification: ^ [00:25:25] !log demon@tin Synchronized README: Testing (duration: 00m 57s) [00:25:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:25:36] apergos: 3 exceptions found [00:25:39] Well that's progress [00:26:01] let's have 'em [00:26:08] labweb1001-1003 [00:26:18] are the only ones that had puppet disabled longer than this [00:26:23] 1001-1002 [00:26:28] and tin [00:26:29] so just re-disable em right after [00:26:33] yea [00:26:56] !log re-enabling puppet on scap hosts [00:27:05] legoktm: give us just a tiny bit more time please? [00:27:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:27:11] ok [00:27:13] like 3 minutes I swear [00:27:25] I do need to leave in ~30 min, but someone else can sync it out if it comes to that [00:27:28] and it would be awesome of the first scap were a test to make sure stuff is working [00:27:35] *if the [00:28:08] !log re-disabled puppet on labweb1001/labweb1002 (as it was before) [00:28:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:28:32] RECOVERY - nutcracker port on labweb1002 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [00:29:00] is tin safe to re-enable or leave enabled or whatever? [00:29:02] RECOVERY - Apache HTTP on labweb1002 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 620 bytes in 0.639 second response time [00:29:03] RECOVERY - nutcracker process on labweb1002 is OK: PROCS OK: 1 process with UID = 113 (nutcracker), command name nutcracker [00:29:05] ^ i don't know how.. puppet is disabled [00:29:12] I don't remember what is going on there [00:29:12] RECOVERY - HHVM rendering on labweb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 74414 bytes in 0.217 second response time [00:29:33] RECOVERY - HHVM processes on labweb1002 is OK: PROCS OK: 6 processes with command name hhvm [00:29:38] there was no reason specified and on 1001 the reason was "this is serving traffic somehow?" [00:30:06] it has 3.7.4-2 [00:30:13] and puppet just ran [00:30:17] on tin I mean [00:30:18] i re-enabled tin [00:30:20] yea [00:30:20] ah ok [00:30:29] so wait... 3.7.4-? [00:30:31] 2? [00:30:50] yeah it does [00:30:52] so uh? [00:31:07] well, duh [00:31:14] stupid :) [00:31:18] we need to merge [00:32:06] (PS2) Dzahn: Revert "Scap: bump version to 3.7.4-2" [puppet] - https://gerrit.wikimedia.org/r/398603 (owner: Thcipriani) [00:32:10] legoktm: might be another 5-10 mins, sorry [00:32:28] is doing that again but quicker this time :p [00:33:08] I already ran a test ;-) [00:33:14] no_justification: cool [00:33:41] !log demon@tin Synchronized README: Testing (duration: 00m 57s) [00:33:43] :) [00:35:43] i am downgrading again.. [00:35:47] yep I thought [00:35:58] hopefully it is only a handful of hosts now though [00:36:09] how many would have had time to run puppet [00:36:30] yea. but the group was surprisingly large [00:36:53] ok, done.. NOW merge [00:36:56] heh [00:36:58] (CR) Dzahn: [C: 2] Revert "Scap: bump version to 3.7.4-2" [puppet] - https://gerrit.wikimedia.org/r/398603 (owner: Thcipriani) [00:38:09] merged on master, re-enabling puppet [00:38:42] labweb1001/1002 off again.. that's it [00:38:46] awesome [00:38:50] no_justification: NOW test pelase [00:38:52] *please [00:39:07] should be ok but still [00:39:45] mutante what do you think about pulling 3.7.4-2 from the repo and putting 3.7.4-1 back in? yes, no?
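A note on the puppet mechanics in the exchange above: the disable-puppet/enable-puppet helpers that volans points at record the operator's reason in puppet's agent lock file, and re-enabling only succeeds when the stored reason matches. That is why a fleet-wide re-enable correctly surfaced labweb1001/1002 and tin as exceptions: they had been disabled earlier for unrelated reasons. A minimal sketch of that behaviour (the real helpers live in operations/puppet and differ in detail; the reason string and the use of jq here are illustrative):

    LOCK=/var/lib/puppet/state/agent_disabled.lock
    REASON="scap broken T183046 - dzahn"   # illustrative reason string

    # disabling: puppet itself stores the message in the lock file
    puppet agent --disable "$REASON"

    # enabling: only clear the lock if it was set with *our* reason, so
    # hosts disabled earlier for other work stay disabled
    current=$(jq -r .disabled_message "$LOCK" 2>/dev/null)
    if [ "$current" = "$REASON" ]; then
        puppet agent --enable
    else
        echo "not enabling: disabled for a different reason: $current" >&2
    fi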
[00:40:03] !log demon@tin Synchronized README: Testing again, this time with feeling (duration: 00m 56s) [00:40:05] first of all.. why does the output keep changing [00:40:10] when i keep re-checking the version [00:40:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:40:16] eh? [00:41:14] yea.. [00:41:15] thorium [00:41:20] why... [00:41:28] is that a trusty? [00:41:36] no, jessie [00:44:58] !log demon@tin Synchronized php-1.31.0-wmf.12/extensions/LoginNotify/includes/LoginNotify.php: T182867 (duration: 00m 57s) [00:45:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:45:07] T182867: "Login to Wikidata as QuickStatementsBot from a computer you have not recently used" - https://phabricator.wikimedia.org/T182867 [00:45:37] legoktm: Sync'd for you [00:45:43] ty :) [00:45:56] there is _still_ the newer version on some hosts .. sigh [00:45:59] * legoktm waits for emails that will hopefully never come so he can close this as resolved [00:46:05] it didn't re-enable puppet on everything either [00:46:09] keeps checking [00:46:12] PROBLEM - puppet last run on snapshot1005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[scap] [00:46:36] thank you mutante and apergos and volans|off :) (and no_justification and thcipriani|afk too) [00:46:49] we're not quite done :-( [00:46:52] and I really apologize [00:47:02] well, the puppet thing is changed, so there is no reason for those to come back now [00:47:06] I went ahead and sync'd his stuff as another test. [00:47:07] right [00:47:11] So he's unblocked [00:47:41] legoktm: But really, thank you! I'd rather you notice this today than everyone start disappearing for the holidays and it get discovered when someone tried to do an emergency deployment [00:47:46] can you verify that it made it around everywhere, no_justification, though? [00:47:49] ^ this ! [00:48:20] Verify that...what? The package got downgraded? Best I can tell it has, best scap can tell [00:48:23] ok, at this point all I can suggest is to do host lists, [00:48:26] There weren't any failures. [00:48:32] PROBLEM - puppet last run on eventlog1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[scap] [00:48:34] make sure the new version goes away and the old one gets in [00:48:42] PROBLEM - puppet last run on snapshot1006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[scap] [00:48:44] tedious but I don't see an alternative [00:48:52] * apergos scowls at the snapshots [00:49:03] eventlog1001 ? did you do that? [00:49:06] nope [00:49:07] I don't know how I would possibly verify that except by hand? [00:49:08] that's 3.7.3-1 [00:49:13] Over....400 servers? [00:49:14] I shall go look at the snapshots however [00:49:24] no_justification: no obviously that's no good [00:50:11] I'm pretty dang sure it's ok for mw* and related stuff [00:50:20] Basically anything a MW sync would hit [00:50:59] So everything in /etc/dsh/group/mediawiki-installation seems ok [00:51:08] mw1193 [00:51:13] there's a not exactly random sample [00:51:13] i repeated that stuff.. i am manually fixing labweb1002 [00:51:15] is it there?
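Verifying a fleet-wide downgrade "by hand" over 400 servers is exactly what cumin exists for; the check sketched below combines the two queries already used in the log above (assuming the standard cumin grammar documented on wikitech; the exact invocations run that night may have differed):

    # every host where puppet manages the scap package, with its installed version
    sudo cumin 'R:Package = scap' "dpkg-query -W -f='\${Version}\n' scap"

    # the inverse: hosts where scap is installed but puppet does NOT manage it
    sudo cumin '* and not R:Package = scap' 'dpkg -l scap | grep ^ii'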
[00:51:20] that seemed the only exception on _this_ run [00:51:29] dpkg: error: cannot access archive '/var/cache/apt/archives/scap_3.7.4-1_all.deb': No such file or directory [00:51:33] labweb1002 [00:51:42] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[scap] [00:51:51] labweb1002 is stretch [00:51:57] the only stretch host with scap ? [00:52:11] nah, far from it [00:52:21] E: Version '3.7.4-1' for 'scap' was not found [00:52:23] snapshots [00:52:30] why? because it's not in the trusty repo of course [00:52:56] sca1004: error due to the puppet change [00:53:00] it will be that for all the trusties that don't have the file in /var/lib/whatsit [00:53:01] yep [00:53:14] grrrr [00:53:28] so my previous question: obviously they are stable as to scap itself [00:53:39] but do we want to pull 3.7.4-2 from the repo and put 3.7.4-1? [00:53:40] or... [00:53:42] PROBLEM - puppet last run on californium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[scap] [00:53:52] PROBLEM - puppet last run on snapshot1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[scap] [00:53:53] actually I hate that idea [00:53:54] I think that's a good idea, yes [00:54:00] lol [00:54:04] well because [00:54:13] the trusty hosts can't install [00:54:26] oh. they'll just silently fail out like they did on 3.7.4-2 [00:54:27] yeah [00:54:28] let's [00:54:40] mutante, what do you think? [00:55:01] i am thinking "how did they do this the other times a scap version changed" ? [00:55:06] hahahaha [00:55:50] well, yea, we can delete it [00:55:53] it's broken [00:56:06] there will have to be 3.7.4-3 [00:56:26] We can't go back to -1? [00:56:47] why can't we [00:56:51] just go to 1, it's in puppet [00:57:07] apt-get update on all the hosts will get the new list [00:57:14] well, there was a reason to try and make a -2 version ? [00:57:20] but something went wrong [00:57:23] yes but that can be rebuilt later [00:57:31] as -2 or -3 or whatever [00:57:44] definitely not the same number though [00:57:52] it will just give you reprepro trouble [00:58:11] ok well you probably have more experience there [00:59:45] actually i don't know how you remove just the latest version and downgrade [01:01:02] the other versions will already be gone [01:01:17] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[scap] [01:01:23] should be able to check that by listing [01:02:09] do you know who imported the last one? [01:03:09] i see lots of older versions in other home dirs [01:03:44] basically everything up to 3.7.3 but not this [01:03:47] PROBLEM - puppet last run on silver is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[scap] [01:05:18] hrm [01:08:17] PROBLEM - puppet last run on labtestweb2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[scap] [01:09:57] * apergos whistles cheerfully [01:11:17] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures.
Failed resources (up to 3 shown): Package[scap] [01:11:17] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[scap] [01:11:43] !log reprepro remove trusty-wikimedia scap [01:11:47] PROBLEM - puppet last run on snapshot1007 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[scap] [01:11:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:15:56] !log reprepro removing scap 3.7.4-2 package, attempting to reimport 3.7.4-1 package [01:16:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:18:02] ERROR: Unexpected content of file './scap_3.7.4.orig.tar.gz'! [01:26:28] where could i find the "old" orig.tar.gz [01:26:32] for the last version [01:38:12] I was going to say something about how ensure=$version seems like a not-great practice, that we avoid elsewhere [01:38:23] (or ensure=>latest) [01:38:38] but ensure=>latest seems to have 91 hits in "git grep" :/ [01:38:50] we still don't like it [01:39:57] it's hard to easily pull out a state for ensure=>$version, because there's a lot of use-cases for variables on the RHS there that don't boil down to fixed versions [01:40:02] s/state/stat/ [01:40:30] well, so fixed the md5/sha1/sha256 sums in the dsc file and the changes file [01:40:34] now "Could not find any key matching '09DBD9F93F6CD44A'! [01:42:36] !log reimported scap 3.7.4-1 into APT (jessie-wikimedia) after fixing md5/sha sums in .dsc and .changes files to match orig.tar.gz [01:42:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:46:55] one more time... this time with the right GPG key found [01:49:14] !log reimported scap 3.7.4-1 into APT (jessie-wikimedia) after fixing md5/sha sums in .dsc and .changes files to match orig.tar.gz | copied it from jessie-wikimedia to trusty and stretch-wikimedia. all distributions downgraded to 3.7.4-1 (T183046) [01:49:20] scap | 3.7.4-1 | trusty-wikimedia | amd64, i386, source [01:49:20] scap | 3.7.4-1 | jessie-wikimedia | amd64, i386, source [01:49:20] scap | 3.7.4-1 | stretch-wikimedia | amd64, i386, source [01:49:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:49:24] T183046: scap 3.7.4-2 is broken - https://phabricator.wikimedia.org/T183046 [01:51:02] Notice: /Stage[main]/Scap/Package[scap]/ensure: ensure changed '3.7.3-1' to '3.7.4-1' [01:51:05] ^ yay! [01:51:08] RECOVERY - puppet last run on snapshot1005 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [01:51:08] eventlog1001 [01:51:17] that was me: puppet runs clean [01:51:31] there are gonna be a ton of recovery spams in here in 3...2...1 [01:51:33] ran it too [01:51:56] anyways, done at last! [01:52:14] I might wander off because [01:52:18] it's about 4 am now [01:52:20] unless we have to run apt-get update [01:52:22] on all [01:52:29] ah I did that without thinking on snapshot1005 [01:52:33] yeah prolly good idea [01:52:33] same [01:53:19] doesn't puppet do an apt-get update in there someplace?
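Putting the repo surgery from the last hour together in one place, roughly (a sketch run on the apt repository host; the checksum edit and the GPG key id are taken from the log above, while the file names and the debsign step are assumptions):

    # drop the broken 3.7.4-2 everywhere it was published
    reprepro remove trusty-wikimedia scap
    reprepro remove jessie-wikimedia scap
    reprepro remove stretch-wikimedia scap

    # fix the md5/sha1/sha256 sums in the .dsc and .changes files so they
    # match scap_3.7.4.orig.tar.gz, then re-sign and import the old build
    debsign -k 09DBD9F93F6CD44A scap_3.7.4-1_amd64.changes
    reprepro include jessie-wikimedia scap_3.7.4-1_amd64.changes

    # publish the same build to the other distributions, then verify
    reprepro copy trusty-wikimedia jessie-wikimedia scap
    reprepro copy stretch-wikimedia jessie-wikimedia scap
    reprepro ls scap    # expect one 3.7.4-1 row per distribution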
[01:53:27] RECOVERY - puppet last run on eventlog1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [01:53:37] RECOVERY - puppet last run on californium is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [01:53:43] yes, it does :) [01:53:58] all right, so this run or the next they'll start flooding us [01:54:07] I'll be long gone, heh [01:55:05] it's only 15 .. [01:55:06] Hey guys, I'm getting scap puppet errors on cloud vps [01:55:34] Zppix: could you try: apt-get update and run puppet again [01:55:40] it should have been fixed just now [01:55:51] If i could ssh in yes :/ [01:55:53] oh wait [01:56:01] they probably have the broken scap version, yea [01:56:26] beta should be running master [01:56:30] So unrelated [01:57:48] fixes silver, snapshot1001 [02:00:48] no_justification: should labweb1002 be running the 3.7.4-1 version? ie remove, install via dpkg? [02:01:09] yea, that's the only exception. i was gonna just do that [02:01:11] (puppet disabled because that's how it was before everything started) [02:01:17] ok [02:01:17] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [02:01:35] labweb* are already busted anyway [02:01:44] apergos: what i really wanted is that confirmation there was only that 1 case left [02:01:49] it's all I saw [02:01:53] :) great [02:02:12] my eyes are 4 am eyes though, do bear in mind [02:02:18] still, they made it this far! :-D [02:02:43] gotta copy that file from elsewhere.. one sec [02:02:49] There's a task but I can't find it [02:03:17] copy which? apt-get update won't just find it in the repo? or...? [02:03:26] nm you know what you're doing, my brain already checked out [02:03:38] RECOVERY - puppet last run on silver is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [02:03:57] RECOVERY - puppet last run on snapshot1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [02:04:29] no_justification: no worries [02:04:45] just as long as we don't leave it in a worse state than it was [02:04:45] !log labweb1002 - manually downgrade to scap 3.7.4-1 (disabled puppet) [02:04:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:05:18] apergos: thank you for all the help [02:05:23] you should really get sleep now [02:05:25] thanks for doing all the work [02:05:27] it's done i would say [02:05:29] yep I'm outta here [02:05:32] good night! [02:05:36] have a weekend! [02:05:39] you too!
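The labweb1002 one-off works because dpkg does not care about puppet: with the agent deliberately left disabled, the downgrade is done by hand. Two equivalent routes, sketched below; the .deb filename matches the error seen earlier, but which route was actually taken isn't in the log beyond "gotta copy that file from elsewhere":

    # via apt, once the fixed repo is visible to the host
    apt-get update
    apt-get install --allow-downgrades scap=3.7.4-1

    # or via a known-good .deb copied over from another host
    dpkg -i scap_3.7.4-1_all.deb
    dpkg -l scap    # confirm: ii  scap  3.7.4-1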
[02:08:08] RECOVERY - puppet last run on labtestweb2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [02:11:08] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [02:11:08] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [02:11:13] ok, icinga is clean (from this type of error) [02:11:28] and it was even confirmed on another channel that cloud vps was fine again [02:11:31] out [02:11:38] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [02:11:47] RECOVERY - puppet last run on snapshot1007 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [02:13:38] RECOVERY - puppet last run on snapshot1006 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [02:15:20] Operations, Scap, Patch-For-Review: scap 3.7.4-2 is broken - https://phabricator.wikimedia.org/T183046#3842160 (Dzahn) p:Unbreak!>High the whole cluster has been downgraded to 3.7.4-1 , checked with cumin the broken version has been removed from APT ``` [install1002:~] $ sudo -E reprepro... [03:22:08] (CR) Thcipriani: [C: 1] scap: Set bin_dir globally to /usr/bin [puppet] - https://gerrit.wikimedia.org/r/398606 (https://phabricator.wikimedia.org/T183046) (owner: Chad) [03:24:17] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 796.20 seconds [03:50:27] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 128.71 seconds [08:31:27] PROBLEM - Host bohrium is DOWN: PING CRITICAL - Packet loss = 100% [08:31:27] PROBLEM - Host actinium is DOWN: PING CRITICAL - Packet loss = 100% [08:31:28] PROBLEM - Host releases1001 is DOWN: PING CRITICAL - Packet loss = 100% [08:31:38] PROBLEM - Host sca1004 is DOWN: PING CRITICAL - Packet loss = 100% [08:31:38] PROBLEM - Host fermium is DOWN: PING CRITICAL - Packet loss = 100% [08:31:38] PROBLEM - Host dysprosium is DOWN: PING CRITICAL - Packet loss = 100% [08:31:38] PROBLEM - Host etcd1002 is DOWN: PING CRITICAL - Packet loss = 100% [08:31:38] PROBLEM - Host etcd1005 is DOWN: PING CRITICAL - Packet loss = 100% [08:31:47] PROBLEM - Host webperf1001 is DOWN: PING CRITICAL - Packet loss = 100% [08:31:57] PROBLEM - Host neon is DOWN: PING CRITICAL - Packet loss = 100% [08:32:38] PROBLEM - SSH on ganeti1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:35:07] PROBLEM - Request latencies on argon is CRITICAL: CRITICAL - apiserver_request_latencies is 455420 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [08:35:07] RECOVERY - Host etcd1005 is UP: PING OK - Packet loss = 0%, RTA = 2.68 ms [08:35:07] RECOVERY - Host sca1004 is UP: PING OK - Packet loss = 0%, RTA = 2.67 ms [08:35:08] RECOVERY - Host fermium is UP: PING OK - Packet loss = 0%, RTA = 2.66 ms [08:35:08] RECOVERY - Host releases1001 is UP: PING OK - Packet loss = 0%, RTA = 3.12 ms [08:35:08] RECOVERY - Host dysprosium is UP: PING OK - Packet loss = 0%, RTA = 2.64 ms [08:35:08] RECOVERY - Host webperf1001 is UP: PING OK - Packet loss = 0%, RTA = 2.72 ms [08:35:10] RECOVERY - Host neon is UP: PING OK - Packet loss = 0%, RTA = 2.59 ms [08:35:11] RECOVERY - Host etcd1002 is UP: PING OK - Packet loss = 0%, RTA = 2.62 ms [08:35:17] RECOVERY - Host actinium is UP: PING OK - Packet loss = 0%, RTA = 2.76 ms [08:35:37] RECOVERY - SSH on ganeti1005 is
OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0) [08:35:47] RECOVERY - Host bohrium is UP: PING OK - Packet loss = 0%, RTA = 1.52 ms [08:36:07] RECOVERY - Request latencies on argon is OK: OK - apiserver_request_latencies is 10493 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [11:04:23] Operations, Dumps-Generation, HHVM, Patch-For-Review: Convert snapshot hosts to use HHVM and trusty - https://phabricator.wikimedia.org/T94277#3842337 (ArielGlenn) This is probably going to turn into "update to stretch and php7", but we're waiting on the RFC last call, due to close Jan 10th, befo... [12:10:23] (PS1) ArielGlenn: use cat to recombine gzipped files together [dumps] - https://gerrit.wikimedia.org/r/398634 (https://phabricator.wikimedia.org/T182572) [12:15:01] (CR) ArielGlenn: [C: 2] use cat to recombine gzipped files together [dumps] - https://gerrit.wikimedia.org/r/398634 (https://phabricator.wikimedia.org/T182572) (owner: ArielGlenn) [12:16:32] !log ariel@tin Started deploy [dumps/dumps@faf7de8]: use cat to recombine gzipped files [12:16:34] !log ariel@tin Finished deploy [dumps/dumps@faf7de8]: use cat to recombine gzipped files (duration: 00m 02s) [12:16:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:08] (PS1) ArielGlenn: Revert "use cat to recombine gzipped files together" [dumps] - https://gerrit.wikimedia.org/r/398637 [12:28:00] (PS2) ArielGlenn: Revert "use cat to recombine gzipped files together" [dumps] - https://gerrit.wikimedia.org/r/398637 [12:28:58] (CR) ArielGlenn: [C: 2] Revert "use cat to recombine gzipped files together" [dumps] - https://gerrit.wikimedia.org/r/398637 (owner: ArielGlenn) [12:29:45] !log ariel@tin Started deploy [dumps/dumps@95dbfe6]: revert previous deploy [12:29:47] !log ariel@tin Finished deploy [dumps/dumps@95dbfe6]: revert previous deploy (duration: 00m 02s) [12:29:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:07] PROBLEM - MariaDB Slave SQL: s8 on db2045 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1062, Errmsg: Could not execute Update_rows_v1 event on table wikidatawiki.tag_summary: Duplicate entry 182949546 for key tag_summary_rev_id, Error_code: 1062: handler error HA_ERR_FOUND_DUPP_KEY: the events master log db1071-bin.005998, end_log_pos 237690866 [14:40:08] PROBLEM - MariaDB Slave SQL: s5 on db2052 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1062, Errmsg: Could not execute Update_rows_v1 event on table wikidatawiki.tag_summary: Duplicate entry 182949546 for key tag_summary_rev_id, Error_code: 1062: handler error HA_ERR_FOUND_DUPP_KEY: the events master log db1070-bin.001655, end_log_pos 808458079 [14:44:38] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1949 bytes in 0.127 second response time [14:47:57] PROBLEM - MariaDB Slave Lag: s5 on db2089 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 605.33 seconds [14:48:07] PROBLEM - MariaDB Slave Lag: s5 on db2059 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 615.59 seconds [14:48:07] PROBLEM - MariaDB Slave Lag: s5 on db2075 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 618.47 seconds [14:48:08] PROBLEM - MariaDB Slave Lag: s8 on
db2085 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 620.34 seconds [14:48:08] PROBLEM - MariaDB Slave Lag: s5 on db2084 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 620.51 seconds [14:48:08] PROBLEM - MariaDB Slave Lag: s8 on db2081 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 621.65 seconds [14:48:08] PROBLEM - MariaDB Slave Lag: s8 on db2086 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 621.67 seconds [14:48:08] PROBLEM - MariaDB Slave Lag: s5 on db2052 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 622.19 seconds [14:48:27] PROBLEM - MariaDB Slave Lag: s5 on db2066 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 638.74 seconds [14:48:28] PROBLEM - MariaDB Slave Lag: s8 on db2045 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 641.27 seconds [14:48:37] PROBLEM - MariaDB Slave Lag: s8 on db2080 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 647.47 seconds [14:50:42] <_joe_> uhmmmm [14:51:07] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [14:52:47] I am taking a look [14:53:05] I was on it before you pinged me [14:53:12] <_joe_> jynus: yeah the errors on mediawiki are about "could not wait for replica" [14:53:24] ? [14:53:50] <_joe_> which is surprising [14:53:54] <_joe_> [{exception_id}] {exception_url} Wikimedia\Rdbms\DBReplicationWaitError from line 373 of /srv/mediawiki/php-1.31.0-wmf.12/includes/libs/rdbms/lbfactory/LBFactory.php: Could not wait for replica DBs to catch up to db1070 [14:54:04] _joe_, I told you loadbalancer is broken [14:54:12] <_joe_> well even cross-dc? [14:54:15] <_joe_> I thought not [14:54:23] in all thinkable ways [14:54:28] <_joe_> sigh [14:54:37] <_joe_> can I do anything to help? [14:56:11] (PS1) Jcrespo: mariadb: Depool db1100, broken [mediawiki-config] - https://gerrit.wikimedia.org/r/398645 [14:56:12] I am depooling db1100 [14:56:20] being a human load balancer [14:56:26] <_joe_> :/ [14:56:30] can you +1 it? [14:56:32] <_joe_> so the problem is not with codfw [14:56:49] I do not know what is the problem [14:57:01] I am doing that, even if I shouldn't do it [14:57:10] shouldn't have to do it [14:57:13] <_joe_> why are you also changing the weight on db1082 ?
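Background on the breakage above: HA_ERR_FOUND_DUPP_KEY means the replica already holds the row that the replicated Update_rows event is trying to produce, so the SQL thread stops rather than silently diverge. The textbook remedy on the stuck replica looks like the sketch below; this is standard MariaDB practice, not necessarily what was done here, and it is only safe after confirming the existing row matches what the event would have written:

    # see why and where the SQL thread stopped
    mysql -e 'SHOW SLAVE STATUS\G' | grep -E 'Last_SQL_Error|Exec_Master_Log_Pos'

    # skip exactly one offending event and resume replication
    mysql -e 'SET GLOBAL sql_slave_skip_counter = 1; START SLAVE;'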
[14:57:25] <_joe_> oh I see [14:57:26] <_joe_> sorry [14:57:33] (CR) Giuseppe Lavagetto: [C: 1] mariadb: Depool db1100, broken [mediawiki-config] - https://gerrit.wikimedia.org/r/398645 (owner: Jcrespo) [14:57:48] (CR) Jcrespo: [C: 2] mariadb: Depool db1100, broken [mediawiki-config] - https://gerrit.wikimedia.org/r/398645 (owner: Jcrespo) [14:58:02] (CR) jenkins-bot: mariadb: Depool db1100, broken [mediawiki-config] - https://gerrit.wikimedia.org/r/398645 (owner: Jcrespo) [14:58:05] _joe_, imagine you had to manually depool app servers [14:58:09] when one breaks [14:58:16] <_joe_> jynus: yeah I hear you [14:58:21] <_joe_> this is super painful [14:58:23] <_joe_> and wrong [15:00:14] and to depool app servers, you would have to make a patch and review it :-) [15:00:50] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1100 (duration: 00m 58s) [15:01:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:38] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1922 bytes in 0.092 second response time [15:09:08] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [15:19:17] RECOVERY - MariaDB Slave SQL: s5 on db2052 is OK: OK slave_sql_state Slave_SQL_Running: Yes [15:26:18] RECOVERY - MariaDB Slave Lag: s5 on db2052 is OK: OK slave_sql_lag Replication lag: 57.60 seconds [15:26:37] RECOVERY - MariaDB Slave Lag: s5 on db2066 is OK: OK slave_sql_lag Replication lag: 7.74 seconds [15:27:07] RECOVERY - MariaDB Slave Lag: s5 on db2089 is OK: OK slave_sql_lag Replication lag: 0.49 seconds [15:27:08] RECOVERY - MariaDB Slave Lag: s5 on db2059 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [15:27:17] RECOVERY - MariaDB Slave Lag: s5 on db2075 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [15:27:17] RECOVERY - MariaDB Slave Lag: s5 on db2084 is OK: OK slave_sql_lag Replication lag: 0.42 seconds [16:10:18] RECOVERY - MariaDB Slave SQL: s8 on db2045 is OK: OK slave_sql_state Slave_SQL_Running: Yes [16:28:57] RECOVERY - MariaDB Slave Lag: s8 on db2080 is OK: OK slave_sql_lag Replication lag: 36.29 seconds [16:29:28] RECOVERY - MariaDB Slave Lag: s8 on db2081 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [16:29:28] RECOVERY - MariaDB Slave Lag: s8 on db2085 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [16:29:37] RECOVERY - MariaDB Slave Lag: s8 on db2086 is OK: OK slave_sql_lag Replication lag: 0.21 seconds [16:29:48] RECOVERY - MariaDB Slave Lag: s8 on db2045 is OK: OK slave_sql_lag Replication lag: 0.08 seconds [17:06:17] PROBLEM - Check Varnish expiry mailbox lag on cp4026 is CRITICAL: CRITICAL: expiry mailbox lag is 2008335 [17:06:30] no_justification, does fixing of accounts for https://phabricator.wikimedia.org/T152640 require taking gerrit offline? or is it just an admin using gsql to add a db row and flush caches? [17:09:30] No offlineness [17:11:29] Upgrading to 2.14 should make it go away forever tho [17:13:48] Well, in the meantime, ashley's been locked out of his account for almost two weeks now. [17:14:15] Oh shit I forgot about that. [17:14:32] Lemme find my laptop [17:15:27] Ah. >.< [17:16:12] Also: I shouldn't be the only one who can (or should) do these....
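The "human load balancer" step above is a config deploy rather than a database operation: the broken replica is taken out of the load map in mediawiki-config and the file is synced out, which is what change 398645 and the Synchronized line show. Schematically (the array entry and weight are illustrative, not the actual diff):

    # wmf-config/db-eqiad.php, in the affected section's load map:
    #     'db1100' => 300,                       # before
    #     # 'db1100' => 300, # broken, depooled  # after

    # then, from the deployment host:
    scap sync-file wmf-config/db-eqiad.php 'Depool db1100'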
[17:16:16] * no_justification mutters some profanity [17:16:24] no_justification, in the meantime, maybe provide the other admins with instructions to fix accounts? [17:16:34] It's...on....the.....task [17:21:14] ...spell it out more for the slow kids? >.> [17:21:35] !log gerrit: halting service momentarily for account reindexing [17:21:40] Although I would hope the actual admins are less slow than I am. [17:21:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:18] So wait [17:22:33] No offlineness [17:22:35] But it's offline [17:23:18] Yeah, because I changed my mind [17:23:26] And decided to do an offline reindex of accounts [17:23:34] ok [17:23:37] PROBLEM - SSH access on cobalt is CRITICAL: connect to address 208.80.154.85 and port 29418: Connection refused [17:23:44] ^ That's me, it'll be back [17:24:35] you rebooting the box? [17:24:37] RECOVERY - SSH access on cobalt is OK: SSH OK - GerritCodeReview_2.13.9-2-g99a8c8bc51-dirty (SSHD-CORE-1.2.0) (protocol 2.0) [17:24:49] oh no wait, port 29418 [17:25:10] didn't expect icinga to monitor for that [17:25:43] Yeah, it monitors the service(s) [17:26:07] PROBLEM - puppet last run on db1095 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [17:26:24] And as always, the cascading failures due to git pulls in puppet [17:27:13] !log gerrit: Back, might see a few transient puppet failures if git pulls happened during the d/t, but should all recover [17:27:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:27] Is that the solution, then? Just reindex and it's fixed? [17:47:15] No, re-add the missing entry to the database, then reindex [17:48:24] Ah, okay. [17:48:26] Thanks. [17:50:23] * no_justification goes back to playing Nintendo [17:50:25] :P [17:50:30] heh :) [17:29:07] PROBLEM - puppet last run on kafka2003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [17:54:07] RECOVERY - puppet last run on kafka2003 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [17:56:07] RECOVERY - puppet last run on db1095 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:46:17] RECOVERY - Check Varnish expiry mailbox lag on cp4026 is OK: OK: expiry mailbox lag is 4 [23:13:55] no_justification: thank you for your hard work :) alas, I'm going to have to be a spoilsport and report back with this error: "Cannot assign user name "ashley" to account 5555; name already in use." [23:19:01] Gerrit just doesn't like you, does it? [23:19:41] seems that way (but the feeling's been mutual for a few years, so... :P) [23:29:20] ashley i think this will be fixed in gerrit 2.14 :). We just need to re-add your external id to the database then do an online reindex for your account. [23:34:33] paladox: any idea when that's gonna be deployed, then? being able to do code review and submit patches etc. is kinda essential stuff that I can't do when I'm locked out of gerrit >.< [23:35:37] I don't have an ETA, but we have been testing with it :). [23:35:40] there's a task https://phabricator.wikimedia.org/T156120
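For the record, the fix paladox describes maps onto Gerrit 2.13's ReviewDb schema roughly as follows (a sketch with assumptions: the 'username:' external-id scheme and the admin ssh alias are illustrative, 5555 is simply the account id quoted in the error above, and the gsql command must be enabled for administrators):

    # re-add the missing external id row for the account
    ssh -p 29418 admin@gerrit.wikimedia.org gerrit gsql \
      -c "INSERT INTO account_external_ids (account_id, external_id) VALUES (5555, 'username:ashley')"

    # drop the stale cache entry, then reindex the account so the
    # secondary index agrees with the database again
    ssh -p 29418 admin@gerrit.wikimedia.org gerrit flush-caches --cache accounts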