[01:05:32] (03PS45) 10TerraCodes: $wmf* -> $wmg* [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392184 (https://phabricator.wikimedia.org/T45956) [01:29:29] (03PS1) 10Subramanya Sastry: Update linter stats for commonswiki less frequently [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404233 (https://phabricator.wikimedia.org/T184280) [02:11:10] 10Operations, 10DBA, 10Release-Engineering-Team, 10cloud-services-team, 10Chinese-Sites: Move some wikis to s5 - https://phabricator.wikimedia.org/T184805#3899755 (10Shizhao) [02:31:04] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.15) (duration: 07m 50s) [02:31:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:25:52] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 762.29 seconds [03:55:52] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 255.11 seconds [04:12:40] 10Operations, 10DBA, 10Release-Engineering-Team, 10cloud-services-team: Move some wikis to s5 - https://phabricator.wikimedia.org/T184805#3899801 (10Liuxinyu970226) @Shizhao I'm sorry but Jcrespo hasn't decision about zhwiki here [05:24:26] (03CR) 10Legoktm: [C: 031] Update linter stats for commonswiki less frequently [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404233 (https://phabricator.wikimedia.org/T184280) (owner: 10Subramanya Sastry) [06:13:08] !log Deploy schema change on db1070 (s5 master) - T174569 [06:13:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:13:21] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 [06:15:49] 10Operations, 10ops-codfw, 10DBA: Failed BBU on db2033 (x1 master) - https://phabricator.wikimedia.org/T184888#3899829 (10Marostegui) [06:16:16] 10Operations, 10ops-codfw, 10DBA: Failed BBU on db2033 (x1 master) - https://phabricator.wikimedia.org/T184888#3899841 (10Marostegui) p:05Triage>03Normal 
[06:16:45] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2036 - https://phabricator.wikimedia.org/T184836#3899843 (10Marostegui) p:05Triage>03Normal [06:29:33] 10Operations, 10ops-codfw, 10DBA: Failed BBU on db2033 (x1 master) - https://phabricator.wikimedia.org/T184888#3899844 (10Papaul) @Marostegui sorry but we don't have any used BBU from a decommissioned host that we can use . (we have no decommissioned HP servers) [06:34:36] 10Operations, 10ops-codfw: Degraded RAID on ms-be2023 - https://phabricator.wikimedia.org/T184787#3899845 (10Papaul) p:05Triage>03Normal [06:35:34] 10Operations, 10ops-codfw, 10DBA: Failed BBU on db2033 (x1 master) - https://phabricator.wikimedia.org/T184888#3899858 (10Marostegui) Thanks @Papaul - I have checked the hosts that will soon be decommissioned and none of them are HP. @RobH any ideas on what can we do about this? [06:36:55] 10Operations, 10ops-codfw: mw2140 unresponsive, mgmt not accessible - https://phabricator.wikimedia.org/T184788#3899860 (10Papaul) p:05Triage>03Normal [06:37:29] (03PS1) 10Marostegui: db-eqiad.php: Depool db1065 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404236 (https://phabricator.wikimedia.org/T162807) [06:38:15] (03PS2) 10Marostegui: db-eqiad.php: Depool db1065 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404236 (https://phabricator.wikimedia.org/T162807) [06:40:47] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1065 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404236 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui) [06:42:21] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1065 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404236 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui) [06:42:36] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1065 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404236 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui) [06:44:13] !log marostegui@tin Synchronized 
wmf-config/db-eqiad.php: Depool db1065 to fix data drifts on the archive table - T162807 (duration: 01m 13s) [06:44:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:44:26] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807 [06:52:15] (03PS1) 10Marostegui: db1065.yaml: Update socket location [puppet] - 10https://gerrit.wikimedia.org/r/404237 (https://phabricator.wikimedia.org/T148507) [06:52:17] !log Upgrade MariaDB on db1065 [06:52:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:56:34] (03CR) 10Marostegui: [C: 032] db1065.yaml: Update socket location [puppet] - 10https://gerrit.wikimedia.org/r/404237 (https://phabricator.wikimedia.org/T148507) (owner: 10Marostegui) [07:00:21] PROBLEM - Check Varnish expiry mailbox lag on cp4021 is CRITICAL: CRITICAL: expiry mailbox lag is 2110620 [07:01:36] (03CR) 10Gergő Tisza: [C: 031] Stop rewriting m.wikipedia.org and zero.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/404158 (https://phabricator.wikimedia.org/T69015) (owner: 10Mholloway) [07:11:52] !log Deploy schema change on silver (labswiki) and labtestweb2001 (labtestwiki) - T174569 [07:12:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:12:05] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 [07:22:38] (03PS1) 10Marostegui: db-eqiad.php: Replace db1063 with db1087 as vslow [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404258 (https://phabricator.wikimedia.org/T184397) [07:23:54] (03CR) 10jerkins-bot: [V: 04-1] db-eqiad.php: Replace db1063 with db1087 as vslow [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404258 (https://phabricator.wikimedia.org/T184397) (owner: 10Marostegui) [07:24:23] (03PS2) 10Marostegui: db-eqiad.php: Replace db1063 with db1087 as vslow [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404258 
(https://phabricator.wikimedia.org/T184397) [07:24:59] (03PS3) 10Marostegui: db-eqiad.php: Replace db1063 with db1087 as vslow [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404258 (https://phabricator.wikimedia.org/T184397) [07:27:11] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Replace db1063 with db1087 as vslow [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404258 (https://phabricator.wikimedia.org/T184397) (owner: 10Marostegui) [07:28:48] (03Merged) 10jenkins-bot: db-eqiad.php: Replace db1063 with db1087 as vslow [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404258 (https://phabricator.wikimedia.org/T184397) (owner: 10Marostegui) [07:28:58] (03CR) 10jenkins-bot: db-eqiad.php: Replace db1063 with db1087 as vslow [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404258 (https://phabricator.wikimedia.org/T184397) (owner: 10Marostegui) [07:30:29] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Replace db1063 with db1087 as vslow in s8 (duration: 01m 12s) [07:30:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:38:00] RECOVERY - Check Varnish expiry mailbox lag on cp4026 is OK: OK: expiry mailbox lag is 0 [07:40:40] (03PS5) 10Giuseppe Lavagetto: wmflib: simplify the role() function, convert to the new API [puppet] - 10https://gerrit.wikimedia.org/r/402345 [07:41:54] <_joe_> !log disabling puppet in all of production before merging https://gerrit.wikimedia.org/r/402345 [07:42:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:47:04] (03CR) 10Giuseppe Lavagetto: [C: 032] wmflib: simplify the role() function, convert to the new API [puppet] - 10https://gerrit.wikimedia.org/r/402345 (owner: 10Giuseppe Lavagetto) [07:50:06] <_joe_> !log forcing puppet run on the puppetmasters to force pluginsync for function change [07:50:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:55] <_joe_> !log reenabling puppet on all systems where it was 
previously enabled, after various testing [07:59:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:24] (03PS2) 10Giuseppe Lavagetto: hiera: port nuyaml to hiera 3 [puppet] - 10https://gerrit.wikimedia.org/r/402346 [08:11:39] !log lvs400[56]: upgrade to latest jessie point release (8.10) T182656 and linux kernel 4.9.65-3+deb9u1~bpo8+2 (KPTI) T184267 [08:11:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:52] T182656: Integrate jessie 8.10 point release - https://phabricator.wikimedia.org/T182656 [08:15:35] !log rebooting terbium for kernel security update [08:15:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:55] terbium is back up [08:20:21] RECOVERY - Check Varnish expiry mailbox lag on cp4021 is OK: OK: expiry mailbox lag is 0 [08:22:56] !log rebooting bast1001 for kernel security update [08:23:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:01] bast1001 is back up [08:29:03] ºo/ [08:29:07] \o/ [08:29:32] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1945 bytes in 0.099 second response time [08:38:52] PROBLEM - MariaDB Slave Lag: x1 on dbstore2002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 337.40 seconds [08:39:03] PROBLEM - MariaDB Slave Lag: x1 on db2033 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 336.17 seconds [08:40:47] that looks bad [08:41:13] it is the BBU [08:41:25] and the raid policy is in WT [08:41:32] https://phabricator.wikimedia.org/T184888 [08:42:27] !log reboot wezen for kernel security update [08:42:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:55] I want to RESET SLAVE ALL on db1031 [08:44:05] ok with that? 
[08:44:30] sounds good [08:44:56] !log disconnecting codfw -> eqiad replication for x1 [08:45:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:37] is it just me or the load on x1 has increased, too [08:47:51] https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1031&var-port=9104 [08:49:18] but just now, no? [08:49:21] (per that graph) [08:50:00] well, when it started lagging [08:50:15] and probably why it did [08:51:57] the spike looks gone now [08:52:05] (the write spike I mean) [08:52:12] cool, maybe slave catches up [08:59:31] <_joe_> anyone looking at the wikidata dispatch lag? [08:59:51] no, we were busy with the hw issue [09:00:05] <_joe_> yeah not you, I meant other opsens [09:00:47] pattern not found [09:00:54] jynus: we can force db2033 to be WB, but I would prefer not to do so [09:00:57] that looks like a configuration/content problem [09:00:57] <_joe_> yeah scratch that message [09:01:17] <_joe_> jynus: nope, it's not. It's lagging, the alert is misleading [09:01:32] marostegui: let's not do it if it doesn't create user problems [09:01:45] the wikidata.org lag is probably due to the terbium reboot, it probably needs to catch up [09:01:53] Agreed, let's see if r*bh has some ideas about how we can replace the BBU [09:01:54] ah, that would explain it [09:02:18] <_joe_> moritzm: I'd expect it would in more than half an hour [09:03:23] did terbium boot back up, how long ago? 
[09:03:27] not sure, Amir mentioned that less than 10 mins of non-availability of terbium should be fine and the reboot took maybe two [09:03:50] jynus: yeah, it came back up at 8:18 UTC [09:03:56] and took maybe two mins [09:04:25] then the check is probably not well puppetized/depending on something stateful [09:05:14] not the check, probably the dispatch itself [09:05:20] https://grafana.wikimedia.org/dashboard/db/wikidata-dispatch?refresh=1m&panelId=5&fullscreen&orgId=1 [09:05:53] if that is the load of terbium probably not very reliable [09:06:09] https://grafana.wikimedia.org/dashboard/db/wikidata-dispatch?refresh=1m&panelId=1&fullscreen&orgId=1 is worrying [09:06:29] Amir1 ^ [09:06:44] I just got here [09:06:56] sorry let me take a look [09:07:27] !log reboot aqs1004 for kernel updates [09:07:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:19] jynus: It's definitely related to terbium restart [09:08:46] I think we have a similar problem to last time, the way to fix it is to remove the lock in redis [09:08:52] there are 4 crons running [09:08:58] maybe blocked? [09:09:26] <_joe_> Amir1: this is ridiculous and needs to be fixed as a UBN ticket [09:09:34] <_joe_> Amir1: what lock on redis? [09:09:43] <_joe_> and yes, the lag is indeed very high [09:09:54] !log upgrading Zuul on contint1001 | https://gerrit.wikimedia.org/r/#/c/356181/ [09:10:00] let me fix it ASAP [09:10:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:09] <_joe_> Amir1: which lock on redis? [09:10:16] I am going to guess it is the thing that used to be on the wikidata masters and I asked it to be removed [09:10:23] <_joe_> we're creating endless locks?
[09:10:38] jynus: exactly [09:10:54] _joe_: no, they have expiry but I think it's 2 hours [09:11:02] <_joe_> jynus: the problem is not the storage medium, it's the idea of an endless lock [09:11:03] let me check config [09:11:15] !log upgrading Zuul on contint2001 (zuul-merger) | https://gerrit.wikimedia.org/r/#/c/356181/ [09:11:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:29] <_joe_> Amir1: the real problem here, imho, is that we're still running this thing as concurrent crons [09:13:36] <_joe_> instead than as a proper service [09:14:10] _joe_: would love to help getting it done [09:14:16] <_joe_> if you had a proper service, with shutdown handlers, it would've removed the lock and stop on a reboot [09:14:29] ok, let's focus on the current issue [09:14:42] is there something we can do to fix it now? [09:14:57] the only thing for me is the place it should go, otherwise I think it's not that hard to implement [09:15:00] <_joe_> jynus: I think the current issue is known and Amir1 has a solution at hand [09:15:10] do you? [09:15:24] jynus: two ways: 1- we do nothing until the lock expire [09:15:37] 2- clean the locks from redis lock manager [09:15:52] can we do #2? [09:15:54] I'm looking in more depth [09:16:06] jynus: yup, done it before [09:16:10] addshore did [09:19:09] I also can do it, just need to find the related mediawiki code to do in eval.php [09:20:21] something seems to be being executed looking at the graphs [09:20:50] yeah but since most of them are locked, they think, there is nothing to do [09:24:07] _joe_: can you flush out all keys in redis that start with 'Wikibase.wikidatawiki.dispatchChanges'? 
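The exchange above turns on two properties of the dispatch locks: they expire only after a long TTL (about two hours, per Amir1), and nothing releases them when the host running the crons is rebooted. The following is a minimal Python sketch of that behaviour, not the actual MediaWiki Redis lock manager. It uses an in-memory stand-in, and the `TtlLockStore` class, its methods, `dispatch_once`, and the `.enwiki` key suffix are all hypothetical names chosen for illustration.

```python
import time


class TtlLockStore:
    """In-memory stand-in for a lock store whose keys expire (hypothetical)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._expiry = {}  # lock key -> time at which the lock expires

    def acquire(self, key, ttl_seconds):
        """Take the lock unless an unexpired holder already exists."""
        now = self._clock()
        expiry = self._expiry.get(key)
        if expiry is not None and expiry > now:
            return False
        self._expiry[key] = now + ttl_seconds
        return True

    def release(self, key):
        """Give the lock back; releasing a missing lock is a no-op."""
        self._expiry.pop(key, None)

    def release_prefix(self, prefix):
        """Drop every lock whose key starts with prefix; return the count."""
        matching = [k for k in self._expiry if k.startswith(prefix)]
        for k in matching:
            del self._expiry[k]
        return len(matching)


def dispatch_once(store, key, ttl_seconds, work):
    """Run work() under the lock and release it on normal exit or exception."""
    if not store.acquire(key, ttl_seconds):
        return False  # a live or stale lock is in the way
    try:
        work()
    finally:
        store.release(key)
    return True
```

A lock taken by a process that dies without releasing it blocks `dispatch_once` until the TTL runs out, which corresponds to option 1 in the discussion; `release_prefix` corresponds to option 2. The `finally` clause only covers normal exits and exceptions, so a long-running service would presumably also need to handle a termination signal to release its lock on reboot.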
[09:24:14] *looks up* [09:24:35] It seems the locks are getting expired [09:24:43] <_joe_> Amir1: I think you can, yes, but since I do see some changes being dispatched, I'd first need to kill the applications [09:24:44] Yeh, I have some pending changes to change the lock ttl [09:24:51] And change the lock manager [09:25:11] Amir1: I need to get 3 or so patches merged in wikibase first [09:25:15] <_joe_> addshore: which redis servers does this connect to? [09:25:24] *opens laptop* [09:25:41] The lack manager is defined in filebacked.php [09:25:44] *lock [09:25:46] <_joe_> because it seems it doesn't go via nutcracker [09:26:04] <_joe_> oh you mean the traditional redis lock manager in mediawiki [09:26:08] https://github.com/wikimedia/operations-mediawiki-config/blob/1290e7fcffd6dd8834ee1a85a378aa3646a88e6a/wmf-config/filebackend.php [09:26:10] <_joe_> if only I knew :P [09:26:11] yup [09:26:36] we *abuse* file lock manager for dispatching [09:26:53] I have patches up to switch to a different lock manager with a different ttl [09:27:12] not sure if we have a ticket for it as I was just doing it because I thought it should be done [09:27:26] _joe_: it seems it's getting back up: https://grafana.wikimedia.org/dashboard/db/wikidata-dispatch?refresh=1m&panelId=1&fullscreen&orgId=1 [09:27:43] honestly, at this point I would leave it as it is [09:27:46] Should we let it recover naturally or flush out every thing? [09:27:52] <_joe_> yeah me too [09:27:54] it could create an overload on the databases? 
[09:28:05] <_joe_> I'm seeing quite a few threads able to work [09:28:08] if it starts applying changes too fast [09:28:13] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1939 bytes in 0.072 second response time [09:28:20] https://gerrit.wikimedia.org/r/#/c/395967 https://gerrit.wikimedia.org/r/#/c/395969 https://gerrit.wikimedia.org/r/397535 https://gerrit.wikimedia.org/r/397536 [09:28:36] <_joe_> addshore: I'd be more interested in transforming this into a proper production service [09:28:40] the other issue is the alert- something went wrong on the check [09:28:48] <_joe_> instead of a list of ever-overlapping cron scripts [09:28:51] jynus: I highly doubt that, at this time there are so few changes that even if they are held up they don't create much of an issue [09:28:56] <_joe_> jynus: what went wrong? [09:29:01] maybe it checks redis and it couldn't get the info if nothing is happening? [09:29:06] _joe_: me too, this was just to avoid us having these locks occasionally last for 2 hours and confuse everyone :) [09:29:14] <_joe_> eheh ok [09:29:15] _joe_: the graphs kept working [09:29:23] <_joe_> jynus: what didn't work, sorry? [09:29:25] showing the right dispatch [09:29:35] but the check for wikidata dispatch didn't say "high lag" [09:29:37] <_joe_> the "pattern not found" is what the alert, written like it is now [09:29:45] it said that [09:29:49] <_joe_> will show you when the lag is > 300 seconds [09:29:57] <_joe_> it's by design, it's not broken [09:30:02] really? [09:30:03] <_joe_> one can say it's a poor design [09:30:17] <_joe_> yes, it uses check_http under the hood [09:30:24] I don't get it [09:30:33] what pattern are we talking about?
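The "pattern not found" behaviour can be sketched in Python, assuming the expression is the one _joe_ pastes at 09:32 with its two halves joined and the stray backslash treated as a line continuation. The regex only matches lag values from 0 to 300, so a higher lag means no match at all, and a pattern-matching HTTP check can only report that the pattern was not found. The function name and the JSON fragments below are hypothetical; the real API response contains more fields.

```python
import re

# Pattern from the log, joined into one expression (the backslash between
# the two pasted halves is assumed to be a line continuation and dropped).
LAG_PATTERN = re.compile(r'"median":[^}]*"lag":([1-2]?[0-9]?[0-9]|300),')


def check_dispatch_lag(body):
    """Mimic a pattern-match HTTP check: OK only if the regex matches."""
    return "HTTP OK" if LAG_PATTERN.search(body) else "pattern not found"


# Hypothetical response fragments for illustration only.
healthy = '{"dispatch":{"median":{"id":1,"lag":120,"touched":"t"}}}'
lagging = '{"dispatch":{"median":{"id":1,"lag":400,"touched":"t"}}}'

print(check_dispatch_lag(healthy))  # HTTP OK
print(check_dispatch_lag(lagging))  # pattern not found
```

Because the measured lag is never extracted as a number, the alert text cannot state how high the lag is, which is the design weakness the participants go on to discuss.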
[09:31:01] <_joe_> jynus: it requests https://www.wikidata.org/w/api.php?action=query&meta=siteinfo&format=json&siprop=statistics [09:31:18] ok, then the problem is the response there [09:31:18] <_joe_> and then parses it with a regex to see if the lag is 300 seconds or less [09:31:22] not the check [09:31:33] <_joe_> if it's not, say it's 400, it says "pattern not found" [09:31:43] <_joe_> no, the problem is the check that is abysmal [09:31:53] the response should say something more explicit [09:32:01] <_joe_> --ereg '"median":[^}]*"lag":([\ [09:32:03] <_joe_> 1-2]?[0-9]?[0-9]|300),' [09:32:08] <_joe_> this is what it does [09:32:11] <_joe_> sigh :P [09:32:12] in fact, there is no reason why it shouldn't show the dispatch lag [09:32:22] grafana does [09:32:34] we could even set an alert based on grafana [09:32:39] you know what I mean? [09:32:53] hmm, I thought it was based on grafana... [09:32:55] !log reboot kafka2001 for kernel updates [09:33:05] at least, when I wrote the original check it was based on graphite data [09:33:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:09] jynus: you can make an alert on grafana [09:33:12] http_check may be doing the right thing, but the response/check don't [09:33:14] that seems easier [09:33:31] so, as a follow up [09:33:41] and this is only my opinion [09:33:54] I would like to see 2 tickets, one about app architecture [09:34:03] oh wait, or was that for wdqs lag.. 
hmmm [09:34:08] and get some ops involved [09:34:16] and the other about the check [09:34:26] jynus: I believe there are already tickets about the dispatching architecture / overhaul, and there have been for some time [09:34:32] but this is just the opinion of someone that has very little involvement [09:34:48] (03PS1) 10Filippo Giunchedi: restbase: reprovision restbase1017 [puppet] - 10https://gerrit.wikimedia.org/r/404262 (https://phabricator.wikimedia.org/T184100) [09:35:09] well, I complained when it was on the wikidata masters, but I am not 100% sure this is better [09:35:15] also, it needs some redundancy [09:35:36] https://phabricator.wikimedia.org/T178652 is the ticket regarding the current lock manager timeout [09:35:42] terbium is a SPOF- we use it for obvious maintenance [09:36:13] and we can put it down and set another server, but that is thought for things like generating special pages [09:36:31] where time is not a huge issue [09:36:52] Here is an epic covering dispatching in general https://phabricator.wikimedia.org/T108944 [09:37:33] Parent ticket about using the jobqueue instead of the current system https://phabricator.wikimedia.org/T48643 [09:37:40] addshore: I do not think that covers everything I said, especially architecture and redundancy [09:38:02] those are software concerns, mine are mostly related to architecture and hardware [09:38:11] (03PS4) 10Ema: vcl: add hash function name to CHACHA20-POLY1305 cipher [puppet] - 10https://gerrit.wikimedia.org/r/398311 [09:38:31] so architecture not as in software architecture? [09:39:19] e.g. where should we run those from? how to set up a standalone service? how to provide high availability? how to improve alerts? [09:40:15] well, if it uses the jobqueue most of that is redundant? 
it would need to be a standalone service, it would run from the job runners, alerts, sure, could probably be improved, but there is not much to alert about than high lag [09:40:51] You guys think I can deploy an schema change on wikidata on codfw (not active dc)? or shall I wait? [09:41:37] Everything is back to normal AFAIK [09:42:31] I like using jobqueue as it reduces the hassle of dispatching for third parties (=my localhost) [09:43:16] (03CR) 10Mobrovac: [C: 031] restbase: reprovision restbase1017 [puppet] - 10https://gerrit.wikimedia.org/r/404262 (https://phabricator.wikimedia.org/T184100) (owner: 10Filippo Giunchedi) [09:44:04] Amir1: Thanks - I will deploy then [09:45:14] !log Deploy schema change on s8 codfw master (db2045) with replication (this will generate lag on s8 codfw) - T174569 [09:45:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:27] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 [09:46:07] (03CR) 10Filippo Giunchedi: [C: 032] restbase: reprovision restbase1017 [puppet] - 10https://gerrit.wikimedia.org/r/404262 (https://phabricator.wikimedia.org/T184100) (owner: 10Filippo Giunchedi) [09:51:09] (03CR) 10Giuseppe Lavagetto: [C: 04-2] "do not merge until we use hiera 3.x in production as well." [puppet] - 10https://gerrit.wikimedia.org/r/402346 (owner: 10Giuseppe Lavagetto) [09:52:54] (03PS1) 10Marostegui: db-eqiad.php: Depool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404264 (https://phabricator.wikimedia.org/T162807) [09:55:42] (03CR) 10Volans: "Few minor comments inline." 
(036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/403574 (https://phabricator.wikimedia.org/T183999) (owner: 10Thcipriani) [09:55:44] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404264 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui) [09:57:09] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404264 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui) [09:57:19] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404264 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui) [09:58:25] !log rolling reboots of aqs hosts (1005->1009) for kernel updates [09:58:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:48] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1089 - T162807 (duration: 01m 09s) [09:59:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:01] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807 [10:06:05] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Please just limit your work to adding the new servers. Reordering can be done in a logical way at a later time." 
[puppet] - 10https://gerrit.wikimedia.org/r/403928 (https://phabricator.wikimedia.org/T165519) (owner: 10Elukey) [10:06:26] (03CR) 10Ema: [C: 032] vcl: add hash function name to CHACHA20-POLY1305 cipher [puppet] - 10https://gerrit.wikimedia.org/r/398311 (owner: 10Ema) [10:06:33] (03PS5) 10Ema: vcl: add hash function name to CHACHA20-POLY1305 cipher [puppet] - 10https://gerrit.wikimedia.org/r/398311 [10:06:36] (03CR) 10Ema: [V: 032 C: 032] vcl: add hash function name to CHACHA20-POLY1305 cipher [puppet] - 10https://gerrit.wikimedia.org/r/398311 (owner: 10Ema) [10:08:27] 10Operations, 10monitoring: Configure puppetdb to export metrics via Prometheus JMX Agent - https://phabricator.wikimedia.org/T184796#3900022 (10fgiunchedi) Given that it isn't that many metrics, I think it might be simpler to keep the standard jmx exporter configuration on the puppetdb side and drop the metri... [10:12:16] 10Operations, 10monitoring: Configure puppetdb to export metrics via Prometheus JMX Agent - https://phabricator.wikimedia.org/T184796#3900057 (10elukey) >>! In T184796#3900022, @fgiunchedi wrote: > Given that it isn't that many metrics, I think it might be simpler to keep the standard jmx exporter configuratio... 
[10:12:45] PROBLEM - proxysql processes on terbium is CRITICAL: PROCS CRITICAL: 0 processes with command name proxysql [10:12:55] (03PS4) 10Elukey: site.pp: add mw1338->48 [puppet] - 10https://gerrit.wikimedia.org/r/403928 (https://phabricator.wikimedia.org/T165519) [10:15:46] !log reboot wasat for kernel security update [10:15:51] proxysql probably doesn't start automaticaly after restart [10:15:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:02] as it is a test install [10:16:36] !log start proxysql on terbium [10:16:45] RECOVERY - proxysql processes on terbium is OK: PROCS OK: 1 process with command name proxysql [10:16:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:10] (03PS1) 10Jcrespo: mariadb: switchover s2 codfw master from db2017 to db2035 [puppet] - 10https://gerrit.wikimedia.org/r/404268 (https://phabricator.wikimedia.org/T176243) [10:20:25] (03PS1) 10Elukey: role::prometheus::ops: add puppetdb metrics [puppet] - 10https://gerrit.wikimedia.org/r/404269 (https://phabricator.wikimedia.org/T184796) [10:20:44] (03PS1) 10Jcrespo: mariadb: Switchover codfw s2 master from db2017 to db2035 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404270 (https://phabricator.wikimedia.org/T176243) [10:20:51] (03CR) 10Giuseppe Lavagetto: [C: 031] site.pp: add mw1338->48 [puppet] - 10https://gerrit.wikimedia.org/r/403928 (https://phabricator.wikimedia.org/T165519) (owner: 10Elukey) [10:22:50] !log starting codfw s2 master switchover [10:23:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:15] (03CR) 10Jcrespo: [C: 032] mariadb: switchover s2 codfw master from db2017 to db2035 [puppet] - 10https://gerrit.wikimedia.org/r/404268 (https://phabricator.wikimedia.org/T176243) (owner: 10Jcrespo) [10:27:51] !log upgrade and restart db2035 [10:28:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:46] (03CR) 10Filippo Giunchedi: [C: 
04-1] "See inline, LGTM generally" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/399826 (https://phabricator.wikimedia.org/T182215) (owner: 10Dzahn) [10:37:46] (03PS9) 10Ema: Smarter Varnish slow log [puppet] - 10https://gerrit.wikimedia.org/r/399176 (https://phabricator.wikimedia.org/T181315) (owner: 10Gilles) [10:38:00] (03CR) 10Jcrespo: [C: 032] mariadb: Switchover codfw s2 master from db2017 to db2035 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404270 (https://phabricator.wikimedia.org/T176243) (owner: 10Jcrespo) [10:38:32] (03CR) 10Ema: [C: 032] Smarter Varnish slow log [puppet] - 10https://gerrit.wikimedia.org/r/399176 (https://phabricator.wikimedia.org/T181315) (owner: 10Gilles) [10:39:54] (03Merged) 10jenkins-bot: mariadb: Switchover codfw s2 master from db2017 to db2035 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404270 (https://phabricator.wikimedia.org/T176243) (owner: 10Jcrespo) [10:40:07] (03CR) 10jenkins-bot: mariadb: Switchover codfw s2 master from db2017 to db2035 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404270 (https://phabricator.wikimedia.org/T176243) (owner: 10Jcrespo) [10:40:48] (03PS2) 10Elukey: role::prometheus::ops: add puppetdb metrics [puppet] - 10https://gerrit.wikimedia.org/r/404269 (https://phabricator.wikimedia.org/T184796) [10:44:12] RECOVERY - HHVM processes on labweb1001 is OK: PROCS OK: 6 processes with command name hhvm [10:46:41] ^labweb is nme [10:48:47] !log Upgrading zuul to 2.5.1 on contint1001 / contint2001 [10:48:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:01] (03PS3) 10Elukey: role::prometheus::ops: add puppetdb metrics [puppet] - 10https://gerrit.wikimedia.org/r/404269 (https://phabricator.wikimedia.org/T184796) [10:50:03] !log reboot kafka2002 for kernel updates [10:50:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:42] (03CR) 10Filippo Giunchedi: [C: 032] role::prometheus::ops: add puppetdb 
metrics [puppet] - 10https://gerrit.wikimedia.org/r/404269 (https://phabricator.wikimedia.org/T184796) (owner: 10Elukey) [10:51:02] !log jynus@tin Synchronized wmf-config/db-codfw.php: Switchover codfw s2 master from db2017 to db2035 (duration: 01m 12s) [10:51:03] !log Upgrading zuul to 2.5.1 on contint1001 / contint2001 | T158243 [10:51:03] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1089" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404274 [10:51:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:23] T158243: Update zuul to upstream master - https://phabricator.wikimedia.org/T158243 [10:51:27] !log s2 codfw master swithover finished [10:51:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:57] !log lowering disk watermark on elasticsearch eqiad to shuffle shards around [10:52:58] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1089" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404274 (owner: 10Marostegui) [10:53:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:33] PROBLEM - DPKG on graphite1002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [10:54:18] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1089" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404274 (owner: 10Marostegui) [10:55:41] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1089 - T162807 (duration: 01m 12s) [10:55:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:51] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807 [10:56:38] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404275 (https://phabricator.wikimedia.org/T128546) [10:56:47] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1089" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/404274 (owner: 10Marostegui) [10:57:58] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1065" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404276 [10:58:01] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1065" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404276 [11:00:04] jan_drewniak: It is that lovely time of the day again! You are hereby commanded to deploy Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180115T1100). [11:00:04] No GERRIT patches in the queue for this window AFAICS. [11:00:48] (03CR) 10Jdrewniak: [C: 032] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404275 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [11:02:17] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404275 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [11:02:29] (03CR) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404275 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [11:04:10] (03PS1) 10Filippo Giunchedi: restbase: don't consider sde for restbase1017 [puppet] - 10https://gerrit.wikimedia.org/r/404277 (https://phabricator.wikimedia.org/T184100) [11:04:49] (03CR) 10Filippo Giunchedi: [C: 032] restbase: don't consider sde for restbase1017 [puppet] - 10https://gerrit.wikimedia.org/r/404277 (https://phabricator.wikimedia.org/T184100) (owner: 10Filippo Giunchedi) [11:08:06] !log bootstrap cassandra-a on restbase1017 [11:08:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:41] !log jdrewniak@tin Synchronized portals/prod/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:402805|Bumping portals to master (T128546)]] (duration: 01m 14s) [11:09:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:53] 
T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [11:10:56] !log jdrewniak@tin Synchronized portals: Wikimedia Portals Update: [[gerrit:402805|Bumping portals to master (T128546)]] (duration: 01m 14s) [11:11:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:10] RECOVERY - DPKG on restbase1011 is OK: All packages OK [11:11:10] RECOVERY - cassandra-c service on restbase1011 is OK: OK - cassandra-c is active [11:11:20] RECOVERY - puppet last run on restbase1011 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [11:11:20] RECOVERY - dhclient process on restbase1011 is OK: PROCS OK: 0 processes with command name dhclient [11:11:21] RECOVERY - Check size of conntrack table on restbase1011 is OK: OK: nf_conntrack is 9 % full [11:11:21] RECOVERY - restbase endpoints health on restbase1011 is OK: All endpoints are healthy [11:11:30] RECOVERY - configured eth on restbase1011 is OK: OK - interfaces up [11:11:30] RECOVERY - MD RAID on restbase1011 is OK: OK: Active: 15, Working: 15, Failed: 0, Spare: 0 [11:11:40] RECOVERY - Check whether ferm is active by checking the default input chain on restbase1011 is OK: OK ferm input default policy is set [11:11:44] that's me ^ [11:11:50] RECOVERY - Disk space on restbase1011 is OK: DISK OK [11:11:50] RECOVERY - cassandra-a service on restbase1011 is OK: OK - cassandra-a is active [11:11:50] RECOVERY - cassandra-b service on restbase1011 is OK: OK - cassandra-b is active [11:16:40] RECOVERY - DPKG on graphite1002 is OK: All packages OK [11:18:51] RECOVERY - Check the NTP synchronisation status of timesyncd on restbase1011 is OK: OK: synced at Mon 2018-01-15 11:18:49 UTC. 
[11:21:11] (03PS2) 10Filippo Giunchedi: Scap: bump version to 3.7.6-1 [puppet] - 10https://gerrit.wikimedia.org/r/404219 (https://phabricator.wikimedia.org/T127762) (owner: 10Thcipriani) [11:21:23] !log upload scap 3.7.6-1 - T127762 [11:21:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:39] T127762: Update Debian Package for Scap3 - https://phabricator.wikimedia.org/T127762 [11:21:56] (03CR) 10Filippo Giunchedi: [C: 032] Scap: bump version to 3.7.6-1 [puppet] - 10https://gerrit.wikimedia.org/r/404219 (https://phabricator.wikimedia.org/T127762) (owner: 10Thcipriani) [11:22:30] RECOVERY - IPMI Sensor Status on restbase1011 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK [11:26:14] (03PS1) 10Ema: varnishslowlog: do not crash on empty reqheader values [puppet] - 10https://gerrit.wikimedia.org/r/404279 [11:29:25] (03CR) 10Giuseppe Lavagetto: [C: 032] site.pp: add mw1338->48 [puppet] - 10https://gerrit.wikimedia.org/r/403928 (https://phabricator.wikimedia.org/T165519) (owner: 10Elukey) [11:29:32] (03PS5) 10Giuseppe Lavagetto: site.pp: add mw1338->48 [puppet] - 10https://gerrit.wikimedia.org/r/403928 (https://phabricator.wikimedia.org/T165519) (owner: 10Elukey) [11:30:10] (03CR) 10Ema: [C: 032] varnishslowlog: do not crash on empty reqheader values [puppet] - 10https://gerrit.wikimedia.org/r/404279 (owner: 10Ema) [11:31:11] <_joe_> grr [11:31:21] (03PS6) 10Giuseppe Lavagetto: site.pp: add mw1338->48 [puppet] - 10https://gerrit.wikimedia.org/r/403928 (https://phabricator.wikimedia.org/T165519) (owner: 10Elukey) [11:32:16] 10Operations, 10monitoring, 10Patch-For-Review: Configure puppetdb to export metrics via Prometheus JMX Agent - https://phabricator.wikimedia.org/T184796#3900253 (10elukey) Started a dashboard in https://grafana-admin.wikimedia.org/dashboard/db/puppetdb [11:32:29] (03PS1) 10Alexandros Kosiaris: kubernetes: Set IPv6 accept_ra to 2 [puppet] - 10https://gerrit.wikimedia.org/r/404281 [11:33:23] (03PS3) 
10Volans: wmf-auto-reimage: improve resume capabilities [puppet] - 10https://gerrit.wikimedia.org/r/399161 (https://phabricator.wikimedia.org/T182702) [11:34:43] (03CR) 10Volans: [C: 032] wmf-auto-reimage: improve resume capabilities [puppet] - 10https://gerrit.wikimedia.org/r/399161 (https://phabricator.wikimedia.org/T182702) (owner: 10Volans) [11:36:47] !log oblivian@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=mw13(3[8-9]|4[0-9]).* [11:36:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:37] (03CR) 10Alexandros Kosiaris: [C: 04-1] "That's not the only issue, still investigating" [puppet] - 10https://gerrit.wikimedia.org/r/404281 (owner: 10Alexandros Kosiaris) [11:42:20] PROBLEM - MegaRAID on labsdb1003 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough [11:43:28] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey, 10User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3900277 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by oblivian on neodymium.eqiad.wmnet for hosts: ``` mw1338.eqiad.wmnet ``` The log can be fo... 
[11:45:18] (03CR) 10Alexandros Kosiaris: [C: 032] apertium-pt-gl: Cleanup [debs/contenttranslation/apertium-pt-gl] - 10https://gerrit.wikimedia.org/r/403116 (owner: 10KartikMistry) [11:45:20] (03CR) 10Alexandros Kosiaris: [C: 032] apertium-pt-ca: Cleanup [debs/contenttranslation/apertium-pt-ca] - 10https://gerrit.wikimedia.org/r/403109 (owner: 10KartikMistry) [11:45:22] (03CR) 10Alexandros Kosiaris: [C: 032] apertium-oc-es: Cleanup [debs/contenttranslation/apertium-oc-es] - 10https://gerrit.wikimedia.org/r/403107 (owner: 10KartikMistry) [11:45:24] (03CR) 10Alexandros Kosiaris: [C: 032] apertium-oc-ca: Cleanup [debs/contenttranslation/apertium-oc-ca] - 10https://gerrit.wikimedia.org/r/403106 (owner: 10KartikMistry) [11:48:59] !log reboot aqs1007 for kernel upgrades [11:49:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:19] ah already logged a rolling reboot, amending the sal [11:51:14] (03PS1) 10Ema: varnishslowlog: do not crash on empty respheader values [puppet] - 10https://gerrit.wikimedia.org/r/404282 [11:51:43] (03CR) 10Gilles: [C: 031] varnishslowlog: do not crash on empty respheader values [puppet] - 10https://gerrit.wikimedia.org/r/404282 (owner: 10Ema) [11:51:55] 10Operations, 10Parsoid, 10VisualEditor, 10HTTPS: Parsoid, VisualEditor not working with SSL / HTTPS - https://phabricator.wikimedia.org/T178778#3900319 (10Deskana) [11:52:14] (03CR) 10Ema: [C: 032] varnishslowlog: do not crash on empty respheader values [puppet] - 10https://gerrit.wikimedia.org/r/404282 (owner: 10Ema) [11:54:11] !log rebooting ores1* for kernel security update [11:54:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:37] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1065" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404276 (owner: 10Marostegui) [11:56:03] (03CR) 10Paladox: [C: 031] "You will want to update to gerrit 2.14.7 :)." 
[software/gerrit] - 10https://gerrit.wikimedia.org/r/395820 (https://phabricator.wikimedia.org/T156120) (owner: 10Chad) [11:57:02] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1065" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404276 (owner: 10Marostegui) [11:57:12] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1065" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404276 (owner: 10Marostegui) [11:58:59] ACKNOWLEDGEMENT - MegaRAID on labsdb1003 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough Marostegui Will be decommissioned soon https://phabricator.wikimedia.org/T184832 [11:59:43] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1065 - T162807 (duration: 01m 12s) [11:59:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:54] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807 [12:04:47] !log upgrade and restart db2017 [12:04:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:40] PROBLEM - proxysql processes on wasat is CRITICAL: PROCS CRITICAL: 0 processes with command name proxysql [12:14:40] RECOVERY - proxysql processes on wasat is OK: PROCS OK: 1 process with command name proxysql [12:20:49] (03PS1) 10Hashar: Fix FTBS when installing docs [debs/pkg-php/php-ast] - 10https://gerrit.wikimedia.org/r/404283 [12:20:51] (03PS1) 10Hashar: Rebuild for stretch-wikimedia [debs/pkg-php/php-ast] - 10https://gerrit.wikimedia.org/r/404284 (https://phabricator.wikimedia.org/T174338) [12:25:52] (03CR) 10Hashar: [C: 032] Fix FTBS when installing docs [debs/pkg-php/php-ast] - 10https://gerrit.wikimedia.org/r/404283 (owner: 10Hashar) [12:26:13] (03CR) 10Hashar: [C: 032] Rebuild for stretch-wikimedia [debs/pkg-php/php-ast] - 10https://gerrit.wikimedia.org/r/404284 (https://phabricator.wikimedia.org/T174338) (owner: 10Hashar) [12:26:29] (03CR) 10Hashar: [C: 032] Fix FTBS 
when installing docs [debs/pkg-php/php-ast] - 10https://gerrit.wikimedia.org/r/404283 (owner: 10Hashar) [12:26:34] (03CR) 10Hashar: [C: 04-2] Rebuild for stretch-wikimedia [debs/pkg-php/php-ast] - 10https://gerrit.wikimedia.org/r/404284 (https://phabricator.wikimedia.org/T174338) (owner: 10Hashar) [12:30:43] (03PS1) 10Jcrespo: mariadb: Move db2036 socket away from /tmp [puppet] - 10https://gerrit.wikimedia.org/r/404286 (https://phabricator.wikimedia.org/T148507) [12:33:09] 10Operations, 10monitoring, 10User-fgiunchedi: Better organization for ops grafana dashboards - https://phabricator.wikimedia.org/T178690#3900380 (10akosiaris) >>! In T178690#3891946, @Dzahn wrote: > We need the following new dashboards / URLs (noticed as part of T183873): > > - service cluster A overview (... [12:33:42] PROBLEM - MD RAID on mw1338 is CRITICAL: Return code of 255 is out of bounds [12:35:47] new host, silencing [12:51:03] PROBLEM - puppet last run on es2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:52:47] (03PS1) 10Gilles: Fix varnishslowlog logstash configuration [puppet] - 10https://gerrit.wikimedia.org/r/404288 [13:20:19] !log reboot kafka2003 for kernel upgrades [13:20:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:03] RECOVERY - puppet last run on es2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [13:24:00] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey, 10User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3900465 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['mw1338.eqiad.wmnet'] ``` Of which those **FAILED**: ``` ['mw1338.eqiad.wmnet'] ``` [13:26:36] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey, 10User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3900473 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by oblivian on 
neodymium.eqiad.wmnet for hosts: ``` mw1338.eqiad.wmnet ``` The log can be fo... [13:29:50] 10Operations, 10ops-codfw, 10DBA: Failed BBU on db2033 (x1 master) - https://phabricator.wikimedia.org/T184888#3900477 (10Marostegui) Maybe we should force this host to be WB even without the BBU to make sure it catches up: https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=codfw%20prometheus%2F... [13:30:31] (03CR) 10Hashar: [C: 032] "recheck" [debs/pkg-php/php-ast] - 10https://gerrit.wikimedia.org/r/404283 (owner: 10Hashar) [13:31:36] 10Operations, 10Cloud-Services, 10cloud-services-team: labvirt1021-1022 spam the dhcp server with requests - https://phabricator.wikimedia.org/T184909#3900478 (10Joe) [13:31:38] (03Merged) 10jenkins-bot: Fix FTBS when installing docs [debs/pkg-php/php-ast] - 10https://gerrit.wikimedia.org/r/404283 (owner: 10Hashar) [13:35:23] PROBLEM - HHVM rendering on mw2217 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:36:13] RECOVERY - HHVM rendering on mw2217 is OK: HTTP OK: HTTP/1.1 200 OK - 79216 bytes in 0.296 second response time [13:38:00] (03CR) 10Hashar: [C: 04-2] "recheck" [debs/pkg-php/php-ast] - 10https://gerrit.wikimedia.org/r/404284 (https://phabricator.wikimedia.org/T174338) (owner: 10Hashar) [13:38:48] !log reboot eventlog1001 for kernel updates [13:38:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:21] (03CR) 10jerkins-bot: [V: 04-1] Rebuild for stretch-wikimedia [debs/pkg-php/php-ast] - 10https://gerrit.wikimedia.org/r/404284 (https://phabricator.wikimedia.org/T174338) (owner: 10Hashar) [13:41:43] !log starting rolling reboot of elasticsearch / cirrus eqiad for kernel upgrade [13:41:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:02] RECOVERY - Long running screen/tmux on restbase1011 is OK: OK: No SCREEN or tmux processes detected. 
[13:45:12] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey, 10User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3900500 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by oblivian on neodymium.eqiad.wmnet for hosts: ``` ['mw1339.eqiad.wmnet', 'mw1340.eqiad.wmn... [13:52:04] 10Operations, 10cloud-services-team: Reboot of WMCS servers for meltdown kernel update - https://phabricator.wikimedia.org/T184910#3900502 (10MoritzMuehlenhoff) [13:52:26] 10Operations, 10cloud-services-team: Reboot of WMCS servers for meltdown kernel update - https://phabricator.wikimedia.org/T184910#3900512 (10MoritzMuehlenhoff) [13:55:18] hm, mwlog1001 says: 462 data error in /srv/mediawiki/php-1.31.0-wmf.15/extensions/Graph/includes/ApiGraph.php on line 125 [13:56:29] (03CR) 10Ema: [C: 032] Fix varnishslowlog logstash configuration [puppet] - 10https://gerrit.wikimedia.org/r/404288 (owner: 10Gilles) [13:59:02] (03PS1) 10Ladsgroup: Enable lua fine grained usage tracking in cawiki, cewiki, elwiki, kowiki, trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404298 (https://phabricator.wikimedia.org/T184322) [14:00:04] addshore, hashar, anomie, no_justification, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for European Mid-day SWAT(Max 8 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180115T1400). [14:00:04] kart_: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:10] o/ [14:00:16] I can SWAT today [14:00:23] (03PS2) 10Hashar: Rebuild for stretch-wikimedia [debs/pkg-php/php-ast] - 10https://gerrit.wikimedia.org/r/404284 (https://phabricator.wikimedia.org/T174338) [14:00:32] kart_: around for SWAT? 
[14:00:48] zeljkof: I added one thing to the swat, not testable :D [14:01:37] Amir1: looking... [14:01:44] (03CR) 10jerkins-bot: [V: 04-1] Rebuild for stretch-wikimedia [debs/pkg-php/php-ast] - 10https://gerrit.wikimedia.org/r/404284 (https://phabricator.wikimedia.org/T174338) (owner: 10Hashar) [14:02:42] Amir1: ok, so I deploy the patch, no mwdebug? [14:02:54] I can do it while waiting for kart_ to come :) [14:02:59] yup [14:03:06] Amir1: or do you want to deploy yourself? [14:03:25] zeljkof: I can :) [14:03:36] Amir1: please do then, go ahead [14:03:44] coool [14:03:47] I'll deploy kart_'s commit when he comes [14:04:10] (03CR) 10Ladsgroup: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404298 (https://phabricator.wikimedia.org/T184322) (owner: 10Ladsgroup) [14:05:35] !log reboot rdb* hosts in codfw for kernel security update [14:05:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:50] (03Merged) 10jenkins-bot: Enable lua fine grained usage tracking in cawiki, cewiki, elwiki, kowiki, trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404298 (https://phabricator.wikimedia.org/T184322) (owner: 10Ladsgroup) [14:06:54] (03CR) 10jenkins-bot: Enable lua fine grained usage tracking in cawiki, cewiki, elwiki, kowiki, trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404298 (https://phabricator.wikimedia.org/T184322) (owner: 10Ladsgroup) [14:07:26] zeljkof: here now. [14:07:38] Sorry for delay. [14:07:50] kart_: no problem, Amir1 is deploying, you are next, in a few minutes [14:08:06] Mine is about to finish [14:08:46] kart_: +2d 404070, waiting for CI, will ping you when the commit is at mwdebug1002 (in a few minutes) [14:08:53] !log ladsgroup@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:404298|Enable lua fine grained usage tracking in some wikis (T184322)]] (duration: 01m 14s) [14:08:58] OK! 
[14:09:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:04] T184322: Enable fine grained lua tracking gradually in client wikis - https://phabricator.wikimedia.org/T184322 [14:09:29] Mine is done, just need to monitor it from now on [14:09:50] Amir1: ok, thanks, I will take over SWAT then [14:10:05] kart_: forgot to ask, do you want to deploy your commit yourself? [14:10:39] zeljkof: no. go ahead :D [14:11:18] kart_: sure, just asking, if you would like to deploy in the future, let me know, it is not black magic :) [14:12:07] 10Operations, 10Cloud-Services, 10cloud-services-team: labvirt1021-1022 spam the dhcp server with requests - https://phabricator.wikimedia.org/T184909#3900561 (10aborrero) p:05Triage>03Normal [14:12:27] zeljkof: yes. I know! Just bit noisy here so don't want to messup something. [14:13:20] kart_: no problem, that's what #releng is for :) we mess up things regardless of the noise [14:13:47] 10Operations, 10cloud-services-team: Reboot of WMCS servers for meltdown kernel update - https://phabricator.wikimedia.org/T184910#3900568 (10aborrero) p:05Triage>03High [14:14:47] 10Operations, 10cloud-services-team: Reboot of WMCS servers for meltdown kernel update - https://phabricator.wikimedia.org/T184910#3900502 (10aborrero) Should this be merged somehow into T184189 ? [14:14:55] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/404053 (owner: 10Dzahn) [14:15:58] 10Operations, 10Cloud-VPS, 10Toolforge, 10Patch-For-Review, 10cloud-services-team (Kanban): Cloud: Labvirt and instance reboots for Meltdown - https://phabricator.wikimedia.org/T184189#3875329 (10aborrero) @MoritzMuehlenhoff reports in T184910 that there are servers just pending the reboot. Should that t... [14:19:57] kart_: the commit is at mwdebug1002, any estimate of how much time you need to test it? [14:20:30] zeljkof: 2 or 3 min. 
[14:20:46] kart_: great [14:20:53] let me know if I can deploy [14:23:49] zeljkof: testwiki is wmf16, right? [14:24:03] let me check... [14:24:43] https://tools.wmflabs.org/versions/ says 1.31.0-wmf.16 [14:25:16] ah. Checking again. [14:26:16] zeljkof: did you sync all files? [14:26:19] (03PS1) 10Filippo Giunchedi: hieradata: enable remaining restbase1017 instances [puppet] - 10https://gerrit.wikimedia.org/r/404300 (https://phabricator.wikimedia.org/T184100) [14:27:07] kart_: I ran `scap pull` at mwdebug1002, if that's what you are asking [14:27:44] okay. Wondering why patch has no effect yet. [14:27:48] the commit should be only at mwdebug1002, if that was the question [14:27:59] I did not deploy anywhere else yet [14:28:29] did you use the x-wikimedia-debug extension to test at mwdebug1002? [14:29:26] zeljkof: yes. I use that. [14:29:31] as usual. [14:30:42] kart_: I have just checked, I have ran all commands, the commit should be at mwdebug1002, I can't find any mistake I could have made [14:31:06] zeljkof: OK. Then let me try again, if that doesn't work, we will abandon patch. [14:31:16] It is not working as expected. [14:31:34] kart_: should I revert the patch? [14:31:46] (since it is already merged) [14:32:04] zeljkof: wait. Checking Nikerabbit too. [14:32:24] zeljkof: no other patches to SWAT, right? 
[14:32:34] So, we can take some time to debug :) [14:32:37] PROBLEM - DPKG on mw1340 is CRITICAL: Return code of 255 is out of bounds [14:32:37] PROBLEM - dhclient process on mw1340 is CRITICAL: Return code of 255 is out of bounds [14:32:51] kart_: this is the only patch left, so there is time :) [14:34:27] PROBLEM - Disk space on mw1340 is CRITICAL: Return code of 255 is out of bounds [14:34:27] PROBLEM - mediawiki-installation DSH group on mw1340 is CRITICAL: Host mw1340 is not in mediawiki-installation dsh group [14:36:07] PROBLEM - HHVM processes on mw1340 is CRITICAL: Return code of 255 is out of bounds [14:36:07] PROBLEM - nutcracker port on mw1340 is CRITICAL: Return code of 255 is out of bounds [14:37:28] PROBLEM - High CPU load on API appserver on mw1340 is CRITICAL: Return code of 255 is out of bounds [14:37:47] PROBLEM - HHVM rendering on mw1340 is CRITICAL: connect to address 10.64.32.52 and port 80: Connection refused [14:37:47] PROBLEM - nutcracker process on mw1340 is CRITICAL: Return code of 255 is out of bounds [14:38:18] mw1340 is a new host, silencing [14:38:23] zeljkof: we're good. [14:38:30] zeljkof: cache is the culprit. [14:39:04] kart_: ok to deploy? [14:39:19] yes. 
[14:39:30] kart_: deploying [14:40:29] (03CR) 10Alexandros Kosiaris: [C: 04-1] "This actually doesn't compile, see https://puppet-compiler.wmflabs.org/compiler02/9725/" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/392564 (owner: 10Dzahn) [14:40:42] !log zfilipin@tin Synchronized php-1.31.0-wmf.16/extensions/ContentTranslation: SWAT: [[gerrit:404070|CX1: Fix translation view UI overlaps (T184662 T184130)]] (duration: 01m 16s) [14:40:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:55] T184130: [wmf.15-regression] ContentTranslation page: the additional message not displayed correctly - https://phabricator.wikimedia.org/T184130 [14:40:55] T184662: [wmf.16 - regression] Cannot click Personal draft button - https://phabricator.wikimedia.org/T184662 [14:41:18] kart_: deployed! please check and thanks for deploying with #releng! ;) [14:41:33] zeljkof: Thanks a lot again! [14:41:45] !log EU SWAT finished [14:41:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:11] (03PS2) 10Filippo Giunchedi: hieradata: enable remaining restbase1017 instances [puppet] - 10https://gerrit.wikimedia.org/r/404300 (https://phabricator.wikimedia.org/T184100) [14:46:17] 10Operations, 10Kubernetes: Serve one production service via Kubernetes - https://phabricator.wikimedia.org/T184462#3900673 (10akosiaris) [14:46:28] !log upgrade and restart db2036 [14:46:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:02] lots of test wiki errors [14:48:18] at 14:25 [14:48:33] (03CR) 10Jcrespo: [C: 032] mariadb: Move db2036 socket away from /tmp [puppet] - 10https://gerrit.wikimedia.org/r/404286 (https://phabricator.wikimedia.org/T148507) (owner: 10Jcrespo) [14:48:38] (03PS2) 10Jcrespo: mariadb: Move db2036 socket away from /tmp [puppet] - 10https://gerrit.wikimedia.org/r/404286 (https://phabricator.wikimedia.org/T148507) [14:51:53] (03CR) 10Filippo Giunchedi: [C: 032] hieradata: enable 
remaining restbase1017 instances [puppet] - 10https://gerrit.wikimedia.org/r/404300 (https://phabricator.wikimedia.org/T184100) (owner: 10Filippo Giunchedi) [14:51:57] (03PS3) 10Filippo Giunchedi: hieradata: enable remaining restbase1017 instances [puppet] - 10https://gerrit.wikimedia.org/r/404300 (https://phabricator.wikimedia.org/T184100) [14:52:16] 10Operations, 10Kubernetes: Serve at least 50% of Mathoid via kubernetes - https://phabricator.wikimedia.org/T184919#3900700 (10akosiaris) [14:52:31] 10Operations, 10Kubernetes: Serve one production service via Kubernetes - https://phabricator.wikimedia.org/T184462#3900713 (10akosiaris) [14:52:33] 10Operations, 10Kubernetes: Serve at least 50% of Mathoid via kubernetes - https://phabricator.wikimedia.org/T184919#3900712 (10akosiaris) [14:54:49] (03PS1) 10Jcrespo: mariadb: Move db2043 socket away from /tmp [puppet] - 10https://gerrit.wikimedia.org/r/404301 (https://phabricator.wikimedia.org/T148507) [14:57:33] (03PS2) 10Jcrespo: mariadb: Move db2043 socket away from /tmp [puppet] - 10https://gerrit.wikimedia.org/r/404301 (https://phabricator.wikimedia.org/T148507) [14:57:39] (03CR) 10Jcrespo: [C: 032] mariadb: Move db2043 socket away from /tmp [puppet] - 10https://gerrit.wikimedia.org/r/404301 (https://phabricator.wikimedia.org/T148507) (owner: 10Jcrespo) [14:58:05] !log upgrade and restart db2043 [14:58:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:10] PROBLEM - puppet last run on restbase1017 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 9 seconds ago with 1 failures. 
Failed resources (up to 3 shown): Service[cassandra-c] [15:01:49] PROBLEM - cassandra-c service on restbase1017 is CRITICAL: NRPE: Command check_cassandra-c-state not defined [15:02:31] (03PS1) 10Jcrespo: mariadb: Move db2050 socket away from /tmp [puppet] - 10https://gerrit.wikimedia.org/r/404302 (https://phabricator.wikimedia.org/T148507) [15:07:48] yes yes [15:08:10] !log upgrade and restart db2050 [15:08:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:44] (03CR) 10Jcrespo: [C: 032] mariadb: Move db2050 socket away from /tmp [puppet] - 10https://gerrit.wikimedia.org/r/404302 (https://phabricator.wikimedia.org/T148507) (owner: 10Jcrespo) [15:11:59] PROBLEM - cassandra-b CQL 10.64.32.131:9042 on restbase1017 is CRITICAL: connect to address 10.64.32.131 and port 9042: Connection refused [15:15:29] PROBLEM - cassandra-b service on restbase1017 is CRITICAL: NRPE: Command check_cassandra-b-state not defined [15:17:09] PROBLEM - cassandra-c CQL 10.64.32.132:9042 on restbase1017 is CRITICAL: connect to address 10.64.32.132 and port 9042: Connection refused [15:18:50] PROBLEM - cassandra-c SSL 10.64.32.132:7001 on restbase1017 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [15:19:29] 10Operations, 10Kubernetes: Validate whether the (implemented) standardized application environment works as expected - https://phabricator.wikimedia.org/T184923#3900773 (10akosiaris) [15:22:00] 10Operations, 10Kubernetes: Utilize the deployment pipeline (stretch) - https://phabricator.wikimedia.org/T184924#3900786 (10akosiaris) [15:22:31] 10Operations, 10Kubernetes: Serve one production service via Kubernetes - https://phabricator.wikimedia.org/T184462#3900799 (10akosiaris) [15:22:33] 10Operations, 10Kubernetes: Utilize the deployment pipeline (stretch) - https://phabricator.wikimedia.org/T184924#3900798 (10akosiaris) [15:23:21] 10Operations, 10Kubernetes: Serve one production service via Kubernetes - 
https://phabricator.wikimedia.org/T184462#3883571 (10akosiaris) [15:23:23] 10Operations, 10Kubernetes: Validate whether the (implemented) standardized application environment works as expected - https://phabricator.wikimedia.org/T184923#3900800 (10akosiaris) [15:23:31] 10Operations, 10Kubernetes: Validate whether the (implemented) standardized application environment works as expected - https://phabricator.wikimedia.org/T184923#3900773 (10akosiaris) p:05Triage>03Normal [15:23:39] 10Operations, 10Kubernetes: Serve one production service via Kubernetes - https://phabricator.wikimedia.org/T184462#3883571 (10akosiaris) p:05Triage>03Normal [15:23:49] 10Operations, 10Kubernetes: Serve at least 50% of Mathoid via kubernetes - https://phabricator.wikimedia.org/T184919#3900700 (10akosiaris) p:05Triage>03Normal [15:23:58] 10Operations, 10Kubernetes: Utilize the deployment pipeline (stretch) - https://phabricator.wikimedia.org/T184924#3900786 (10akosiaris) p:05Triage>03Normal [15:25:52] (03PS1) 10Giuseppe Lavagetto: profile::base: configure apt before installing any package [puppet] - 10https://gerrit.wikimedia.org/r/404304 [15:25:58] <_joe_> volans: ^^ [15:31:05] (03CR) 10Faidon Liambotis: apt: unattended-upgrades: add targetted upgrades script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/398079 (https://phabricator.wikimedia.org/T181647) (owner: 10Arturo Borrero Gonzalez) [15:33:30] !log upgrade and restart db2057 [15:33:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:54] (03PS1) 10Giuseppe Lavagetto: profile::base: run the apt configuration before anything else [puppet] - 10https://gerrit.wikimedia.org/r/404305 [15:41:33] _joe_: I think it's like the 3rd time this has been proposed :P [15:41:43] I think you've proposed it before too! 
[15:41:45] it won't work [15:42:33] <_joe_> paravoid: in reality, it can work with some grease around the wheels, and we're having a pretty serious bug during the first installation we need to fix [15:42:54] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey, 10User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3900879 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['mw1338.eqiad.wmnet'] ``` Of which those **FAILED**: ``` ['mw1338.eqiad.wmnet'] ``` [15:42:56] not sure what you mean with grease [15:43:00] <_joe_> paravoid: in the past I asked why we weren't doing it :) [15:43:01] but as it is, it will just loop [15:43:10] dependency loops [15:43:38] <_joe_> I mean separating concerns between what needs to be configured before we try to download any package, and what can be configured just before a specific package [15:44:48] <_joe_> else we'll be unable to download packages from security.debian.org for most of the first puppet run [15:44:56] !log upgrade and restart db2074 [15:45:04] not sure why? [15:45:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:52] paravoid: the apt::conf and apt::pin in class apt() are not executed at the start as they should be, but way down in the first puppet run [15:46:03] but security.d.o is being set up by d-i [15:46:04] <_joe_> paravoid: the package resource for the provider "apt" apparently depends on /etc/apt/apt.conf in puppet 4 [15:46:09] PROBLEM - Host mr1-eqiad.oob IPv6 is DOWN: CRITICAL - Destination Unreachable (2607:f6f0:205::153) [15:46:12] <_joe_> so it gets removed early in the installation [15:46:23] <_joe_> and we don't have the proxy config anymore [15:46:24] mr1 down ? 
[15:46:32] <_joe_> seems so, just ipv6 [15:46:39] (and just OOB IPv6 at that) [15:46:44] (03PS1) 10Marostegui: db-eqiad.php: Depool db1089 and db1067 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404306 (https://phabricator.wikimedia.org/T162807) [15:46:47] what gets removed and what removes it? [15:47:29] <_joe_> paravoid: /etc/apt/apt.conf [15:47:29] indeed OOB only but not just IPv6, it's IPv4 as well [15:47:42] <_joe_> which contains the proxy setting [15:47:55] <_joe_> as for what removes it, it is a file resource inside the apt class [15:48:19] PROBLEM - Host mr1-eqiad.oob is DOWN: PING CRITICAL - Packet loss = 100% [15:48:20] <_joe_> now I think something is made depend on it, and I think it's the package resource with the debian provider [15:48:27] <_joe_> but I have to confirm reading the sources [15:48:51] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1089 and db1067 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404306 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui) [15:49:08] we basically just need to order File['/etc/apt/apt.conf'] -> Apt::Conf (spaceship or all resources) as I see it [15:49:23] <_joe_> the other way around maybe? 
[15:49:33] <_joe_> you want the other confs to happen before you remove it [15:49:40] we have an ensure absent for /etc/apt/apt.conf [15:49:42] <_joe_> and yes, that was my third solution [15:50:11] https://gerrit.wikimedia.org/r/#/c/167835/ [15:50:24] https://gerrit.wikimedia.org/r/#/c/169643/ [15:50:37] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1089 and db1067 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404306 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui) [15:50:42] <_joe_> but that might cause loops too [15:50:44] and https://gerrit.wikimedia.org/r/#/c/179082/ [15:50:48] <_joe_> I'll test that next [15:51:09] <_joe_> the last one I remember :) [15:51:19] RECOVERY - Host mr1-eqiad.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 2.82 ms [15:51:22] but your patch is equivalent to https://gerrit.wikimedia.org/r/#/c/167835/ and won't work :( [15:51:35] <_joe_> paravoid: yeah we're testing things with volans [15:51:52] <_joe_> I'll try the apt::conf / apt::pin spaceships dependencies [15:51:57] I think ordering/dependency-wise we remove apt.conf, set up apt.conf.d, but don't ensure that apt doesn't run between those two steps [15:51:59] <_joe_> with /etc/apt/apt.conf [15:52:49] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1067 and db1089 - T162807 (duration: 01m 12s) [15:53:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:01] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807 [15:53:30] RECOVERY - Host mr1-eqiad.oob is UP: PING OK - Packet loss = 0%, RTA = 1.58 ms [15:53:49] (03CR) 10Faidon Liambotis: [C: 04-2] "Equivalent to I481bc29ba5f0b6fef8c61d16e9d1b5e1cfeb0c55, which got reverted by Iecc000fd0c93a428af9c9e8ea2aefa0dbe03313d because it was ca" [puppet] - 10https://gerrit.wikimedia.org/r/404305 (owner: 10Giuseppe Lavagetto) [15:53:53] I left a -2 JIC [15:54:09] if you re-use the changeset for a different approach I can remove 
[15:54:26] just leaving it there because as-is this will break all package installs :) [15:54:35] yeah we know ;) [15:54:36] <_joe_> as-is will break puppet [15:54:39] <_joe_> plain and simple [15:54:46] :) [15:54:48] as is it causes [15:54:49] Exec[apt-get update] => Class[Apt] => Stage[apt-config] => Stage[main] => Class[Base::Puppet::Puppet4] => Apt::Pin[puppet-all] => File[/etc/apt/preferences.d/puppet_all.pref] => Exec[apt-get update] [15:54:51] yup [15:54:59] or a number of other loops really [15:55:14] just explaining my -2 :) [15:55:47] !log Stop replication in sync db1067 and db1089 - T162807 [15:55:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:42] <_joe_> paravoid: we're basically exploring approaches to find the best one [15:59:09] <_joe_> because at the moment the appservers are borderline uninstallable without a fix [15:59:47] (03PS1) 10Giuseppe Lavagetto: apt: make apt::conf and apt::pin configs happen before removing apt.conf [puppet] - 10https://gerrit.wikimedia.org/r/404307 [16:03:00] yeah that should work [16:03:08] could add "before" statements to those apt::confs below [16:03:22] not sure if the pin is needed? [16:04:02] <_joe_> paravoid: it's not /strictly/ needed [16:04:43] <_joe_> but in general this should force puppet 4 to install both apt::pin and apt::conf directives *before* removing the apt.conf file [16:04:52] <_joe_> which should happen before puppet installs any package [16:04:58] not necessarily [16:05:17] the order between these and Packages isn't guaranteed, but do we care? 
[16:05:38] as long as apt.conf is ~ what puppet configure probably not [16:05:51] (03PS1) 10Alexandros Kosiaris: Move role::grafana::base to profile::grafana [puppet] - 10https://gerrit.wikimedia.org/r/404308 (https://phabricator.wikimedia.org/T170150) [16:05:53] (03PS1) 10Alexandros Kosiaris: grafana: Hieraize parameters [puppet] - 10https://gerrit.wikimedia.org/r/404309 (https://phabricator.wikimedia.org/T170150) [16:06:07] <_joe_> it kinda is guaranteed, more or less [16:06:32] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1089 and db1067 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404306 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui) [16:06:38] (03CR) 10jerkins-bot: [V: 04-1] Move role::grafana::base to profile::grafana [puppet] - 10https://gerrit.wikimedia.org/r/404308 (https://phabricator.wikimedia.org/T170150) (owner: 10Alexandros Kosiaris) [16:11:59] (03PS1) 10Jcrespo: mariadb: Switchover s3 codfw master from db2018 to db2036 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404310 (https://phabricator.wikimedia.org/T176243) [16:13:02] (03PS1) 10Alexandros Kosiaris: Deprecate passwords::grafana::labs [labs/private] - 10https://gerrit.wikimedia.org/r/404311 (https://phabricator.wikimedia.org/T170150) [16:13:34] (03PS1) 10Aklapper: Revert "Disable EducationProgram on cs.wikipedia" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404312 (https://phabricator.wikimedia.org/T180426) [16:15:02] (03PS1) 10Jcrespo: mariadb: Switchover s3 codfw master from db2018 to db2036 [puppet] - 10https://gerrit.wikimedia.org/r/404313 (https://phabricator.wikimedia.org/T176243) [16:18:37] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Deprecate passwords::grafana::labs [labs/private] - 10https://gerrit.wikimedia.org/r/404311 (https://phabricator.wikimedia.org/T170150) (owner: 10Alexandros Kosiaris) [16:18:46] (03CR) 10Marostegui: [C: 031] mariadb: Switchover s3 codfw master from db2018 to db2036 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/404310 (https://phabricator.wikimedia.org/T176243) (owner: 10Jcrespo) [16:19:23] (03CR) 10Marostegui: [C: 031] mariadb: Switchover s3 codfw master from db2018 to db2036 [puppet] - 10https://gerrit.wikimedia.org/r/404313 (https://phabricator.wikimedia.org/T176243) (owner: 10Jcrespo) [16:20:56] !log starting codfw s3 master switchover [16:21:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:26] (03CR) 10Jcrespo: [C: 032] mariadb: Switchover s3 codfw master from db2018 to db2036 [puppet] - 10https://gerrit.wikimedia.org/r/404313 (https://phabricator.wikimedia.org/T176243) (owner: 10Jcrespo) [16:23:13] (03PS1) 10Alexandros Kosiaris: Remove role::grafana::labs [puppet] - 10https://gerrit.wikimedia.org/r/404314 (https://phabricator.wikimedia.org/T170150) [16:23:26] (03PS1) 10Gehel: wdqs: simplify logging of categories reload [puppet] - 10https://gerrit.wikimedia.org/r/404315 [16:23:49] PROBLEM - puppet last run on wdqs2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[prometheus-blazegraph-exporter] [16:24:05] !log restarting db2036 to set as master [16:24:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:39] godog: ^^^ did you deploy a new blazegraph exporter ? [16:25:16] gehel: I did yeah, but doesn't work as expected because blazegraph.service isn't a thing [16:25:48] godog: :) yeah, it is wdqs-blazegraph... [16:26:16] one more reason to have an systemd override in puppet ... 
[16:27:11] (03PS2) 10Alexandros Kosiaris: Move role::grafana::base to profile::grafana [puppet] - 10https://gerrit.wikimedia.org/r/404308 (https://phabricator.wikimedia.org/T170150) [16:27:13] (03PS2) 10Alexandros Kosiaris: grafana: Hieraize parameters [puppet] - 10https://gerrit.wikimedia.org/r/404309 (https://phabricator.wikimedia.org/T170150) [16:27:15] (03PS2) 10Alexandros Kosiaris: Remove role::grafana::labs [puppet] - 10https://gerrit.wikimedia.org/r/404314 (https://phabricator.wikimedia.org/T170150) [16:27:19] indeed [16:27:57] (03CR) 10Gehel: "puppet compiler agrees: https://puppet-compiler.wmflabs.org/compiler03/9731/wdqs1005.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/404315 (owner: 10Gehel) [16:28:00] (03CR) 10jerkins-bot: [V: 04-1] Move role::grafana::base to profile::grafana [puppet] - 10https://gerrit.wikimedia.org/r/404308 (https://phabricator.wikimedia.org/T170150) (owner: 10Alexandros Kosiaris) [16:29:28] sigh jerkins-bot [16:29:43] I guess I can fold the 2 changes... 
that should resolve it [16:30:10] (03PS2) 10Giuseppe Lavagetto: apt: make apt::conf happen before removing apt.conf [puppet] - 10https://gerrit.wikimedia.org/r/404307 [16:30:37] (03PS3) 10Alexandros Kosiaris: Move role::grafana::base to profile::grafana [puppet] - 10https://gerrit.wikimedia.org/r/404308 (https://phabricator.wikimedia.org/T170150) [16:30:39] (03PS3) 10Alexandros Kosiaris: Remove role::grafana::labs [puppet] - 10https://gerrit.wikimedia.org/r/404314 (https://phabricator.wikimedia.org/T170150) [16:31:55] !log Force WB on db2033 - T184888 [16:32:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:07] T184888: Failed BBU on db2033 (x1 master) - https://phabricator.wikimedia.org/T184888 [16:32:43] (03PS4) 10Alexandros Kosiaris: Remove role::grafana::labs [puppet] - 10https://gerrit.wikimedia.org/r/404314 (https://phabricator.wikimedia.org/T170150) [16:34:29] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 40.30, 35.69, 31.82 [16:35:00] (03PS1) 10Filippo Giunchedi: Don't depend on blazegraph [debs/prometheus-blazegraph-exporter] - 10https://gerrit.wikimedia.org/r/404316 (https://phabricator.wikimedia.org/T184434) [16:35:19] <_joe_> can someone look at 1227? [16:35:24] <_joe_> I'm doing something else [16:36:50] (03CR) 10Volans: [C: 031] "LGTM. Runs ok on my puppetmaster+client. It still means that we basically depends on the provided apt.conf in the installer image, and we " [puppet] - 10https://gerrit.wikimedia.org/r/404307 (owner: 10Giuseppe Lavagetto) [16:38:02] 10Operations, 10ops-codfw, 10DBA: Failed BBU on db2033 (x1 master) - https://phabricator.wikimedia.org/T184888#3900967 (10Marostegui) The server kept lagging. I have forced the controller to go to WriteBack temporarily till we decide how to proceed with this host. ``` root@db2033:~# hpssacli controller all s... 
[16:39:00] (03CR) 10Giuseppe Lavagetto: "This is by no means a definitive solution, it's just part of the solution for the contingent problem we're trying to solve." [puppet] - 10https://gerrit.wikimedia.org/r/404307 (owner: 10Giuseppe Lavagetto) [16:39:10] (03PS3) 10Giuseppe Lavagetto: apt: make apt::conf happen before removing apt.conf [puppet] - 10https://gerrit.wikimedia.org/r/404307 [16:41:03] <_joe_> !log restarting hhvm on mw1227, threads stuck in HPHP::jit::enterTCImpl [16:41:11] (03CR) 10Giuseppe Lavagetto: [C: 032] apt: make apt::conf happen before removing apt.conf [puppet] - 10https://gerrit.wikimedia.org/r/404307 (owner: 10Giuseppe Lavagetto) [16:41:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:40] (03CR) 10Gehel: [C: 031] "LGTM" [debs/prometheus-blazegraph-exporter] - 10https://gerrit.wikimedia.org/r/404316 (https://phabricator.wikimedia.org/T184434) (owner: 10Filippo Giunchedi) [16:43:12] (03CR) 10Jcrespo: [C: 032] mariadb: Switchover s3 codfw master from db2018 to db2036 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404310 (https://phabricator.wikimedia.org/T176243) (owner: 10Jcrespo) [16:43:16] (03CR) 10Filippo Giunchedi: [C: 032] Don't depend on blazegraph [debs/prometheus-blazegraph-exporter] - 10https://gerrit.wikimedia.org/r/404316 (https://phabricator.wikimedia.org/T184434) (owner: 10Filippo Giunchedi) [16:43:34] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1089 and db1067" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404317 [16:43:46] (03CR) 10Marostegui: [C: 04-2] "Wait for s3 codfw switchover" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404317 (owner: 10Marostegui) [16:44:55] (03PS1) 10Alexandros Kosiaris: Deprecate passwords::grafana::production [labs/private] - 10https://gerrit.wikimedia.org/r/404318 [16:45:35] (03Merged) 10jenkins-bot: mariadb: Switchover s3 codfw master from db2018 to db2036 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404310 
(https://phabricator.wikimedia.org/T176243) (owner: 10Jcrespo) [16:46:03] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1089 and db1067" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404317 [16:47:08] (03CR) 10jenkins-bot: mariadb: Switchover s3 codfw master from db2018 to db2036 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404310 (https://phabricator.wikimedia.org/T176243) (owner: 10Jcrespo) [16:49:07] !log jynus@tin Synchronized wmf-config/db-codfw.php: Switchover s3 codfw master from db2018 to db2036 (duration: 01m 12s) [16:49:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:29] RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 16.43, 18.74, 24.00 [16:49:44] !log finished codfw s3 master switchover [16:49:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:11] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1089 and db1067" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404317 (owner: 10Marostegui) [16:50:18] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: prometheus-blazegraph-exporter failing to start after reboot - https://phabricator.wikimedia.org/T184434#3901003 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi Done, fix deployed [16:50:54] (03PS1) 10Alexandros Kosiaris: Simplify profile::grafana::production [puppet] - 10https://gerrit.wikimedia.org/r/404319 (https://phabricator.wikimedia.org/T170150) [16:51:43] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1089 and db1067" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404317 (owner: 10Marostegui) [16:51:57] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1089 and db1067" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404317 (owner: 10Marostegui) [16:52:12] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Deprecate passwords::grafana::production [labs/private] - 10https://gerrit.wikimedia.org/r/404318 (owner: 
10Alexandros Kosiaris) [16:52:22] 10Operations, 10Datasets-General-or-Unknown, 10Wikidata, 10HHVM, 10Patch-For-Review: Enable GC for HHVM CLI (at least for dump runners) - https://phabricator.wikimedia.org/T162245#3901016 (10ArielGlenn) Snapshot hosts are going directly to php7/stretch, bypassing this issue. See T181029. [16:53:08] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1067 and db1089 - T162807 (duration: 01m 12s) [16:53:15] 10Operations, 10ops-eqsin: rack/setup scs-eqsin.mgmt.eqsin.wmnet - https://phabricator.wikimedia.org/T181569#3794643 (10ayounsi) Port 5 now works. Port 6 doesn't give a shell, but replies some characters on key-press. The other atlas don't seem to be connected to a scs so I can't compare. a few options: 1/ It'... [16:53:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:20] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807 [16:53:38] !log upgrade and restart db2018 [16:53:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:00] RECOVERY - cassandra-b CQL 10.64.32.131:9042 on restbase1017 is OK: TCP OK - 0.005 second response time on 10.64.32.131 port 9042 [17:01:17] (03PS1) 10Alexandros Kosiaris: grafana: Allow to modify the config in hiera [puppet] - 10https://gerrit.wikimedia.org/r/404320 (https://phabricator.wikimedia.org/T170150) [17:02:21] RECOVERY - MariaDB Slave Lag: x1 on db2033 is OK: OK slave_sql_lag Replication lag: 42.98 seconds [17:02:52] RECOVERY - MariaDB Slave Lag: x1 on dbstore2002 is OK: OK slave_sql_lag Replication lag: 0.01 seconds [17:06:32] RECOVERY - cassandra-b service on restbase1017 is OK: OK - cassandra-b is active [17:07:11] RECOVERY - cassandra-c SSL 10.64.32.132:7001 on restbase1017 is OK: SSL OK - Certificate restbase1017-c valid until 2018-08-17 16:11:33 +0000 (expires in 213 days) [17:07:48] 10Operations, 10ops-esams: install/designate other machines as esams bastion - 
https://phabricator.wikimedia.org/T184936#3901043 (10Dzahn) p:05Triage>03High [17:08:17] wow godog, both -b and -c came up at the same time? [17:08:50] !log bootstrap cassandra-c on restbase1017 [17:09:00] mobrovac: hehhe no I started it when I saw -b completed [17:09:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:09] :) [17:11:11] RECOVERY - puppet last run on restbase1017 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [17:11:21] 10Operations, 10ops-esams: install/designate other machine as esams bastion - https://phabricator.wikimedia.org/T184936#3901055 (10Dzahn) [17:17:01] RECOVERY - cassandra-c service on restbase1017 is OK: OK - cassandra-c is active [17:17:38] (03PS5) 10Alexandros Kosiaris: Remove role::grafana::labs [puppet] - 10https://gerrit.wikimedia.org/r/404314 (https://phabricator.wikimedia.org/T170150) [17:17:40] (03PS2) 10Alexandros Kosiaris: Simplify profile::grafana::production [puppet] - 10https://gerrit.wikimedia.org/r/404319 (https://phabricator.wikimedia.org/T170150) [17:17:42] (03PS2) 10Alexandros Kosiaris: grafana: Allow to modify the config in hiera [puppet] - 10https://gerrit.wikimedia.org/r/404320 (https://phabricator.wikimedia.org/T170150) [17:17:44] (03PS1) 10Alexandros Kosiaris: WIP: grafana: Enable grafana's LDAP [puppet] - 10https://gerrit.wikimedia.org/r/404321 (https://phabricator.wikimedia.org/T170150) [17:18:51] RECOVERY - puppet last run on wdqs2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:21:10] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2036 - https://phabricator.wikimedia.org/T184836#3898087 (10jcrespo) As a heads up, this is now the s3 master. [17:22:00] (03CR) 10Alexandros Kosiaris: "https://puppet-compiler.wmflabs.org/compiler02/9737/krypton.eqiad.wmnet/ is pretty good. 
I still need to populate ldap.toml config file in" [puppet] - 10https://gerrit.wikimedia.org/r/404321 (https://phabricator.wikimedia.org/T170150) (owner: 10Alexandros Kosiaris) [17:28:19] (03PS1) 10Jcrespo: mariadb: Set as spares labsdb1001 and labsdb1003 [puppet] - 10https://gerrit.wikimedia.org/r/404323 (https://phabricator.wikimedia.org/T142807) [17:29:45] RECOVERY - nutcracker port on mw1340 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [17:29:48] RECOVERY - HHVM processes on mw1340 is OK: PROCS OK: 6 processes with command name hhvm [17:30:05] RECOVERY - High CPU load on API appserver on mw1340 is OK: OK - load average: 2.38, 2.39, 2.34 [17:30:06] RECOVERY - Disk space on mw1340 is OK: DISK OK [17:30:06] RECOVERY - dhclient process on mw1340 is OK: PROCS OK: 0 processes with command name dhclient [17:30:06] RECOVERY - DPKG on mw1340 is OK: All packages OK [17:30:15] 10Operations, 10monitoring, 10Patch-For-Review: Evaluate Grafana's LDAP group options and deprecate grafana-admin if possible - https://phabricator.wikimedia.org/T170150#3901109 (10akosiaris) Patchsets above clean up puppetization, drop the ugly distinction of labs vs production from code, moving that into h... [17:30:25] RECOVERY - nutcracker process on mw1340 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [17:30:27] (03PS2) 10Jcrespo: mariadb: Set as spares labsdb1001 and labsdb1003 [puppet] - 10https://gerrit.wikimedia.org/r/404323 (https://phabricator.wikimedia.org/T142807) [17:30:59] (03PS1) 10Filippo Giunchedi: prometheus: allow override of fs-related node-exporter options [puppet] - 10https://gerrit.wikimedia.org/r/404324 (https://phabricator.wikimedia.org/T184469) [17:31:16] PROBLEM - puppet last run on mw1282 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:31:20] (03CR) 10jerkins-bot: [V: 04-1] prometheus: allow override of fs-related node-exporter options [puppet] - 10https://gerrit.wikimedia.org/r/404324 (https://phabricator.wikimedia.org/T184469) (owner: 10Filippo Giunchedi) [17:31:25] PROBLEM - puppet last run on ms-be1018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:31:56] PROBLEM - puppet last run on rdb1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:32:05] PROBLEM - puppet last run on cp4025 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:32:25] PROBLEM - Check whether ferm is active by checking the default input chain on mw1338 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [17:32:26] PROBLEM - nutcracker port on mw1338 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [17:32:35] PROBLEM - puppet last run on mw1262 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:32:35] PROBLEM - puppet last run on mw1275 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:32:46] PROBLEM - puppet last run on labsdb1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:33:05] PROBLEM - puppet last run on dysprosium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:33:16] PROBLEM - puppet last run on labvirt1013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:33:16] PROBLEM - puppet last run on rhodium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:33:16] PROBLEM - puppet last run on elastic1043 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:33:16] PROBLEM - puppet last run on analytics1065 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:33:22] what is that? software, puppetmaster? [17:33:24] akosiaris: you jinxed it [17:33:26] PROBLEM - puppet last run on mw1238 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:33:26] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:33:36] PROBLEM - puppet last run on logstash1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:33:40] jynus: puppetdb killed, restarted by systemd [17:33:46] ok cool [17:33:57] let's see the new dashboard luca put [17:34:05] PROBLEM - nutcracker process on mw1338 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [17:34:07] PROBLEM - DPKG on mw1338 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [17:34:07] PROBLEM - puppet last run on boron is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:34:07] I mean, not cool, but you get it [17:34:13] yeah ;) [17:34:26] PROBLEM - puppet last run on db1094 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:34:26] PROBLEM - puppet last run on druid1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:34:35] PROBLEM - puppet last run on mw1312 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:35:46] PROBLEM - Disk space on mw1338 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [17:35:46] PROBLEM - puppet last run on mw1338 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[17:36:35] PROBLEM - HHVM jobrunner on mw1338 is CRITICAL: connect to address 10.64.32.50 and port 9005: Connection refused [17:37:15] I'm re-running puppet on the failed hosts so that they recover now instead of in 30min, expect some recovery spam ;) [17:37:46] RECOVERY - puppet last run on labsdb1004 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [17:38:26] RECOVERY - puppet last run on mw1238 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [17:39:26] RECOVERY - puppet last run on druid1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:40:26] RECOVERY - HHVM rendering on mw1340 is OK: HTTP OK: HTTP/1.1 200 OK - 79239 bytes in 5.401 second response time [17:41:16] RECOVERY - puppet last run on mw1282 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:41:56] RECOVERY - puppet last run on rdb1006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:42:05] RECOVERY - puppet last run on cp4025 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [17:42:15] PROBLEM - nutcracker process on mw1338 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [17:42:16] PROBLEM - DPKG on mw1338 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[17:42:26] RECOVERY - Check whether ferm is active by checking the default input chain on mw1338 is OK: OK ferm input default policy is set [17:42:26] PROBLEM - nutcracker port on mw1338 is CRITICAL: connect to address 127.0.0.1 and port 11212: Connection refused [17:42:35] RECOVERY - puppet last run on mw1262 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:42:35] RECOVERY - puppet last run on mw1275 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [17:42:56] RECOVERY - Disk space on mw1338 is OK: DISK OK [17:43:05] RECOVERY - puppet last run on dysprosium is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:43:15] RECOVERY - DPKG on mw1338 is OK: All packages OK [17:43:16] RECOVERY - puppet last run on labvirt1013 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:43:16] RECOVERY - puppet last run on rhodium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:43:16] RECOVERY - puppet last run on elastic1043 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:43:16] RECOVERY - puppet last run on analytics1065 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:43:26] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:43:35] RECOVERY - puppet last run on logstash1005 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:44:06] RECOVERY - puppet last run on boron is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [17:44:26] RECOVERY - puppet last run on db1094 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:44:35] RECOVERY - puppet last run on mw1312 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [17:44:47] !log updating HHVM in deployment-prep to HHVM 3.18.7 
[17:44:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:04] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Add Prometheus client support for varnish/statsd metrics daemons - https://phabricator.wikimedia.org/T177199#3901174 (10fgiunchedi) [17:46:25] RECOVERY - puppet last run on ms-be1018 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [17:46:35] RECOVERY - HHVM jobrunner on mw1338 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time [17:48:54] 10Operations, 10Traffic, 10Goal, 10User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942#3901180 (10fgiunchedi) p:05Triage>03Normal [17:49:47] 10Operations, 10Goal, 10Technical-Debt, 10User-fgiunchedi: Reduce technical debt in metrics monitoring - https://phabricator.wikimedia.org/T177195#3901194 (10fgiunchedi) [17:49:49] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Add Prometheus client support for varnish/statsd metrics daemons - https://phabricator.wikimedia.org/T177199#3901192 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi [17:50:31] RECOVERY - nutcracker port on mw1338 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [17:55:50] RECOVERY - puppet last run on mw1338 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:56:01] RECOVERY - nutcracker process on mw1338 is OK: PROCS OK: 1 process with UID = 113 (nutcracker), command name nutcracker [17:57:17] (03PS5) 10Dzahn: DHCP: switch from jessie to stretch as default installer [puppet] - 10https://gerrit.wikimedia.org/r/399826 (https://phabricator.wikimedia.org/T182215) [18:00:04] gehel: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180115T1800). 
[18:00:04] No GERRIT patches in the queue for this window AFAICS. [18:00:32] jouncebot: nothing to deploy today... [18:01:56] !log uploading HHVM 3.18.7 (3.18.5+dfsg-1+wmf3) for jessie-wikimedia to apt.wikimedia.org [18:02:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:51] (03PS1) 10Filippo Giunchedi: restbase: reprovision restbase1018 [puppet] - 10https://gerrit.wikimedia.org/r/404325 (https://phabricator.wikimedia.org/T184100) [18:03:05] (03CR) 10jerkins-bot: [V: 04-1] restbase: reprovision restbase1018 [puppet] - 10https://gerrit.wikimedia.org/r/404325 (https://phabricator.wikimedia.org/T184100) (owner: 10Filippo Giunchedi) [18:04:25] PROBLEM - Host mr1-eqiad.oob is DOWN: PING CRITICAL - Packet loss = 100% [18:06:54] PROBLEM - Host mr1-eqiad.oob IPv6 is DOWN: CRITICAL - Destination Unreachable (2607:f6f0:205::153) [18:09:34] RECOVERY - Host mr1-eqiad.oob is UP: PING OK - Packet loss = 0%, RTA = 1.59 ms [18:12:05] RECOVERY - Host mr1-eqiad.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 2.65 ms [18:13:33] 10Operations, 10ops-esams: install/designate other machine as esams bastion - https://phabricator.wikimedia.org/T184936#3901246 (10Dzahn) godog points out that we need to copy prometheus performance data from one host to another and that we should write an updated process how to do that [18:19:27] ACKNOWLEDGEMENT - HP RAID on db2036 is CRITICAL: CRITICAL: Slot 0: Failed: 1I:1:6 - OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T184946 [18:19:31] 10Operations, 10ops-codfw: Degraded RAID on db2036 - https://phabricator.wikimedia.org/T184946#3901253 (10ops-monitoring-bot) [18:34:03] !log oblivian@neodymium conftool action : set/pooled=active; selector: name=mw1338.eqiad.wmnet [18:34:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:36] 
RECOVERY - cassandra-c CQL 10.64.32.132:9042 on restbase1017 is OK: TCP OK - 0.000 second response time on 10.64.32.132 port 9042 [18:42:39] !log oblivian@neodymium conftool action : set/pooled=active; selector: name=mw1338.eqiad.wmnet [18:42:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:43] !log oblivian@puppetmaster1001 conftool action : set/pooled=active; selector: name=mw1338.eqiad.wmnet [18:43:52] <_joe_> uh? [18:43:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:58] <_joe_> something wrong, sorry [18:44:44] <_joe_> yeah, PEBKAC [18:50:11] (03Draft2) 10Jayprakash12345: Create "eliminator" user group on ur.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404327 (https://phabricator.wikimedia.org/T184607) [18:50:52] <_joe_> !log pooled mw1340 as an api appserver [18:51:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:14] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install labvirt102[12] - https://phabricator.wikimedia.org/T183937#3901318 (10Andrew) I powered these off for the moment, just to cut down on dhcp noise. [18:52:57] (03CR) 10Filippo Giunchedi: [C: 04-1] "See inline" (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/399826 (https://phabricator.wikimedia.org/T182215) (owner: 10Dzahn) [19:00:04] addshore, hashar, anomie, no_justification, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Your horoscope predicts another unfortunate Morning SWAT (Max 8 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180115T1900). [19:00:04] No GERRIT patches in the queue for this window AFAICS. [19:20:59] 10Operations, 10monitoring, 10Patch-For-Review: Evaluate Grafana's LDAP group options and deprecate grafana-admin if possible - https://phabricator.wikimedia.org/T170150#3901380 (10faidon) Thanks so much for this, kudos! 
Any reason to not just 301 grafana-admin to grafana for a few months (and then just drop... [19:34:26] RECOVERY - mediawiki-installation DSH group on mw1340 is OK: OK [19:40:13] 10Operations, 10ops-codfw: Degraded RAID on db2036 - https://phabricator.wikimedia.org/T184946#3901402 (10Marostegui) [19:40:15] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2036 - https://phabricator.wikimedia.org/T184836#3901404 (10Marostegui) [19:59:44] 10Operations, 10Puppet, 10Beta-Cluster-Infrastructure: Puppet broken on deployment-netbox, looks like it thinks its a prod box - https://phabricator.wikimedia.org/T184242#3901425 (10ayounsi) Indeed, the instance is not needed anymore. I shut it down and will delete it in a few days. [21:00:04] cscott, arlolra, subbu, bearND, halfak, and Amir1: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180115T2100). [21:00:04] No GERRIT patches in the queue for this window AFAICS. [21:00:57] 10Operations, 10Discourse, 10Developer-Relations (Jan-Mar-2018): Setup reply via email in discourse-mediawiki.wmflabs.org - https://phabricator.wikimedia.org/T184592#3901477 (10Qgil) Thank you for your assistance, but it's still not working. https://meta.discourse.org/t/set-up-reply-via-email-support-e-mail/... [21:17:06] 10Operations, 10Discourse, 10Developer-Relations (Jan-Mar-2018): Setup reply via email in discourse-mediawiki.wmflabs.org - https://phabricator.wikimedia.org/T184592#3901484 (10Tgr) It's under https://myaccount.google.com/apppasswords (a different thing from "apps with access to your account" which is about... 
[21:26:26] PROBLEM - HHVM rendering on mw2124 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:27:17] RECOVERY - HHVM rendering on mw2124 is OK: HTTP OK: HTTP/1.1 200 OK - 79299 bytes in 0.398 second response time
[21:39:38] Operations, fundraising-tech-ops, netops: switch network port 2/0/3 (frdb1003) back to administration-vlan - https://phabricator.wikimedia.org/T184723#3901504 (ayounsi) Open>Resolved a:ayounsi Done! ``` [edit interfaces interface-range vlan-fundraising] - member "ge-[0-1]/0/3"; [edit i...
[21:45:59] Operations, Traffic, Patch-For-Review: Content purges are unreliable - https://phabricator.wikimedia.org/T133821#3901513 (zhuyifei1999)
[21:51:43] Operations, DNS, Traffic: en.wiki domain owned by us, but isn't hosted by us?? - https://phabricator.wikimedia.org/T167060#3316252 (Krenair) wikibooks.wiki too - https://meta.wikimedia.org/wiki/Requests_for_comment/Domain_parking
[21:58:58] (PS3) BryanDavis: mariadb: Set as spares labsdb1001 and labsdb1003 [puppet] - https://gerrit.wikimedia.org/r/404323 (https://phabricator.wikimedia.org/T184832) (owner: Jcrespo)
[22:00:04] dapatrick, bawolff, and Reedy: How many deployers does it take to do Weekly Security deployment window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180115T2200).
[22:00:04] No GERRIT patches in the queue for this window AFAICS.
[23:05:30] PROBLEM - Host es1019.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[23:11:20] PROBLEM - Nginx local proxy to apache on mw2222 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:12:10] RECOVERY - Nginx local proxy to apache on mw2222 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.192 second response time
[23:32:42] (CR) Chad: [C: -2] "I don't see anything in the 2.14.7 log thats super important. We're already targeting and testing 2.14.6, let's not move the goalposts."
[software/gerrit] - https://gerrit.wikimedia.org/r/395820 (https://phabricator.wikimedia.org/T156120) (owner: Chad)
[23:36:12] (CR) Chad: [C: 2] Revert "Disable EducationProgram on cs.wikipedia" [mediawiki-config] - https://gerrit.wikimedia.org/r/404312 (https://phabricator.wikimedia.org/T180426) (owner: Aklapper)
[23:37:30] PROBLEM - HHVM rendering on mw2130 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:37:37] (Merged) jenkins-bot: Revert "Disable EducationProgram on cs.wikipedia" [mediawiki-config] - https://gerrit.wikimedia.org/r/404312 (https://phabricator.wikimedia.org/T180426) (owner: Aklapper)
[23:37:49] (CR) jenkins-bot: Revert "Disable EducationProgram on cs.wikipedia" [mediawiki-config] - https://gerrit.wikimedia.org/r/404312 (https://phabricator.wikimedia.org/T180426) (owner: Aklapper)
[23:38:20] RECOVERY - HHVM rendering on mw2130 is OK: HTTP OK: HTTP/1.1 200 OK - 79170 bytes in 0.312 second response time
[23:40:34] !log demon@tin Synchronized wmf-config/InitialiseSettings.php: turn educationprogram back on for cs.wikipedia -- turns out there was no consensus and a patch should never have been written 😡 (duration: 01m 13s)
[23:40:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:41:47] Operations, DNS, Domains, Traffic, and 2 others: en.wiki domain owned by us, but isn't hosted by us?? - https://phabricator.wikimedia.org/T167060#3901643 (Peachey88) p:Low>Triage Resetting priority for re-triage by ops on-call. Redirecting users though a random AWS account when they hit...
[23:43:30] (CR) Chad: "*shrug* Fixed version of scap will go live before this needs another deployment" [software/gerrit] - https://gerrit.wikimedia.org/r/404221 (https://phabricator.wikimedia.org/T184882) (owner: Paladox)
[23:46:48] Hey operations, shinken isn't up
[23:49:26] Zppix: Maybe ask cloud services? Production doesn't use it.
[23:49:34] * no_justification goes back to his vacation
[23:50:05] no_justification: I should have known that, sorry
[23:52:57] Also, most Americans will be off today, it's a federal holiday
[23:53:19] s/Americans/people working in US timezones/
[23:53:31] Well, I alerted them in the channel just in case they weren't aware :)
[23:53:38] Proper*