[00:06:04] still > Updating LocalisationCache for 1.32.0-wmf.20 using 30 thread(s)
[00:07:24] > Generating JSON versions and md5 files
[00:07:26] \o/
[00:07:40] yay
[00:08:29] just syncing/rebuild cdb left -- usually faster than the update
[00:16:42] cdb rebuild started
[00:21:30] !log thcipriani@deploy1001 Finished scap: SWAT: [[gerrit:458586|Add italian translation]] T203297 [[gerrit:458609|Improve German translation]] [[gerrit:458617|German translation: Replace "Vertreter" by "EU-Abgeordnete"]] (duration: 34m 38s)
[00:21:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:21:36] T203297: Manage translations for fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T203297
[00:21:36] ^ CindyCicaleseWMF should be live
[00:21:50] excellent - checking
[00:22:20] (CR) Dzahn: ""parameters": {" [puppet] - https://gerrit.wikimedia.org/r/455744 (https://phabricator.wikimedia.org/T196701) (owner: Dzahn)
[00:22:59] hmm - the Italian messages are not showing up
[00:24:00] thcipriani: we had a problem with the German messages not showing up after scap last night, but I think that problem should be fixed.
[00:24:42] is there a cache that needs to be cleared?
[00:25:03] hrm, there may be
[00:25:31] If legoktm is around, he may have some ideas
[00:25:47] https://fixcopyright.wikimedia.org/?uselang=it is supposed to be in Italian now
[00:27:20] Or krinkle might know
[00:27:44] looking at the sal from last night
[00:28:05] CindyCicaleseWMF: Was the translation recently added?
[00:28:08] Or did it work yesterday?
[00:28:10] (it)
[00:28:22] added tonight, just scap sync'd
[00:28:25] There were no Italian translations until now
[00:28:39] https://phabricator.wikimedia.org/T203626
[00:29:05] Run from mwscript eval.php on mwmaint1001 instead of deploy1001
[00:29:13] $bs = new MessageBlobStore(); $bs->clear();
[00:29:21] with --wiki fixcopyrightwiki
[00:29:24] thank you, krinkle !
:-)
[00:29:58] Clearing of localisation cache from scap is broken since the roll out of mcrouter. Above task tracks the fixing of that
[00:31:16] (PS1) Dzahn: confd: base::service_unit -> systemd::service [puppet] - https://gerrit.wikimedia.org/r/458618 (https://phabricator.wikimedia.org/T194724)
[00:31:37] will l10nupdate work for fixcopyrightwiki while that's broken?
[00:32:00] hrm, ok, just ran that with fixcopyrightwiki
[00:32:02] on mwmaint
[00:33:19] Should I be seeing Italian yet? I'm not . . .
[00:33:24] I'm not either :\
[00:34:49] krinkle was there anything else?
[00:34:50] I'm also not seeing memcached errors in the logs for deploy1001 like yesterday
[00:34:53] so probably not related issue
[00:34:54] huh
[00:35:24] Can you check a message key manually from mwmaint eval.php --wiki, using echo wfMessage( key )->text()
[00:35:44] That would rule out whether it's LocalisationCache/MessageCache related to ResourceLoader.
[00:35:55] LCM/MC related, or ResourceLoader *
[00:36:11] wfMessage( key )->inLanguage( 'it' )->text();
[00:36:21] If that also shows English, then it's not cache related.
[00:36:46] But rather means scap didn't update the actual files correctly, or, the messages aren't translated/deployed.
[00:37:24] seems to work via eval https://phabricator.wikimedia.org/P7523
[00:37:34] on mwmaint
[00:37:49] CindyCicaleseWMF: l10nupdate will have been a no-op for javascript messages between last month and yesterday. But fine for messages output in PHP in skins and extensions. And the JS messages are fine again since yesterday.
[00:38:02] I see the Italian messages on the wiki: https://fixcopyright.wikimedia.org/w/index.php?title=Special%3AAllMessages&prefix=eucc&filter=all&lang=it&limit=50
[00:38:27] krinkle: good news
[00:39:25] But not for this message
[00:39:26] https://fixcopyright.wikimedia.org/w/index.php?title=Special%3AAllMessages&prefix=euccs-call-to-action-text&filter=all&lang=it&limit=50
[00:39:40] Got the key from https://fixcopyright.wikimedia.org/?uselang=qqx
[00:39:56] right, that one got missed in this round of translation
[00:40:15] but, there were no italian messages before the patch that just went in
[00:40:16] that's the big one for most of the front page
[00:41:00] krinkle, why yes it is - ugh
[00:41:21] Scrolling down at https://fixcopyright.wikimedia.org/?uselang=it
[00:41:29] so, your script above did fix the others
[00:41:30] I do see eucc-country-picker-layout-label in Italian fine
[00:42:00] that's actually the only italian word I see on the page
[00:42:05] And, if you choose a country, the rest is in Italian. So, we're good from that perspective.
[00:42:07] the label of the dropdown and the dropdown
[00:42:12] Sooooo
[00:42:27] That may've been Varnish caching it until we ran the script, or some other intermittent issue
[00:42:30] my script did nothing :P
[00:42:34] Am I correct that it is not possible to deploy tomorrow?
[00:42:43] the scap command ran the same script as well
[00:42:43] so I've been fiddling with ->updateMessage on the blobstore
[00:43:10] is there a translation that exists and can be seen as Italian in Gerrit/Git, but not on the site?
[00:43:11] that has worked in the past to clear the resource loader cache
[00:43:43] thcipriani: clear() is the nuclear approach, which updateLocalisationCache from scap does and we ran a second time
[00:43:50] krinkle, no the rest appears to be good
[00:43:54] there is no rl cache issue :)
[00:44:12] the problem is that message comes from the skin, not the extension, and we only got a translation for the extension
[00:44:21] Krinkle: that's what I figured, re:clear
[00:44:49] Yeah, the big messages above the fold (header, body text, etc) haven't been translated yet, or at least not deployed.
[00:44:52] I should have realized that was missing.
[00:45:21] A small warning bell went off when I only got the extension translation, but it should have been a loud gong.
[00:46:06] I will get the translation and put it on the wiki for now.
[00:46:40] krinkle: thcipriani: Am I correct that there are no deployments on Fridays?
[00:47:13] when is this going live?
[00:47:27] Last night
[00:47:51] But, the translations weren't ready, so we're madly trying to catch up.
[00:48:24] I see
[00:48:42] There are generally no deployments of code on Fridays
[00:48:58] which makes perfect sense
[00:49:13] Well, we rarely take the site down for the same issue twice, so the comparison isn't entirely fair, but, localisation updates have been among the more frequent causes of outages. As such, not a low risk, and that's why nightly l10nupdate doesn't run on Friday/weekends.
[00:49:15] translations seem a little less risky in general
[00:49:32] should be, yes.
[00:49:34] but aren't.
[00:50:16] CindyCicaleseWMF: The main risk is from having to run scap and regenerate parts of MW that also control parser magic words and special page names.
[00:50:42] Invalidating the front layer only is risk-free, but doesn't do anything without regenerating the big LocalisationCache thing
[00:50:46] the front layer being "MessageCache".
[00:50:53] There's actually one other change they'd like to get in ASAP that didn't make it for tonight. It removes two MEPs and adds one other to the JSON data files. So, that doesn't involve running scap.
[00:50:58] CindyCicaleseWMF: but, you can go old school and use MediaWiki namespace pages.
[00:51:20] Every time you press edit on a NS_MW page, the MessageCache layer and ResourceLoader layer is cleared for that message key in all languages.
[00:51:35] copy/paste, is a bit laborious but would work.
[00:51:36] Yes, that was what I suggested. Then the changes can go in on the backend next week.
[00:51:50] s/edit/save/
[00:52:26] It isn't pretty, but it works.
[00:52:34] Yeah, "editing wiki pages" on a Friday is fine, naturally :)
[00:52:49] Yes :-) They can even edit all weekend and be fine.
[00:53:51] better even, there's a maintenance script we can run to auto-delete them after the next scap (if and only if, they are identical) to avoid them from going stale (e.g. if another twn update makes them better, they'll get through instead of requiring manual updating/deleting).
[00:54:05] deleteEqualMessages.php
[00:54:13] Ooooh! That's great!!
[00:54:24] So just run that every time after a scap and you can quickly undo the copy/paste things
[00:54:42] I told them they'd have to do that manually, so at least that will be easier than expected :-)
[00:55:38] so the translations are going to translatewiki, and then someone copies them to the wiki, right? If directly, I guess that's fine too, the special page will help track what's local and what not.
[00:56:05] Right.
[00:57:22] I was figuring we'd monitor the special page but the script would help enormously to know if it was safe to remove the messages
[00:58:21] So, I think we're good for translation for the weekend. For the MEP change in the JSON file that didn't get in tonight, if they deem that is important enough they want it done tomorrow, what would be the process?
[00:58:43] Operations, Performance-Team: deploy1001 can't talk to memcached, breaking invalidation of RL localization cache - https://phabricator.wikimedia.org/T203626 (Krinkle) Open>Resolved
[00:58:48] Operations, Performance-Team: deploy1001 can't talk to memcached, breaking invalidation of RL localization cache - https://phabricator.wikimedia.org/T203626 (Krinkle)
[00:58:50] Operations, Availability (MediaWiki-MultiDC), Patch-For-Review, Performance-Team (Radar): Deploy mcrouter to production as a wancache backend - https://phabricator.wikimedia.org/T192370 (Krinkle)
[00:58:57] Operations, Availability (MediaWiki-MultiDC), Patch-For-Review, Performance-Team (Radar): Deploy mcrouter to production as a wancache backend - https://phabricator.wikimedia.org/T192370 (Krinkle)
[00:59:28] CindyCicaleseWMF: if it needs to be done tomorrow, msg greg-g on IRC, he'd be the one to make exceptions
[00:59:46] then he'll wrangle someone to get it out the door
[01:00:07] thcipriani: gotcha - thanks! Hopefully we can avoid that.
[01:00:29] But good to know there's a possibility if ABSOLUTELY necessary.
[01:00:31] Operations, Availability (MediaWiki-MultiDC), Patch-For-Review, Performance-Team (Radar): Deploy mcrouter to production as a wancache backend - https://phabricator.wikimedia.org/T192370 (Krinkle) Note that `memcached-pecl` (which uses Nutcracker) is still used in wmf-config in two places: 1. On...
[01:00:53] thcipriani: krinkle: thank you both very much for your help tonight!
[01:01:46] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[01:04:02] sure thing!
happy when I'm helpful :)
[01:06:47] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[01:33:21] (PS1) Tim Starling: Set PHP time limit [mediawiki-config] - https://gerrit.wikimedia.org/r/458623 (https://phabricator.wikimedia.org/T97192)
[01:34:19] (PS1) Tim Starling: Increase Apache timeouts by 2 seconds [puppet] - https://gerrit.wikimedia.org/r/458624 (https://phabricator.wikimedia.org/T97192)
[01:34:21] (CR) jerkins-bot: [V: -1] Set PHP time limit [mediawiki-config] - https://gerrit.wikimedia.org/r/458623 (https://phabricator.wikimedia.org/T97192) (owner: Tim Starling)
[01:35:47] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[01:40:47] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 17 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[01:41:22] (PS2) Tim Starling: Set PHP time limit [mediawiki-config] - https://gerrit.wikimedia.org/r/458623 (https://phabricator.wikimedia.org/T97192)
[01:53:17] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[01:58:26] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[01:59:16] PROBLEM - Filesystem available is greater than filesystem size on ms-be2042 is CRITICAL: cluster=swift device=/dev/sdj1 fstype=xfs instance=ms-be2042:9100 job=node mountpoint=/srv/swift-storage/sdj1 site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2042&var-datasource=codfw%2520prometheus%252Fops
[02:14:33] Operations, Traffic, fixcopyright.wikimedia.org,
MW-1.32-release-notes (WMF-deploy-2018-09-04 (1.32.0-wmf.20)), Patch-For-Review: Sort out HTTP caching issues for fixcopyright wiki - https://phabricator.wikimedia.org/T203179 (Legoktm)
[02:53:47] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[02:58:56] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[03:11:17] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[03:12:13] (PS1) Legoktm: Prefer using npm from /usr/local/bin/npm [software/tools-webservice] - https://gerrit.wikimedia.org/r/458629 (https://phabricator.wikimedia.org/T169451)
[03:16:27] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[03:29:47] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 947.80 seconds
[03:39:39] (CR) Mathew.onipe: Elasticsearch module is coming up. (7 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) (owner: Mathew.onipe)
[03:40:28] (PS23) Mathew.onipe: Elasticsearch module is coming up.
[software/spicerack] - https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885)
[03:45:07] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 225.84 seconds
[03:50:47] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[03:56:06] PROBLEM - Filesystem available is greater than filesystem size on ms-be2041 is CRITICAL: cluster=swift device=/dev/sdh1 fstype=xfs instance=ms-be2041:9100 job=node mountpoint=/srv/swift-storage/sdh1 site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2041&var-datasource=codfw%2520prometheus%252Fops
[03:57:16] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[04:25:15] (CR) Krinkle: [C: 2] Prefer using npm from /usr/local/bin/npm [software/tools-webservice] - https://gerrit.wikimedia.org/r/458629 (https://phabricator.wikimedia.org/T169451) (owner: Legoktm)
[04:25:42] (Merged) jenkins-bot: Prefer using npm from /usr/local/bin/npm [software/tools-webservice] - https://gerrit.wikimedia.org/r/458629 (https://phabricator.wikimedia.org/T169451) (owner: Legoktm)
[04:25:44] legoktm: for the docker image one, can I merge that, or is there a deploy process involved?
[04:28:00] there's a deploy process
[04:28:14] k, I'll let wmcs merge it then.
[04:28:18] I first have to build and upload a new webservice package first
[04:37:31] Operations, Wikimedia-Mailing-lists, Bengali-Sites: Set up mailing list for Bengali Wikibooks - https://phabricator.wikimedia.org/T203736 (Shahadat)
[04:55:36] (PS1) Legoktm: Bump changelog [software/tools-webservice] - https://gerrit.wikimedia.org/r/458632
[04:56:01] (CR) Legoktm: [C: 2] Bump changelog [software/tools-webservice] - https://gerrit.wikimedia.org/r/458632 (owner: Legoktm)
[04:56:51] (Merged) jenkins-bot: Bump changelog [software/tools-webservice] - https://gerrit.wikimedia.org/r/458632 (owner: Legoktm)
[04:58:18] !log Disable puppet on dbproxy1006 for logging testing - T201021
[04:58:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:58:25] T201021: HAproxy on dbproxy hosts lack enough logging - https://phabricator.wikimedia.org/T201021
[05:01:37] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[05:06:47] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[05:13:41] Operations, DBA, monitoring, Patch-For-Review: HAproxy on dbproxy hosts lack enough logging - https://phabricator.wikimedia.org/T201021 (Marostegui) Using `crit` instead of `info` give us when a server goes down but it doesn't log every single health check request. So we get to know when the fail...
[05:16:52] (PS1) Marostegui: dbproxy: Reduce the amount of logging [puppet] - https://gerrit.wikimedia.org/r/458633 (https://phabricator.wikimedia.org/T201021)
[05:17:53] (CR) Marostegui: [C: 2] dbproxy: Reduce the amount of logging [puppet] - https://gerrit.wikimedia.org/r/458633 (https://phabricator.wikimedia.org/T201021) (owner: Marostegui)
[05:19:07] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[05:21:07] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0
[05:21:37] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 239, down: 1, dormant: 0, excluded: 0, unused: 0
[05:22:16] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 23 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[05:24:16] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[05:27:17] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 17 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[05:33:16] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0
[05:36:02] (PS3) Legoktm: Update npm to 6.4.0 [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/453666 (https://phabricator.wikimedia.org/T169451)
[05:36:16] (PS4) Legoktm: Update npm to 6.4.0 [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/453666 (https://phabricator.wikimedia.org/T169451)
[05:36:36] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 319 (alerts on 19) -
https://atlas.ripe.net/measurements/1790947/#!map
[05:41:37] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 17 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[05:47:03] (CR) Giuseppe Lavagetto: [C: -1] "See the inline comments." (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/458623 (https://phabricator.wikimedia.org/T97192) (owner: Tim Starling)
[05:59:29] (CR) Tim Starling: Set PHP time limit (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/458623 (https://phabricator.wikimedia.org/T97192) (owner: Tim Starling)
[06:18:28] (PS3) Tim Starling: Set PHP time limit [mediawiki-config] - https://gerrit.wikimedia.org/r/458623 (https://phabricator.wikimedia.org/T97192)
[06:21:39] (CR) Giuseppe Lavagetto: [C: 1] Set PHP time limit [mediawiki-config] - https://gerrit.wikimedia.org/r/458623 (https://phabricator.wikimedia.org/T97192) (owner: Tim Starling)
[06:22:38] (CR) Giuseppe Lavagetto: [C: 1] Increase Apache timeouts by 2 seconds [puppet] - https://gerrit.wikimedia.org/r/458624 (https://phabricator.wikimedia.org/T97192) (owner: Tim Starling)
[06:30:27] PROBLEM - puppet last run on labvirt1017 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/puppet-enabled]
[06:43:59] Operations, Availability (MediaWiki-MultiDC), Patch-For-Review, Performance-Team (Radar): Deploy mcrouter to production as a wancache backend - https://phabricator.wikimedia.org/T192370 (Joe) >>! In T192370#4564888, @Krinkle wrote: > Note that `memcached-pecl` (which uses Nutcracker) is still use...
[06:47:34] (PS2) Tim Starling: Increase Apache timeouts by 2 seconds [puppet] - https://gerrit.wikimedia.org/r/458624 (https://phabricator.wikimedia.org/T97192)
[06:49:17] !log rebooting mw2240-mw2269 for kernel security updates
[06:49:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:50:37] (CR) Tim Starling: [C: 2] Increase Apache timeouts by 2 seconds [puppet] - https://gerrit.wikimedia.org/r/458624 (https://phabricator.wikimedia.org/T97192) (owner: Tim Starling)
[06:50:46] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 21 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[06:51:58] !log rebooting mw2240-mw2269 for kernel security updates
[06:52:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:55:48] RECOVERY - puppet last run on labvirt1017 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[06:55:48] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[07:03:07] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 21 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[07:03:42] (CR) Jcrespo: "Will we get with the cli "if" and other exceptions, aside from the jobqueue below, the cronjobs and dumps?
Sorry, I don't know much about " [mediawiki-config] - https://gerrit.wikimedia.org/r/458623 (https://phabricator.wikimedia.org/T97192) (owner: Tim Starling)
[07:08:17] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[07:09:16] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 21 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[07:14:26] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 18 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[07:15:37] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[07:18:55] (CR) Banyek: [C: 2] Labs: Make redact_sanitarium.sh file easier to read [puppet] - https://gerrit.wikimedia.org/r/457899 (owner: Banyek)
[07:19:38] (PS4) Banyek: Labs: Make redact_sanitarium.sh file easier to read [puppet] - https://gerrit.wikimedia.org/r/457899
[07:19:56] (CR) Banyek: [V: 2 C: 2] Labs: Make redact_sanitarium.sh file easier to read [puppet] - https://gerrit.wikimedia.org/r/457899 (owner: Banyek)
[07:20:47] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[07:22:27] (CR) Banyek: [C: 2] tendril: Add monitoring for authorization check [puppet] - https://gerrit.wikimedia.org/r/456099 (https://phabricator.wikimedia.org/T149340) (owner: Banyek)
[07:23:16] (PS8) Banyek: tendril: Add monitoring for authorization check [puppet] - https://gerrit.wikimedia.org/r/456099 (https://phabricator.wikimedia.org/T149340)
[07:23:28] (CR) Banyek: [V: 2 C: 2] tendril: Add monitoring for authorization check [puppet] - https://gerrit.wikimedia.org/r/456099
(https://phabricator.wikimedia.org/T149340) (owner: Banyek)
[07:24:42] !log rebooting mw2270-mw2290 for kernel security updates
[07:24:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:25:31] (CR) Krinkle: "Might be worth considering to add to PhpAutoPrepend so that it applies before Multiversion and MediaWiki instead." [mediawiki-config] - https://gerrit.wikimedia.org/r/458623 (https://phabricator.wikimedia.org/T97192) (owner: Tim Starling)
[07:27:18] Operations, MediaWiki-API, Availability, HHVM, and 5 others: HHVM request timeouts not working; support lowering the API request timeout per request - https://phabricator.wikimedia.org/T97192 (Krinkle)
[07:33:16] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[07:38:17] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[07:45:36] Operations, ORES, Patch-For-Review, Scoring-platform-team (Current): Spin up a new poolcounter node for ores - https://phabricator.wikimedia.org/T201824 (akosiaris) Open>Resolved a:akosiaris Done in T203465
[07:50:37] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[07:54:33] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 20 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[07:55:43] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[07:59:14] ^apparently network maintenance on ulsfo is ongoing
[07:59:31] speak up if you think it is unrelated to that
[07:59:33] RECOVERY - IPv6 ping to codfw on
ripe-atlas-codfw IPv6 is OK: OK - failed 17 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[08:07:43] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[08:07:44] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 21 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[08:12:53] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[08:16:14] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[08:16:59] !log generating false alert about https auth on dbmonitor1001
[08:17:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:05] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[08:33:09] !log Rebooting haproxies to pick up new config after all the tests - T201021
[08:33:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:14] T201021: HAproxy on dbproxy hosts lack enough logging - https://phabricator.wikimedia.org/T201021
[08:41:10] (PS1) Banyek: Tendril: Renaming check about authorization required [puppet] - https://gerrit.wikimedia.org/r/458729
[08:43:57] Operations, Services (watching): Create nodejs 10 packages - https://phabricator.wikimedia.org/T203239 (MoritzMuehlenhoff) nodejs 10 packages will be in a separate repository component, allowing applications to gradually move over.
We'll continue to support nodejs 6 with security updates until all applic...
[08:44:46] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[08:45:39] !log reloading apache with bad config for tendril for testing (small downtime)
[08:45:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:47:41] (CR) Muehlenhoff: [C: 1] "Looks good. You can ignore the failing tests, it unrelated and caused by T203645" [puppet] - https://gerrit.wikimedia.org/r/458519 (https://phabricator.wikimedia.org/T203087) (owner: Elukey)
[08:47:44] passive checks are awol, T196336 perhaps?
[08:47:45] T196336: Icinga passive checks go awal and downtime stops working - https://phabricator.wikimedia.org/T196336
[08:48:51] Operations, Puppet: Debian package or files managed my puppet for pt-kill-wmf - https://phabricator.wikimedia.org/T203674 (Joe) >>! In T203674#4563061, @jcrespo wrote: > So my initial suggestion was to create a debian package for the following reasons: > > * Source control patches on a separate repo so...
[08:49:45] !log passive checks awol on einsteinium, restarting icinga -- T196336
[08:49:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:50:19] (CR) Gehel: Elasticsearch module is coming up.
(18 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) (owner: Mathew.onipe)
[08:52:01] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[08:56:52] (CR) Jcrespo: [C: 1] Tendril: Renaming check about authorization required [puppet] - https://gerrit.wikimedia.org/r/458729 (owner: Banyek)
[08:57:39] (CR) Banyek: [C: 2] Tendril: Renaming check about authorization required [puppet] - https://gerrit.wikimedia.org/r/458729 (owner: Banyek)
[08:59:26] Operations, DBA, Patch-For-Review: Prepare mysql hosts for stretch - https://phabricator.wikimedia.org/T168356 (jcrespo)
[08:59:29] Operations, DBA, Patch-For-Review: Cleanup or remove mysql puppet module; repurpose mariadb module to cover misc use cases - https://phabricator.wikimedia.org/T162070 (jcrespo)
[08:59:36] Operations, Cloud-Services, Community-Wikimetrics, DBA, and 2 others: Evaluate future of wmf puppet module "mysql" - https://phabricator.wikimedia.org/T165625 (jcrespo) Open>Resolved a:jcrespo This has been evaluate, mysql will disappear, leaving this for historical documentation, for...
[09:02:56] (PS1) Muehlenhoff: Remove host-specific Hiera settings for mw1307/mw1318 [puppet] - https://gerrit.wikimedia.org/r/458739
[09:04:56] Hallo
[09:05:50] zeljkof, apergos , I need some urgent help. Interlanguage links are completely broken: https://phabricator.wikimedia.org/T203750
[09:06:26] There's a patch that reverts what broke them: https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/UniversalLanguageSelector/+/458725/
[09:06:31] It's merged.
[09:06:41] Operations, DBA, monitoring, Patch-For-Review: HAproxy on dbproxy hosts lack enough logging - https://phabricator.wikimedia.org/T201021 (Marostegui) Open>Resolved I have reloaded haproxy everywhere and it is now under control and not logging every single second :-). Also purged all the l...
[09:06:54] Is there a way to deploy it urgently? I know it's Friday, but a major feature is broken.
[09:09:20] we can surely get it done today. let's see who deployers are in the eu timezone
[09:10:35] o/
[09:10:39] aharoni: apergos I can do it
[09:10:43] hey hashar
[09:10:52] I am on train duty this week
[09:11:01] cool, I suppose I should look at the ticket (I have not read it closely yet)
[09:11:22] aharoni: is the fix confirmed on beta cluster?
[09:11:33] (CR) DCausse: [C: 1] "lgtm if this state will last for a very short period" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/440498 (https://phabricator.wikimedia.org/T198351) (owner: EBernhardson)
[09:11:56] I'm around if needed too
[09:12:10] cherry picked at https://gerrit.wikimedia.org/r/c/mediawiki/extensions/UniversalLanguageSelector/+/458743
[09:12:18] will +2 then deploy on mwdebug1001
[09:13:27] hashar: haven't tried on beta
[09:13:47] I +2ed it
[09:14:12] that will merge eventually
[09:15:09] hashar: what's the URL for the beta cluster? I get lost with all the different URLs for beta, test, labs, etc.
[09:15:24] aharoni: https://en.wikipedia.beta.wmflabs.org/wiki/Main_Page
[09:15:27] which runs out of master
[09:15:42] (it is auto updated from git master branches every 10 minutes or so )
[09:17:13] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[09:17:58] what page is going to have a 'more' link for languages, that's maybe harder in beta
[09:19:22] apergos: yeah... I'm trying to emulate something
[09:20:03] Nope...
Not a lot of interlanguage links on beta [09:20:07] bleh [09:20:12] not so much, no [09:20:19] How do the interlanguage links work there on the Main page? [09:20:38] it looks like that's soecial [09:20:43] special [09:24:30] hashar: so, are we just waiting? or do you need to run scap or any scripts or something? [09:24:33] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 21 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [09:24:46] aharoni: yeah waiting for patch to merge [09:24:50] https://integration.wikimedia.org/ci/job/wmf-quibble-vendor-mysql-hhvm-docker/5102/console [09:24:53] it is almost done [09:26:05] cool [09:27:20] https://integration.wikimedia.org/zuul/ watching it here [09:30:14] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 20 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [09:32:15] looks like it's done [09:32:44] the CI instance is too slow :\\\ [09:33:05] (03PS3) 10Ema: ATS: specify mapping rules for all text/upload backends [puppet] - 10https://gerrit.wikimedia.org/r/458536 (https://phabricator.wikimedia.org/T199720) [09:34:00] (03CR) 10Ema: [C: 032] ATS: specify mapping rules for all text/upload backends [puppet] - 10https://gerrit.wikimedia.org/r/458536 (https://phabricator.wikimedia.org/T199720) (owner: 10Ema) [09:34:06] aharoni: deploying to mwdebug1001 [09:34:07] 10Operations, 10Puppet: Debian package or files managed my puppet for pt-kill-wmf - https://phabricator.wikimedia.org/T203674 (10jcrespo) > This is not true if a binary debian package is built, as proposed. In fact, you can consider a binary-only package (built with dpkg-deb) a glorified tarball. I don't see...
[09:34:44] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [09:34:56] aharoni: it is live on mwdebug1001 [09:35:24] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 18 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [09:35:51] (03Abandoned) 10Muehlenhoff: Disable fetching the netboot image via HTTP for cloudvirt1023 [puppet] - 10https://gerrit.wikimedia.org/r/458463 (https://phabricator.wikimedia.org/T199125) (owner: 10Muehlenhoff) [09:35:54] aharoni: with firefox, that works on mwdebug1001 now :] [09:37:05] hashar: ah, drat, I don't have the mwdebug1001 extension on my browser now [09:37:22] well I tested the use case from the task [09:37:27] looks good to me so I am syncing it [09:37:38] hashar: but if you tested it, and you can click the "More" button and the panel with the languages list opens, then it's good [09:37:42] thank you so much [09:38:25] aharoni: and under firefox that gave me an exception. I have added it to the task https://phabricator.wikimedia.org/T203750#4565583 [09:38:52] !log hashar@deploy1001 Synchronized php-1.32.0-wmf.20/extensions/UniversalLanguageSelector: Revert "Simplify by using native JavaScript instead of jQuery" - T203750 (duration: 00m 55s) [09:38:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:57] T203750: Button to open complete list of languages is no longer functional on en-Wiki - https://phabricator.wikimedia.org/T203750 [09:39:38] hashar: the exception was before deploying the patch, right? 
[09:41:01] (03PS1) 10Giuseppe Lavagetto: spec: monkey-patch the puppet gem to work like the debian package [puppet] - 10https://gerrit.wikimedia.org/r/458750 [09:41:03] (03PS1) 10Giuseppe Lavagetto: authdns: fix spec tests [puppet] - 10https://gerrit.wikimedia.org/r/458751 [09:41:05] (03PS1) 10Giuseppe Lavagetto: httpd: add monkey-patching of the service provider in tests [puppet] - 10https://gerrit.wikimedia.org/r/458752 [09:41:07] (03PS1) 10Giuseppe Lavagetto: install_server: add monkey-patching of the service provider in tests [puppet] - 10https://gerrit.wikimedia.org/r/458753 [09:41:09] (03PS1) 10Giuseppe Lavagetto: puppetmaster: add monkey-patching of the service provider in tests [puppet] - 10https://gerrit.wikimedia.org/r/458754 [09:41:11] (03PS1) 10Giuseppe Lavagetto: Rakefile: add lvm (a third-party module) to the ignored modules list [puppet] - 10https://gerrit.wikimedia.org/r/458755 [09:41:39] <_joe_> hashar: ^^ for your pleasure, sire [09:42:13] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 21 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [09:44:57] aharoni: yeah that was before the patch [09:45:10] _joe_: will look at them in a few :] [09:45:42] <_joe_> hashar: no rush, just making you aware [09:45:56] I now get a popup in chrome, I assume that's the desired behavior [09:46:43] from the 'more' link on en wp teahouse [09:47:39] hashar: tested in production. works. thanks!!! [09:48:40] yep thanks a lot [09:52:23] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [09:54:41] _joe_: +require_relative '../../../rake_modules/fix_service_provider' [09:54:41] .... [09:54:52] yet another unknown ruby function. 
That is creative :] [09:56:10] <_joe_> to avoid too much copy/pasta [09:57:12] (03PS1) 10Ema: cache_text: remove grafana-admin request handling [puppet] - 10https://gerrit.wikimedia.org/r/458764 (https://phabricator.wikimedia.org/T170150) [09:58:14] an alternative I thought about is to create a local gem and reference it in Gemfile. Bundler would then always load it :] This way we don't even need to add the require_relative everywhere [10:00:44] (03CR) 10Hashar: [C: 031] "I love how that avoids repeating the code :]" [puppet] - 10https://gerrit.wikimedia.org/r/458750 (owner: 10Giuseppe Lavagetto) [10:01:30] (03CR) 10Alexandros Kosiaris: [C: 031] cache_text: remove grafana-admin request handling [puppet] - 10https://gerrit.wikimedia.org/r/458764 (https://phabricator.wikimedia.org/T170150) (owner: 10Ema) [10:01:52] (03CR) 10Ema: [C: 032] cache_text: remove grafana-admin request handling [puppet] - 10https://gerrit.wikimedia.org/r/458764 (https://phabricator.wikimedia.org/T170150) (owner: 10Ema) [10:03:18] !log ladsgroup@mwmaint1001:~$ mwscript extensions/CentralAuth/maintenance/deleteLocalPasswords.php --wiki=fawiki --user Ladsgroup --prefix (T201009) [10:03:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:24] T201009: Run deleteLocalPasswords.php in WMF prod (Central Auth wikis only!)
after 1.32.0-wmf.16 is everywhere - https://phabricator.wikimedia.org/T201009 [10:03:56] (03PS3) 10Elukey: Remove meitnerium (old archiva host) from puppet [puppet] - 10https://gerrit.wikimedia.org/r/458519 (https://phabricator.wikimedia.org/T203087) [10:04:41] (03CR) 10jerkins-bot: [V: 04-1] Remove meitnerium (old archiva host) from puppet [puppet] - 10https://gerrit.wikimedia.org/r/458519 (https://phabricator.wikimedia.org/T203087) (owner: 10Elukey) [10:04:44] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [10:05:23] (03CR) 10Giuseppe Lavagetto: [C: 032] spec: monkey-patch the puppet gem to work like the debian package [puppet] - 10https://gerrit.wikimedia.org/r/458750 (owner: 10Giuseppe Lavagetto) [10:05:32] (03CR) 10Elukey: [C: 031] Remove host-specific Hiera settings for mw1307/mw1318 [puppet] - 10https://gerrit.wikimedia.org/r/458739 (owner: 10Muehlenhoff) [10:05:37] (03PS2) 10Giuseppe Lavagetto: spec: monkey-patch the puppet gem to work like the debian package [puppet] - 10https://gerrit.wikimedia.org/r/458750 [10:05:50] (03CR) 10Hashar: [C: 04-1] authdns: fix spec tests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/458751 (owner: 10Giuseppe Lavagetto) [10:06:42] (03CR) 10Hashar: [C: 031] httpd: add monkey-patching of the service provider in tests [puppet] - 10https://gerrit.wikimedia.org/r/458752 (owner: 10Giuseppe Lavagetto) [10:08:13] (03CR) 10Hashar: [C: 031] install_server: add monkey-patching of the service provider in tests [puppet] - 10https://gerrit.wikimedia.org/r/458753 (owner: 10Giuseppe Lavagetto) [10:08:45] (03CR) 10Hashar: [C: 031] puppetmaster: add monkey-patching of the service provider in tests [puppet] - 10https://gerrit.wikimedia.org/r/458754 (owner: 10Giuseppe Lavagetto) [10:09:58] (03PS1) 10Vgutierrez: Make configurable the cmd executed to perform a DNS zone update [software/certcentral] - 
10https://gerrit.wikimedia.org/r/458767 (https://phabricator.wikimedia.org/T203678) [10:10:08] (03PS3) 10Urbanecm: Create new namespaces in zhwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455395 (https://phabricator.wikimedia.org/T201675) [10:10:19] (03CR) 10jerkins-bot: [V: 04-1] Create new namespaces in zhwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455395 (https://phabricator.wikimedia.org/T201675) (owner: 10Urbanecm) [10:11:33] (03CR) 10Hashar: [C: 031] Rakefile: add lvm (a third-party module) to the ignored modules list [puppet] - 10https://gerrit.wikimedia.org/r/458755 (owner: 10Giuseppe Lavagetto) [10:12:10] _joe_: all reviewed/tested. authdns sets the :initsystem fact which might be a leftover ( https://gerrit.wikimedia.org/r/458751 ) [10:12:39] <_joe_> hashar: not really, but maybe it can be removed [10:12:49] still passes if removed [10:12:51] at least locally [10:12:53] <_joe_> ok [10:12:57] <_joe_> then let's remove it [10:13:23] (03PS1) 10Ema: cache_text: remove lab{,test}spice [puppet] - 10https://gerrit.wikimedia.org/r/458769 [10:14:43] 10Operations, 10Wikimedia-Mailing-lists, 10Bengali-Sites: Set up mailing list for Bengali Wikibooks - https://phabricator.wikimedia.org/T203736 (10Aklapper) Hi @Shahadat. Please see https://meta.wikimedia.org/wiki/Mailing_lists#Create_a_new_list for required information. Thanks! 
[10:14:55] 10Operations, 10Wikimedia-Mailing-lists, 10Bengali-Sites: Set up mailing list for Bengali Wikibooks - https://phabricator.wikimedia.org/T203736 (10Aklapper) [10:20:03] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [10:20:41] (03PS1) 10Ladsgroup: Add $wgPasswordConfig['null'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/458770 (https://phabricator.wikimedia.org/T201009) [10:21:04] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [10:21:59] _joe_: ^ maybe that's your change? [10:22:08] <_joe_> argh, yes [10:22:13] <_joe_> merged [10:22:19] thanks! [10:22:34] <_joe_> as usual, with noop changes for production like spec things, I forget to hit enter on puppet-merge :P [10:23:23] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. 
[10:24:30] (03PS2) 10Giuseppe Lavagetto: httpd: add monkey-patching of the service provider in tests [puppet] - 10https://gerrit.wikimedia.org/r/458752 [10:24:32] (03PS2) 10Giuseppe Lavagetto: install_server: add monkey-patching of the service provider in tests [puppet] - 10https://gerrit.wikimedia.org/r/458753 [10:24:34] (03PS2) 10Giuseppe Lavagetto: puppetmaster: add monkey-patching of the service provider in tests [puppet] - 10https://gerrit.wikimedia.org/r/458754 [10:24:36] (03PS2) 10Giuseppe Lavagetto: Rakefile: add lvm (a third-party module) to the ignored modules list [puppet] - 10https://gerrit.wikimedia.org/r/458755 [10:24:38] (03PS2) 10Giuseppe Lavagetto: authdns: fix spec tests [puppet] - 10https://gerrit.wikimedia.org/r/458751 [10:25:34] (03CR) 10Giuseppe Lavagetto: [C: 032] httpd: add monkey-patching of the service provider in tests [puppet] - 10https://gerrit.wikimedia.org/r/458752 (owner: 10Giuseppe Lavagetto) [10:25:51] (03CR) 10Giuseppe Lavagetto: [C: 032] install_server: add monkey-patching of the service provider in tests [puppet] - 10https://gerrit.wikimedia.org/r/458753 (owner: 10Giuseppe Lavagetto) [10:25:53] (03PS2) 10Vgutierrez: Make configurable the cmd executed to perform a DNS zone update [software/certcentral] - 10https://gerrit.wikimedia.org/r/458767 (https://phabricator.wikimedia.org/T203678) [10:26:05] (03CR) 10Giuseppe Lavagetto: [C: 032] puppetmaster: add monkey-patching of the service provider in tests [puppet] - 10https://gerrit.wikimedia.org/r/458754 (owner: 10Giuseppe Lavagetto) [10:26:21] (03CR) 10Giuseppe Lavagetto: [C: 032] Rakefile: add lvm (a third-party module) to the ignored modules list [puppet] - 10https://gerrit.wikimedia.org/r/458755 (owner: 10Giuseppe Lavagetto) [10:27:17] 10Operations, 10Goal: Successfully switch backend traffic (MediaWiki, Swift, RESTBase, Parsoid and services) to be served from codfw - https://phabricator.wikimedia.org/T203776 (10akosiaris) p:05Triage>03High [10:28:13] (03CR) 10Giuseppe Lavagetto: 
authdns: fix spec tests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/458751 (owner: 10Giuseppe Lavagetto) [10:28:37] 10Operations, 10Goal: Successfully switch backend traffic (MediaWiki, Swift, RESTBase, Parsoid and services) to be served from codfw - https://phabricator.wikimedia.org/T203776 (10akosiaris) [10:28:39] 10Operations, 10DBA, 10Epic, 10Patch-For-Review: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107 (10akosiaris) [10:29:01] 10Operations, 10Goal: Successfully switch backend traffic (MediaWiki, Swift, RESTBase, Parsoid and services) to be served from codfw - https://phabricator.wikimedia.org/T203776 (10akosiaris) [10:29:04] 10Operations, 10DBA, 10Epic, 10Patch-For-Review: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107 (10akosiaris) [10:29:47] 10Operations, 10Goal: Successfully switch backend traffic (MediaWiki, Swift, RESTBase, Parsoid and services) to be served from codfw - https://phabricator.wikimedia.org/T203776 (10akosiaris) [10:29:50] 10Operations, 10DBA, 10Epic, 10Patch-For-Review: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107 (10akosiaris) [10:32:24] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [10:37:32] (03PS1) 10Alexandros Kosiaris: cache::text: switch mediawiki to codfw [puppet] - 10https://gerrit.wikimedia.org/r/458772 (https://phabricator.wikimedia.org/T203776) [10:37:36] (03PS1) 10Alexandros Kosiaris: cache::text: switch mediawiki to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/458773 (https://phabricator.wikimedia.org/T203777) [10:51:04] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 241, down: 0, dormant: 0, excluded: 0, unused: 0 [10:52:43] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 319 (alerts on 19) 
- https://atlas.ripe.net/measurements/1790947/#!map [11:00:03] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 21 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [11:01:04] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 239, down: 1, dormant: 0, excluded: 0, unused: 0 [11:06:50] (03PS1) 10Effie Mouzeli: admin: Added dotfiles for user jiji [puppet] - 10https://gerrit.wikimedia.org/r/458776 (https://phabricator.wikimedia.org/T201816) [11:07:36] (03CR) 10Effie Mouzeli: [C: 032] admin: Added dotfiles for user jiji [puppet] - 10https://gerrit.wikimedia.org/r/458776 (https://phabricator.wikimedia.org/T201816) (owner: 10Effie Mouzeli) [11:07:43] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 241, down: 0, dormant: 0, excluded: 0, unused: 0 [11:15:23] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [11:27:44] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [11:32:44] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [11:32:50] 10Operations: Support for QLogic FastLinQ 41112 Dual Port 10Gb SFP+ Adapter - https://phabricator.wikimedia.org/T202255 (10MoritzMuehlenhoff) I now have a stretch netboot image with a 4.14 kernel which PXE boots via the QLogic 41xx adapter. In d-i I'm getting a strange error message which tells me that no module... 
[11:40:04] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [11:45:14] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [11:48:40] 10Operations, 10Analytics, 10Patch-For-Review: Decommission Ganeti vm meitnerium.wikimedia.org (old Archiva host) - https://phabricator.wikimedia.org/T203087 (10elukey) [11:51:00] (03PS1) 10Elukey: Decommission meitnerium (old archiva host) [dns] - 10https://gerrit.wikimedia.org/r/458783 (https://phabricator.wikimedia.org/T203087) [11:52:48] 10Operations, 10Analytics, 10Patch-For-Review: Decommission Ganeti vm meitnerium.wikimedia.org (old Archiva host) - https://phabricator.wikimedia.org/T203087 (10elukey) [11:54:54] PROBLEM - puppet last run on wdqs1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:57:24] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [11:59:10] this one is mcrouter not able to talk with mc1035 -^ [11:59:34] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [12:11:47] 10Operations, 10MediaWiki-Cache: Mcrouter periodically reports soft TKOs for mc1035 leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10elukey) p:05Triage>03Normal [12:20:24] RECOVERY - puppet last run on wdqs1005 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:24:24] (03PS1) 10Alexandros Kosiaris: db: Switch dns master alias to codfw [dns] - 
10https://gerrit.wikimedia.org/r/458787 (https://phabricator.wikimedia.org/T203776) [12:24:37] (03CR) 10jerkins-bot: [V: 04-1] db: Switch dns master alias to codfw [dns] - 10https://gerrit.wikimedia.org/r/458787 (https://phabricator.wikimedia.org/T203776) (owner: 10Alexandros Kosiaris) [12:27:00] (03PS2) 10Alexandros Kosiaris: db: Switch dns master alias to codfw [dns] - 10https://gerrit.wikimedia.org/r/458787 (https://phabricator.wikimedia.org/T203776) [12:28:07] (03CR) 10Jcrespo: [C: 04-1] "Note this points to codfw hosts but with an eqiad domain. Thanks for taking the time to do this- although not sure if we should in the fut" [dns] - 10https://gerrit.wikimedia.org/r/458787 (https://phabricator.wikimedia.org/T203776) (owner: 10Alexandros Kosiaris) [12:28:48] (03CR) 10Jcrespo: [C: 031] db: Switch dns master alias to codfw [dns] - 10https://gerrit.wikimedia.org/r/458787 (https://phabricator.wikimedia.org/T203776) (owner: 10Alexandros Kosiaris) [12:35:36] !log reboot kafka200[2,3] (eventbus codfw) for kernel + openjdk-8 upgrades [12:35:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:51] (03PS1) 10Alexandros Kosiaris: db: Switch dns master alias to eqiad" [dns] - 10https://gerrit.wikimedia.org/r/458790 (https://phabricator.wikimedia.org/T203776) [12:37:26] (03PS2) 10Alexandros Kosiaris: db: Switch dns master alias to eqiad" [dns] - 10https://gerrit.wikimedia.org/r/458790 (https://phabricator.wikimedia.org/T203777) [12:38:20] (03PS3) 10Alexandros Kosiaris: db: Switch dns master alias to eqiad [dns] - 10https://gerrit.wikimedia.org/r/458790 (https://phabricator.wikimedia.org/T203777) [12:50:28] (03PS1) 10Giuseppe Lavagetto: sre.switchdc.mediawiki: Do not restart memcached before warming up [cookbooks] - 10https://gerrit.wikimedia.org/r/458793 [12:51:11] 10Operations, 10Wiki-Loves-Love, 10Wikimedia-Mailing-lists: Create a mailling list for Wiki Loves Love - https://phabricator.wikimedia.org/T203792 (10Psychoslave) [12:51:41] 
(03PS1) 10Alexandros Kosiaris: cache::upload: Switch swift temporarily to active/active [puppet] - 10https://gerrit.wikimedia.org/r/458794 (https://phabricator.wikimedia.org/T203776) [12:51:43] (03PS1) 10Alexandros Kosiaris: cache::upload: Move swift to codfw [puppet] - 10https://gerrit.wikimedia.org/r/458795 (https://phabricator.wikimedia.org/T203776) [12:51:45] (03PS1) 10Alexandros Kosiaris: cache::upload: Move swift to active/active [puppet] - 10https://gerrit.wikimedia.org/r/458796 (https://phabricator.wikimedia.org/T203777) [12:51:47] (03PS1) 10Alexandros Kosiaris: cache::upload: Move swift to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/458797 (https://phabricator.wikimedia.org/T203777) [12:52:14] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [12:54:33] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [12:56:57] (03CR) 10Jcrespo: [C: 031] sre.switchdc.mediawiki: Do not restart memcached before warming up [cookbooks] - 10https://gerrit.wikimedia.org/r/458793 (owner: 10Giuseppe Lavagetto) [13:01:21] (03CR) 10Marostegui: "Just a small comment, which is not a big deal and can be done after the failover really" (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/458787 (https://phabricator.wikimedia.org/T203776) (owner: 10Alexandros Kosiaris) [13:03:06] (03PS2) 10Muehlenhoff: Remove host-specific Hiera settings for mw1307/mw1318 [puppet] - 10https://gerrit.wikimedia.org/r/458739 [13:05:58] (03CR) 10Muehlenhoff: [C: 032] Remove host-specific Hiera settings for mw1307/mw1318 [puppet] - 10https://gerrit.wikimedia.org/r/458739 (owner: 10Muehlenhoff) [13:08:04] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is 
CRITICAL: CRITICAL - failed 20 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [13:09:48] (03PS1) 10Muehlenhoff: Add Cumin aliases for ATS [puppet] - 10https://gerrit.wikimedia.org/r/458800 [13:13:13] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [13:13:53] (03PS1) 10Alexandros Kosiaris: cache::text: Switch restbase active/active [puppet] - 10https://gerrit.wikimedia.org/r/458802 (https://phabricator.wikimedia.org/T203776) [13:13:55] (03PS1) 10Alexandros Kosiaris: cache::text, cache::upload: Switch services to codfw [puppet] - 10https://gerrit.wikimedia.org/r/458803 (https://phabricator.wikimedia.org/T203776) [13:13:57] (03PS1) 10Alexandros Kosiaris: cache::text, cache::upload: Switch services to a/a [puppet] - 10https://gerrit.wikimedia.org/r/458804 (https://phabricator.wikimedia.org/T203777) [13:13:59] (03PS1) 10Alexandros Kosiaris: cache::text: Switch restbase to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/458805 (https://phabricator.wikimedia.org/T203777) [13:21:00] (03PS1) 10Alexandros Kosiaris: traffic: Depool eqiad from user traffic for switchover [dns] - 10https://gerrit.wikimedia.org/r/458806 (https://phabricator.wikimedia.org/T203776) [13:21:02] (03PS1) 10Alexandros Kosiaris: Revert "traffic: Depool eqiad from user traffic for switchover" [dns] - 10https://gerrit.wikimedia.org/r/458807 (https://phabricator.wikimedia.org/T203777) [13:21:18] (03PS1) 10Alexandros Kosiaris: traffic: route esams via codfw [puppet] - 10https://gerrit.wikimedia.org/r/458808 (https://phabricator.wikimedia.org/T203776) [13:21:20] (03PS1) 10Alexandros Kosiaris: Revert "traffic: route esams via codfw" [puppet] - 10https://gerrit.wikimedia.org/r/458809 (https://phabricator.wikimedia.org/T203777) [13:24:23] (03PS1) 10Banyek: Labs: Config template generation for pt-kill [puppet] - 10https://gerrit.wikimedia.org/r/458810 [13:25:03] (03CR) 
10jerkins-bot: [V: 04-1] Labs: Config template generation for pt-kill [puppet] - 10https://gerrit.wikimedia.org/r/458810 (owner: 10Banyek) [13:25:23] PROBLEM - HHVM jobrunner on mw1318 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [13:25:34] <_joe_> akosiaris: I'm not sure I understand what the plan for services is [13:26:05] what do you mean? [13:26:07] <_joe_> oh sorry [13:26:10] <_joe_> new gerrit UI [13:26:24] <_joe_> inverts the order of changes if you don't look in the right places [13:26:24] RECOVERY - HHVM jobrunner on mw1318 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time [13:26:24] :) [13:26:27] <_joe_> sigh [13:26:32] yeah it confuses me at times [13:26:36] <_joe_> you have two lists of the same changes [13:26:38] I do find it a tad better than the old one [13:26:38] <_joe_> one ordered [13:26:43] <_joe_> one unordered [13:26:45] <_joe_> me too [13:28:07] _joe_: send them over and I'll take a look later [13:28:15] (03CR) 10Muehlenhoff: [C: 031] Decommission meitnerium (old archiva host) [dns] - 10https://gerrit.wikimedia.org/r/458783 (https://phabricator.wikimedia.org/T203087) (owner: 10Elukey) [13:28:25] (03CR) 10Jcrespo: "I know this is a quick first patch, but I hope these suggestions are useful." (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/458810 (owner: 10Banyek) [13:30:22] <_joe_> akosiaris: should I add a stupid recipe to: puppet-merge, disable puppet on the caches in both eqiad and codfw, run puppet there, then do the same in the other?
[13:30:40] <_joe_> although your changes can all be applied everywhere at the same time [13:30:44] <_joe_> there is no reason for that [13:31:46] (03PS2) 10Banyek: Labs: Config template generation for pt-kill [puppet] - 10https://gerrit.wikimedia.org/r/458810 [13:32:16] (03CR) 10jerkins-bot: [V: 04-1] Labs: Config template generation for pt-kill [puppet] - 10https://gerrit.wikimedia.org/r/458810 (owner: 10Banyek) [13:38:26] (03PS1) 10Muehlenhoff: Add new meta package for Linux 4.14 [debs/linux-meta] - 10https://gerrit.wikimedia.org/r/458812 [13:41:01] (03PS1) 10Muehlenhoff: Remove obsolete Hiera host entries [puppet] - 10https://gerrit.wikimedia.org/r/458814 [13:48:29] (03CR) 10Mark Bergsma: Don't recalculate server.up in refreshPreexistingServers (031 comment) [debs/pybal] - 10https://gerrit.wikimedia.org/r/447766 (owner: 10Mark Bergsma) [13:48:50] (03CR) 10Ema: [C: 031] "Nice, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/458800 (owner: 10Muehlenhoff) [13:49:49] (03CR) 10Muehlenhoff: [C: 032] Add new meta package for Linux 4.14 [debs/linux-meta] - 10https://gerrit.wikimedia.org/r/458812 (owner: 10Muehlenhoff) [13:52:09] (03PS1) 10Elukey: profile::analytics::systemd_timer: add monitoring [puppet] - 10https://gerrit.wikimedia.org/r/458815 (https://phabricator.wikimedia.org/T172532) [13:55:10] (03PS2) 10Elukey: profile::analytics::systemd_timer: add monitoring [puppet] - 10https://gerrit.wikimedia.org/r/458815 (https://phabricator.wikimedia.org/T172532) [13:57:44] (03CR) 10Elukey: "The only weird thing is the naming of the alarms: https://puppet-compiler.wmflabs.org/compiler1002/12393/analytics1003.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/458815 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [14:00:34] (03CR) 10Ottomata: [C: 031] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/458815 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [14:02:36] (03CR) 10Elukey: "> Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/458815 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [14:03:19] (03CR) 10Ottomata: [C: 031] "Coo" [puppet] - 10https://gerrit.wikimedia.org/r/458815 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [14:06:26] 10Operations, 10Continuous-Integration-Infrastructure, 10Mail, 10Release-Engineering-Team, and 2 others: Ensure Jenkins mail configuration supports outbound smtp server failover - https://phabricator.wikimedia.org/T203607 (10herron) >>! In T203607#4564546, @hashar wrote: > I am definitely a fan of having J... [14:16:47] 10Operations, 10netops: Intermittent connectivity issues in eqiad's row C - https://phabricator.wikimedia.org/T201139 (10faidon) Has anything happened on this? IIRC at our meetings we talked about investigating this further e.g. with the help of JTAC, and exploring whether we should disable the JunOS' DDoS pro... [14:20:24] PROBLEM - HHVM rendering on mw2216 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:21:12] (03CR) 10Elukey: [C: 032] profile::analytics::systemd_timer: add monitoring [puppet] - 10https://gerrit.wikimedia.org/r/458815 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [14:21:24] RECOVERY - HHVM rendering on mw2216 is OK: HTTP OK: HTTP/1.1 200 OK - 80161 bytes in 0.309 second response time [14:23:38] (03PS1) 10Elukey: role::analytics_cluster_coordinator: reduce notifications for analytics [puppet] - 10https://gerrit.wikimedia.org/r/458824 (https://phabricator.wikimedia.org/T172532) [14:24:13] (03CR) 10Elukey: [C: 032] role::analytics_cluster_coordinator: reduce notifications for analytics [puppet] - 10https://gerrit.wikimedia.org/r/458824 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [14:27:25] !log installing libtirpc security updates on trusty [14:27:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:06] (03PS3) 10Banyek: Labs: Config template generation for pt-kill [puppet] - 
10https://gerrit.wikimedia.org/r/458810 [14:29:35] (03CR) 10jerkins-bot: [V: 04-1] Labs: Config template generation for pt-kill [puppet] - 10https://gerrit.wikimedia.org/r/458810 (owner: 10Banyek) [14:29:55] !log installing PHP security updates on krypton [14:29:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:39] (03CR) 10Marostegui: Labs: Config template generation for pt-kill (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/458810 (owner: 10Banyek) [14:33:26] (03PS1) 10Bstorm: clush: toolforge clush node lists need to be region aware [puppet] - 10https://gerrit.wikimedia.org/r/458825 [14:33:28] (03PS4) 10Banyek: Labs: Config template generation for pt-kill [puppet] - 10https://gerrit.wikimedia.org/r/458810 [14:34:04] (03CR) 10jerkins-bot: [V: 04-1] Labs: Config template generation for pt-kill [puppet] - 10https://gerrit.wikimedia.org/r/458810 (owner: 10Banyek) [14:34:54] (03CR) 10Bstorm: [C: 032] clush: toolforge clush node lists need to be region aware [puppet] - 10https://gerrit.wikimedia.org/r/458825 (owner: 10Bstorm) [14:35:02] (03PS2) 10Bstorm: clush: toolforge clush node lists need to be region aware [puppet] - 10https://gerrit.wikimedia.org/r/458825 [14:40:02] PROBLEM - Check systemd state on analytics1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:40:46] this is me testing --^ [14:45:26] (03PS5) 10Banyek: Labs: Config template generation for pt-kill [puppet] - 10https://gerrit.wikimedia.org/r/458810 [14:45:56] (03CR) 10jerkins-bot: [V: 04-1] Labs: Config template generation for pt-kill [puppet] - 10https://gerrit.wikimedia.org/r/458810 (owner: 10Banyek) [14:46:26] 10Operations, 10cloud-services-team: Onboard gtirloni to WMF - https://phabricator.wikimedia.org/T203489 (10Dzahn) approved pending subscription requests for ops and ops-private (that now showed up), though i see the boxes were already checked here. 
[14:48:45] (03PS6) 10Banyek: Labs: Config template generation for pt-kill [puppet] - 10https://gerrit.wikimedia.org/r/458810 [14:49:06] (03PS7) 10Banyek: Labs: Config template generation for pt-kill [puppet] - 10https://gerrit.wikimedia.org/r/458810 [14:49:30] PROBLEM - HHVM rendering on mw2217 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:49:35] (03CR) 10jerkins-bot: [V: 04-1] Labs: Config template generation for pt-kill [puppet] - 10https://gerrit.wikimedia.org/r/458810 (owner: 10Banyek) [14:50:15] (03PS1) 10Elukey: profile::analytics::systemd_timer: fix bash nagios check script [puppet] - 10https://gerrit.wikimedia.org/r/458828 (https://phabricator.wikimedia.org/T172532) [14:50:30] RECOVERY - HHVM rendering on mw2217 is OK: HTTP OK: HTTP/1.1 200 OK - 80137 bytes in 0.310 second response time [14:50:54] (03CR) 10Elukey: [C: 032] profile::analytics::systemd_timer: fix bash nagios check script [puppet] - 10https://gerrit.wikimedia.org/r/458828 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [14:51:32] 10Operations, 10cloud-services-team: Onboard gtirloni to WMF - https://phabricator.wikimedia.org/T203489 (10GTirloni) @Dzahn thanks! I think those were leftovers from when I tried to subscribe myself vs when Andre added me directly. 
[14:52:29] 10Operations, 10Wikimedia-Mailing-lists, 10Bengali-Sites: Set up mailing list for Bengali Wikibooks - https://phabricator.wikimedia.org/T203736 (10Shahadat) [14:56:00] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [14:56:39] !log uploaded linux-meta 1.20+deb9u1 to apt.wikimedia.org/stretch-wikimedia (provides a new meta package for Linux 4.14) [14:56:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:44] (03PS1) 10Ema: ATS: websockets support for etherpad/phab, kartotherian port [puppet] - 10https://gerrit.wikimedia.org/r/458829 (https://phabricator.wikimedia.org/T199720) [14:58:14] (03PS1) 10Elukey: profille::analytics::systemd_timer: add simpler error/ok message [puppet] - 10https://gerrit.wikimedia.org/r/458830 (https://phabricator.wikimedia.org/T172532) [14:58:40] (03CR) 10Ema: [C: 032] ATS: websockets support for etherpad/phab, kartotherian port [puppet] - 10https://gerrit.wikimedia.org/r/458829 (https://phabricator.wikimedia.org/T199720) (owner: 10Ema) [14:59:43] (03PS2) 10Elukey: profille::analytics::systemd_timer: add simpler error/ok message [puppet] - 10https://gerrit.wikimedia.org/r/458830 (https://phabricator.wikimedia.org/T172532) [15:02:20] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [15:02:48] (03PS3) 10Elukey: profille::analytics::systemd_timer: improve monitoring [puppet] - 10https://gerrit.wikimedia.org/r/458830 (https://phabricator.wikimedia.org/T172532) [15:03:27] (03CR) 10Elukey: [C: 032] profille::analytics::systemd_timer: improve monitoring [puppet] - 10https://gerrit.wikimedia.org/r/458830 (https://phabricator.wikimedia.org/T172532) (owner: 
10Elukey) [15:05:34] (03CR) 10Subramanya Sastry: [C: 031] "I am at WikiLead and cannot really look into what all this is at this moment. I trust you all, so +1ing this. But, if you need a real revi" [puppet] - 10https://gerrit.wikimedia.org/r/458475 (owner: 10Giuseppe Lavagetto) [15:06:19] (03CR) 10Dzahn: [C: 031] Decommission meitnerium (old archiva host) [dns] - 10https://gerrit.wikimedia.org/r/458783 (https://phabricator.wikimedia.org/T203087) (owner: 10Elukey) [15:07:20] mutante: o/ [15:07:23] it's always fun to read other's comment [15:07:36] mutante: is there any doc about disabling a switch's port? [15:07:43] (which pings me lol) [15:08:08] 10Operations, 10Analytics, 10Patch-For-Review: Decommission Ganeti vm meitnerium.wikimedia.org (old Archiva host) - https://phabricator.wikimedia.org/T203087 (10Dzahn) note that the docs say "A Phabricator ticket for the decommission of the system should be placed in the #hardware-request project and the app... [15:08:36] elukey: i don't know, i don't have access [15:08:51] elukey: that part is usually done by robh [15:08:51] ahhh okok, I do but I am ignorant, will ask to Alex :) [15:08:58] or Rob of course :) [15:09:16] revi: luckily "gregarious" isn't a common word in codereview so I don't get pinged due to partial IRC output :) [15:09:41] elukey: see the part about "a ticket in #hardware-requests" at https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Decommission_Specific_(can_be_done_by_DC_Ops_without_network_switch_access) [15:09:57] 10Operations, 10Analytics, 10Patch-For-Review: Decommission Ganeti vm meitnerium.wikimedia.org (old Archiva host) - https://phabricator.wikimedia.org/T203087 (10elukey) >>! In T203087#4566582, @Dzahn wrote: > note that the docs say "A Phabricator ticket for the decommission of the system should be placed in... [15:10:00] elukey: there was also a template for the decom tickets, i swear it was in this place but now it's not.. 
[15:10:14] 10Operations, 10ops-eqiad, 10Analytics, 10hardware-requests: Decommission Ganeti vm meitnerium.wikimedia.org (old Archiva host) - https://phabricator.wikimedia.org/T203087 (10elukey) [15:10:29] elukey: oh.. i forgot for a second it's virtual, nevermind:) [15:10:29] mutante: my bad, for some reason I thought that vms were handled differently [15:10:35] ah! [15:11:13] greg-g: lol yeah probably wrong nick selection xD [15:11:21] 10Operations, 10ops-eqiad, 10Analytics, 10hardware-requests: Decommission Ganeti vm meitnerium.wikimedia.org (old Archiva host) - https://phabricator.wikimedia.org/T203087 (10Dzahn) I forgot for a moment this is a virtual machine. You can probably ignore that. Sorry. [15:18:42] who summons robh during coffee time? [15:18:45] ;] [15:20:23] oh, yeah ganeti vms are different ;] [15:21:10] robh: sorry for the coffee-ping-time! I wanted to learn about how to decom them and I spammed too many people :) [15:21:15] hehehe [15:21:20] my reply was joke not mad! [15:21:36] i didnt have sound on until i started owrking ;D [15:23:00] I know I know, I was joking as well :) [15:23:13] robh lol [15:27:00] (03PS8) 10Banyek: Labs: Config template generation for pt-kill [puppet] - 10https://gerrit.wikimedia.org/r/458810 [15:27:30] (03CR) 10jerkins-bot: [V: 04-1] Labs: Config template generation for pt-kill [puppet] - 10https://gerrit.wikimedia.org/r/458810 (owner: 10Banyek) [15:29:46] (03CR) 10Jcrespo: "We will talk about puppet profiles next week a bit more to make this validate." 
[puppet] - 10https://gerrit.wikimedia.org/r/458810 (owner: 10Banyek) [15:30:05] (03CR) 10Dzahn: [C: 032] "ah, thanks for finding these, yep both are gone" [puppet] - 10https://gerrit.wikimedia.org/r/458814 (owner: 10Muehlenhoff) [15:30:35] (03PS2) 10Dzahn: Remove obsolete Hiera host entries [puppet] - 10https://gerrit.wikimedia.org/r/458814 (owner: 10Muehlenhoff) [15:30:41] (03CR) 10Banyek: "sure thing, I wanted to give this up for today anyway" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/458810 (owner: 10Banyek) [15:35:24] (03PS3) 10Paladox: Replace "wikimedia-polygerrit-style" plugin with gerrit-theme and add CoC footer [puppet] - 10https://gerrit.wikimedia.org/r/458523 (https://phabricator.wikimedia.org/T196835) [15:38:42] (03PS4) 10Paladox: Replace "wikimedia-polygerrit-style" plugin with gerrit-theme [puppet] - 10https://gerrit.wikimedia.org/r/458523 (https://phabricator.wikimedia.org/T196835) [15:39:07] (03PS1) 10Paladox: Gerrit: Add footer link for CoC and Privacy Policy [puppet] - 10https://gerrit.wikimedia.org/r/458833 (https://phabricator.wikimedia.org/T196835) [15:39:23] (03PS2) 10Paladox: Gerrit: Add footer link for CoC and Privacy Policy [puppet] - 10https://gerrit.wikimedia.org/r/458833 (https://phabricator.wikimedia.org/T196835) [15:40:30] (03PS5) 10Paladox: Replace "wikimedia-polygerrit-style" plugin with gerrit-theme [puppet] - 10https://gerrit.wikimedia.org/r/458523 (https://phabricator.wikimedia.org/T196835) [15:40:32] (03PS3) 10Paladox: Gerrit: Add footer link for CoC and Privacy Policy [puppet] - 10https://gerrit.wikimedia.org/r/458833 (https://phabricator.wikimedia.org/T196835) [15:42:25] (03CR) 10Marostegui: Labs: Config template generation for pt-kill (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/458810 (owner: 10Banyek) [16:03:15] (03PS5) 10Mark Bergsma: Test Server invariants [debs/pybal] - 10https://gerrit.wikimedia.org/r/445207 (https://phabricator.wikimedia.org/T184715) [16:03:17] (03PS4) 10Mark Bergsma: Remove 
Server.modified and refresh preexisting servers individually [debs/pybal] - 10https://gerrit.wikimedia.org/r/446614 [16:03:22] RECOVERY - Check systemd state on analytics1003 is OK: OK - running: The system is fully operational [16:04:39] (03CR) 10Mark Bergsma: Remove Server.modified and refresh preexisting servers individually (031 comment) [debs/pybal] - 10https://gerrit.wikimedia.org/r/446614 (owner: 10Mark Bergsma) [16:06:50] 10Operations, 10monitoring, 10netops: Add virtual chassis port status alerting - https://phabricator.wikimedia.org/T201097 (10ayounsi) {F25691134} Putting the script here the time I send a Gerrit CR. It uses snimpy and the required MIBs can be obtained on https://apps.juniper.net/mib-explorer/index.jsp > m... [16:11:02] PROBLEM - Check systemd state on analytics1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:12:02] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [16:15:47] (03PS3) 10Dzahn: site: add tor_relay role to torrelay1001 [puppet] - 10https://gerrit.wikimedia.org/r/455744 (https://phabricator.wikimedia.org/T196701) [16:16:21] (03CR) 10Dzahn: [C: 032] "this will apply the role but keep the service stopped for now (disabled in hiera)" [puppet] - 10https://gerrit.wikimedia.org/r/455744 (https://phabricator.wikimedia.org/T196701) (owner: 10Dzahn) [16:17:11] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 17 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [16:18:17] 10Operations, 10Continuous-Integration-Infrastructure, 10Mail, 10Release-Engineering-Team, and 2 others: Ensure Jenkins mail configuration supports outbound smtp server failover - https://phabricator.wikimedia.org/T203607 (10hashar) ``` hashar@contint1001:~$ nc 127.0.0.1 25 220 contint1001.wikimedia.org ES... 
[16:18:59] 10Operations, 10Continuous-Integration-Infrastructure, 10Mail, 10Jenkins, and 2 others: Ensure Jenkins mail configuration supports outbound smtp server failover - https://phabricator.wikimedia.org/T203607 (10hashar) a:03hashar [16:20:31] (03PS1) 10Dzahn: replace radium with torrelay1001 as tor-eqiad-1.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/458839 (https://phabricator.wikimedia.org/T196701) [16:22:18] (03PS1) 10Dzahn: lower TTL for tor-eqiad-1.wikimedia.org to 5M [dns] - 10https://gerrit.wikimedia.org/r/458840 (https://phabricator.wikimedia.org/T196701) [16:22:39] (03CR) 10Alex Monk: [C: 032] Make configurable the cmd executed to perform a DNS zone update [software/certcentral] - 10https://gerrit.wikimedia.org/r/458767 (https://phabricator.wikimedia.org/T203678) (owner: 10Vgutierrez) [16:23:26] (03CR) 10Dzahn: [C: 032] lower TTL for tor-eqiad-1.wikimedia.org to 5M [dns] - 10https://gerrit.wikimedia.org/r/458840 (https://phabricator.wikimedia.org/T196701) (owner: 10Dzahn) [16:24:15] (03Merged) 10jenkins-bot: Make configurable the cmd executed to perform a DNS zone update [software/certcentral] - 10https://gerrit.wikimedia.org/r/458767 (https://phabricator.wikimedia.org/T203678) (owner: 10Vgutierrez) [16:24:22] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [16:25:49] (03CR) 10jenkins-bot: Make configurable the cmd executed to perform a DNS zone update [software/certcentral] - 10https://gerrit.wikimedia.org/r/458767 (https://phabricator.wikimedia.org/T203678) (owner: 10Vgutierrez) [16:25:56] (03PS2) 10Dzahn: replace radium with torrelay1001 as tor-eqiad-1.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/458839 (https://phabricator.wikimedia.org/T196701) [16:28:43] mutante: did you copy the crypto stuff over? 
[16:28:59] paravoid: makes sense that i keep the name "tor-eqiad-1" when replacing the machine it points to, right? as opposed to making "tor-eqiad-2" [16:29:14] (03PS1) 10Alex Monk: Make configurable the cmd executed to perform a DNS zone update [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/458842 (https://phabricator.wikimedia.org/T203678) [16:29:16] it does, but only if you do it properly! [16:29:17] that is [16:29:20] paravoid: https://phabricator.wikimedia.org/T196701#4536718 [16:29:20] stop tor on the old one [16:29:22] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [16:29:29] there is the migration plan :) [16:29:33] copy all the local crypto config over and all other kinds of state [16:29:36] stop on old, rsync, start on new [16:29:36] ah heh [16:29:43] make sure the old never comes back [16:29:53] and i made it possible to "ensure: stopped" in hiera [16:29:53] 10Operations, 10Traffic, 10Patch-For-Review: certcentral: Make configurable the cmd executed to perform a DNS zone update - https://phabricator.wikimedia.org/T203678 (10Brandon) I don't really know what cmd or "DNS zone update" means... I barely use phabricator lol. Is this issue related to safari/iOS auto-... [16:29:54] nod [16:30:00] awesome [16:30:03] :) [16:30:21] (03PS2) 10Faidon Liambotis: tor: add an additional relay instance [puppet] - 10https://gerrit.wikimedia.org/r/399972 [16:30:25] I made this a while ago [16:30:54] it's still pretty repetitive between the primary instance and the additional ones [16:30:55] 10Operations, 10Continuous-Integration-Infrastructure, 10Mail, 10Jenkins, and 2 others: Ensure Jenkins mail configuration supports outbound smtp server failover - https://phabricator.wikimedia.org/T203607 (10herron) Sounds like a plan! 
Will ping you Monday [16:31:03] (03CR) 10jerkins-bot: [V: 04-1] tor: add an additional relay instance [puppet] - 10https://gerrit.wikimedia.org/r/399972 (owner: 10Faidon Liambotis) [16:31:04] and the Family needs to be adjusted accordingly [16:31:11] plus that V-1 should be addressed [16:31:21] i can take a look at the V-1 [16:31:23] 10Operations, 10Traffic, 10Patch-For-Review: certcentral: Make configurable the cmd executed to perform a DNS zone update - https://phabricator.wikimedia.org/T203678 (10Krenair) >>! In T203678#4566976, @Brandon wrote: > I don't really know what cmd or "DNS zone update" means... I barely use phabricator lol.... [16:32:00] so family is set statically now to just those two instances, that needs to be addressed [16:32:14] the puppet code isn't really tested and I haven't really worked with puppet4's each before [16:32:19] (03CR) 10Alex Monk: [C: 032] "copy to this branch per vgutierrez" [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/458842 (https://phabricator.wikimedia.org/T203678) (owner: 10Alex Monk) [16:32:20] and then it's still a bit repetitive [16:32:34] I don't think I'll find time to work on it, feel free to take it over :) [16:32:54] RECOVERY - Check systemd state on analytics1003 is OK: OK - running: The system is fully operational [16:33:37] paravoid: ok:) [16:33:52] thanks for working on this! 
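The relay migration order discussed above (stop tor on the old host, copy keys and state, start on the new host, keep the old one from coming back) could be sketched as a dry-run plan. This is only an illustration: the host names radium and torrelay1001 come from the log, while the /var/lib/tor state directory, the tor unit name, and the use of ssh, rsync and systemctl are assumptions, and migration_plan is a hypothetical helper. Nothing is executed; the commands are only printed.

```python
import shlex


def migration_plan(old_host, new_host, state_dir="/var/lib/tor"):
    """Return the ordered commands for moving a relay, without running them.

    state_dir and the "tor" unit name are assumptions, not taken from the log.
    """
    return [
        # 1. Stop the relay on the old host so the same identity is never live twice.
        ["ssh", old_host, "systemctl stop tor"],
        # 2. Make sure the old instance does not come back after a reboot.
        ["ssh", old_host, "systemctl disable tor"],
        # 3. Pull keys and other state onto the new host. rsync cannot copy
        #    between two remote hosts, so it runs on the destination.
        ["ssh", new_host, f"rsync -a {old_host}:{state_dir}/ {state_dir}/"],
        # 4. Only now start the relay on the new host.
        ["ssh", new_host, "systemctl start tor"],
    ]


for command in migration_plan("radium", "torrelay1001"):
    print(shlex.join(command))
```

Keeping the plan as data makes the ordering easy to review before anyone runs it by hand; in the log the "never comes back" step is handled through an "ensure: stopped" setting in hiera instead of a manual disable.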
[16:34:00] (03Merged) 10jenkins-bot: Make configurable the cmd executed to perform a DNS zone update [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/458842 (https://phabricator.wikimedia.org/T203678) (owner: 10Alex Monk) [16:34:10] first i'll just get rid of radium to get rid of one jessie [16:34:35] then about adding a second one on torrelay1001 [16:34:54] PROBLEM - Tor DirPort on torrelay1001 is CRITICAL: connect to address 208.80.154.9 and port 9032: Connection refused [16:34:58] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review: Deploy a scalable service for ACME (LetsEncrypt) certificate management - https://phabricator.wikimedia.org/T199711 (10Krenair) [16:35:08] 10Operations, 10Traffic, 10Patch-For-Review: certcentral: Make configurable the cmd executed to perform a DNS zone update - https://phabricator.wikimedia.org/T203678 (10Krenair) 05Open>03Resolved [16:35:15] well, that icinga check was supposed to be not alerting.. checking [16:35:33] (03CR) 10jenkins-bot: Make configurable the cmd executed to perform a DNS zone update [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/458842 (https://phabricator.wikimedia.org/T203678) (owner: 10Alex Monk) [16:37:44] ACKNOWLEDGEMENT - Tor DirPort on torrelay1001 is CRITICAL: connect to address 208.80.154.9 and port 9032: Connection refused daniel_zahn new server not in prod [16:42:26] !log explicitely permit install1002/2002:80 in filter labs-in4 on cr1/2-eqiad - T190424 [16:42:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:32] T190424: modify labs-hosts1-vlans for http load of installer kernel - https://phabricator.wikimedia.org/T190424 [16:45:12] 10Operations, 10cloud-services-team, 10netops, 10Patch-For-Review: modify labs-hosts1-vlans for http load of installer kernel - https://phabricator.wikimedia.org/T190424 (10ayounsi) 05Open>03Resolved Thanks, this has been useful, especially running a packet capture on the working vs. 
non working flows.... [16:45:34] 10Operations, 10cloud-services-team, 10netops, 10Patch-For-Review: modify labs-hosts1-vlans for http load of installer kernel - https://phabricator.wikimedia.org/T190424 (10ayounsi) a:05RobH>03ayounsi [17:04:57] despite adding the thirdparty repo in puppet explicitly AND having "require => [ Apt::Repository['thirdparty-tor'],Exec['apt-get update']]" and not getting any errors.. i still have the distro package installed and the newer one is reported as "have been kept back" [17:15:27] (03PS1) 10Dzahn: tor: ensure libzstd1 is installed and required if on stretch [puppet] - 10https://gerrit.wikimedia.org/r/458848 (https://phabricator.wikimedia.org/T196701) [17:17:02] (03CR) 10Dzahn: [C: 032] tor: ensure libzstd1 is installed and required if on stretch [puppet] - 10https://gerrit.wikimedia.org/r/458848 (https://phabricator.wikimedia.org/T196701) (owner: 10Dzahn) [17:19:16] (03CR) 10Dzahn: [C: 032] "removed the tor package, ran puppet again, and now we have " 0.3.3.9-1~d90.stretch+1" instead of 0.2.9.16-1 and no manual install steps :" [puppet] - 10https://gerrit.wikimedia.org/r/458848 (https://phabricator.wikimedia.org/T196701) (owner: 10Dzahn) [17:28:59] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Root for Giovanni Tirloni - https://phabricator.wikimedia.org/T203494 (10Ottomata) a:03Andrew [17:29:08] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Root for Giovanni Tirloni - https://phabricator.wikimedia.org/T203494 (10Ottomata) p:05Triage>03Normal [17:30:55] 10Operations, 10LDAP-Access-Requests: Remove user "albe" from the wmde LDAP group - https://phabricator.wikimedia.org/T203561 (10Ottomata) p:05Triage>03Normal a:03Dzahn [17:31:16] 10Operations, 10LDAP-Access-Requests: Remove user "albe" from the wmde LDAP group - https://phabricator.wikimedia.org/T203561 (10Ottomata) 05Open>03Resolved Looks like we can resolve, thanks. 
[17:31:45] (03PS1) 10Ayounsi: Icinga: add check_vcp (part 1) [puppet] - 10https://gerrit.wikimedia.org/r/458850 (https://phabricator.wikimedia.org/T201097) [17:32:40] (03PS3) 10Ayounsi: Add SNMP classes [puppet] - 10https://gerrit.wikimedia.org/r/364753 (owner: 10Faidon Liambotis) [17:33:56] (03PS4) 10Ayounsi: Add SNMP classes [puppet] - 10https://gerrit.wikimedia.org/r/364753 (owner: 10Faidon Liambotis) [17:37:34] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [17:37:44] (03PS5) 10Alex Monk: Debian packaging [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/458554 [17:39:14] 10Operations, 10LDAP-Access-Requests: Remove user "albe" from the wmde LDAP group - https://phabricator.wikimedia.org/T203561 (10Dzahn) 05Resolved>03Open Not yet please, per "Keeping the ticket open to check whether the NDA group removal needs a process with legal or not." and meanwhile i know it does.. R... [17:40:55] !log apply firewall changes on pfw3-codfw - T203793 [17:40:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:43] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 17 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [17:45:34] !log apply firewall changes on pfw3-eqiad - T203793 [17:45:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:10] 10Operations, 10LDAP-Access-Requests: Remove user "albe" from the wmde LDAP group - https://phabricator.wikimedia.org/T203561 (10RStallman-legalteam) Thanks and sorry for my delay. I have noted this on the shared spreadsheet and updated the NDA contract record with this info. 
[17:52:52] !log legoktm@deploy1001 Synchronized php-1.32.0-wmf.20/extensions/EUCopyrightCampaign/: Update MEPs - https://gerrit.wikimedia.org/r/458628 (duration: 00m 50s) [17:52:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:06] CindyCicaleseWMF: ^^ [17:54:01] legoktm: thanks! checking now [17:54:07] 10Operations, 10SRE-Access-Requests, 10Discovery-Search (Current work): Onboarding Mathew Onipe - https://phabricator.wikimedia.org/T202708 (10EBjune) @Dzahn the email address was granted last week, thanks for checking up on that! [17:55:38] (03PS29) 10Alex Monk: [WIP] Central certificates service [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) [17:55:50] 10Operations, 10LDAP-Access-Requests: Remove user "albe" from the wmde LDAP group - https://phabricator.wikimedia.org/T203561 (10Dzahn) 05Open>03Resolved cool, thanks Rachel. re-closing it then :) [18:02:11] !log legoktm@deploy1001 Synchronized php-1.32.0-wmf.20/extensions/EUCopyrightCampaign/: Update MEPs - https://gerrit.wikimedia.org/r/458628 (for real this time) (duration: 00m 50s) [18:02:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:47] legoktm: It works! Thank you! [18:06:57] :) [18:09:08] 10Operations, 10SRE-Access-Requests, 10Discovery-Search (Current work): Onboarding Mathew Onipe - https://phabricator.wikimedia.org/T202708 (10Dzahn) [18:10:01] 10Operations, 10SRE-Access-Requests, 10Discovery-Search (Current work): Onboarding Mathew Onipe - https://phabricator.wikimedia.org/T202708 (10Dzahn) Thanks @EBjune ! I fixed the mailing list situation first. subscribed to ops and ops-private with the wikimedia.org address (and removed the former one where i... 
[18:11:43] 10Operations, 10SRE-Access-Requests, 10Discovery-Search (Current work): Onboarding Mathew Onipe - https://phabricator.wikimedia.org/T202708 (10Dzahn) [18:12:35] !log LDAP: added user 'monipe' to group 'wmf' (T202708) [18:12:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:40] T202708: Onboarding Mathew Onipe - https://phabricator.wikimedia.org/T202708 [18:13:38] Can you "before => Package['something']" when using "require_package('something')" ? [18:14:31] also why use require_package when I can just put a Package resource in? convenience? [18:15:34] if you have 2 classes both installing the same package and you apply them on the same node.. one (a plain Package resource) will cause a puppet error about duplicate definition and the other (require_package) would work [18:16:06] not sure about the first one off hand [18:20:21] !log LDAP: correction, 'monipe' replaced with 'onimisionipe' in wmf group (T202708) [18:20:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:30] T202708: Onboarding Mathew Onipe - https://phabricator.wikimedia.org/T202708 [18:23:11] mutante, I ask because I tried to make an apt::pin precede my require_package [18:23:12] and got this [18:23:20] Error: Failed to apply catalog: Found 1 dependency cycle: [18:23:20] (File[/etc/apt/preferences.d/acme.pref] => Apt::Pin[acme] => Package[python3-certcentral] => Class[Packages::Python3_certcentral] => Class[Certcentral::Central] => Apt::Pin[acme] => File[/etc/apt/preferences.d/acme.pref]) [18:23:38] Now it's logical up to the Class[Packages::Python3_certcentral] point [18:24:01] 10Operations, 10SRE-Access-Requests, 10Discovery-Search (Current work): Onboarding Mathew Onipe - https://phabricator.wikimedia.org/T202708 (10Dzahn) Added to "wmf" LDAP group, which gives these permissions, including Icinga: https://wikitech.wikimedia.org/wiki/LDAP/Groups#Specific_groups "ops" LDAP group... 
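One way to read the dependency cycle pasted above is to treat the ordering as a directed graph and look for a loop. The sketch below is not how Puppet itself does it; it just replays the edges from the error message through a small depth-first search. find_cycle and EDGES are hypothetical names, and the trailing hop back to the File resource is left out to keep the loop simple. The last two edges presumably exist because require_package makes the calling class depend on the package class, and the pin is declared inside that same calling class; that reading is an assumption, not something the log confirms.

```python
def find_cycle(edges):
    """Return one cycle as a list of nodes (first == last), or None."""
    graph = {}
    for before, after in edges:
        graph.setdefault(before, []).append(after)
        graph.setdefault(after, [])

    state = {}  # node -> "visiting" while on the current path, "done" afterwards
    path = []

    def visit(node):
        state[node] = "visiting"
        path.append(node)
        for nxt in graph[node]:
            if state.get(nxt) == "visiting":
                # nxt is already on the current path, so we have closed a loop.
                return path[path.index(nxt):] + [nxt]
            if nxt not in state:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        state[node] = "done"
        return None

    for node in graph:
        if node not in state:
            found = visit(node)
            if found:
                return found
    return None


# Edges copied from the error message, each read as "left is applied before right".
PIN_BEFORE_PACKAGE = ("Apt::Pin[acme]", "Package[python3-certcentral]")
EDGES = [
    ("File[/etc/apt/preferences.d/acme.pref]", "Apt::Pin[acme]"),
    PIN_BEFORE_PACKAGE,  # the explicit "before => Package[...]" on the pin
    ("Package[python3-certcentral]", "Class[Packages::Python3_certcentral]"),
    ("Class[Packages::Python3_certcentral]", "Class[Certcentral::Central]"),
    ("Class[Certcentral::Central]", "Apt::Pin[acme]"),
]

print(find_cycle(EDGES))
# Dropping the explicit ordering on the pin removes the loop in this model.
print(find_cycle([edge for edge in EDGES if edge != PIN_BEFORE_PACKAGE]))
```

In this model the loop disappears as soon as either the explicit "before" edge or the class-level dependency goes away, which is consistent with switching to a plain Package resource.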
[18:24:29] and actually including Class[Certcentral::Central] [18:24:39] But then that's needed to complete Apt::Pin[acme] ? don't get it [18:25:04] Just going to go with a package resource, I don't think we'll run into that particular case you mention that require_package solves [18:26:20] Krenair: ehm.. ok, yea, fair enough [18:27:05] given the nature of the package and the hosts I expect it to get applied to [18:30:48] (03PS30) 10Alex Monk: [WIP] Central certificates service [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) [18:32:53] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [18:35:43] 10Operations, 10SRE-Access-Requests, 10Discovery-Search (Current work): Onboarding Mathew Onipe - https://phabricator.wikimedia.org/T202708 (10Dzahn) [18:36:39] 10Operations, 10SRE-Access-Requests, 10Discovery-Search (Current work): Onboarding Mathew Onipe - https://phabricator.wikimedia.org/T202708 (10Dzahn) Logstash, Tendril, Graphite, Grafana, Icinga, Piwik should all work now [18:41:01] 10Operations, 10SRE-Access-Requests, 10Discovery-Search (Current work): Onboarding Mathew Onipe - https://phabricator.wikimedia.org/T202708 (10Dzahn) [18:43:03] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 17 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [18:50:23] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [18:55:23] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 17 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [18:58:07] 10Operations, 10cloud-services-team: Onboard gtirloni to WMF - https://phabricator.wikimedia.org/T203489 (10Dzahn) You can probably skip the "racktables" 
checkbox in favor of https://netbox.wikimedia.org/ instead. Though T199083 is still open it seems close. [18:59:21] 10Operations, 10SRE-Access-Requests, 10Discovery-Search (Current work): Onboarding Mathew Onipe - https://phabricator.wikimedia.org/T202708 (10Mathew.onipe) SSH Keys Production: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCzc2xF3S5UnIFMnK0gbGBhYvsN+xjiJejtEZUnGb23vJTX7N955C7dBdHkTHpcV7+yWqpzzWkJCpnRs5Q0P+JyQ5hOikv... [19:01:20] 10Operations, 10cloud-services-team: Onboard gtirloni to WMF - https://phabricator.wikimedia.org/T203489 (10GTirloni) [19:05:24] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [19:07:37] (03PS1) 10Framawiki: Create Cookbook NS in bnwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/458870 (https://phabricator.wikimedia.org/T203534) [19:08:13] (03CR) 10Framawiki: "Note that namespaceDupes.php maintenance script run will be needed after the deployment." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/458870 (https://phabricator.wikimedia.org/T203534) (owner: 10Framawiki) [19:09:44] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [19:12:17] (03PS1) 10Krinkle: gitignore: Remove redundant ignore for vendor/slim [mediawiki-config] - 10https://gerrit.wikimedia.org/r/458871 [19:12:38] (03CR) 10Krinkle: [C: 032] gitignore: Remove redundant ignore for vendor/slim [mediawiki-config] - 10https://gerrit.wikimedia.org/r/458871 (owner: 10Krinkle) [19:14:34] (03Merged) 10jenkins-bot: gitignore: Remove redundant ignore for vendor/slim [mediawiki-config] - 10https://gerrit.wikimedia.org/r/458871 (owner: 10Krinkle) [19:14:49] 10Operations, 10Electron-PDFs, 10Proton, 10Patch-For-Review, and 3 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748 (10pmiazga) The task is nearly ready, the remaining bits and pieces: - we need to merge the last patch for T181623 - the new flow for closing... 
[19:19:05] 10Operations: Add which ldap groups can login on netbox login form - https://phabricator.wikimedia.org/T203840 (10Framawiki) [19:22:46] (03CR) 10jenkins-bot: gitignore: Remove redundant ignore for vendor/slim [mediawiki-config] - 10https://gerrit.wikimedia.org/r/458871 (owner: 10Krinkle) [19:27:33] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [19:30:20] (03PS1) 10Dzahn: admins: create shell user for Mathew Onipe [puppet] - 10https://gerrit.wikimedia.org/r/458877 (https://phabricator.wikimedia.org/T202708) [19:31:22] (03PS2) 10Dzahn: admins: create shell user for Mathew Onipe [puppet] - 10https://gerrit.wikimedia.org/r/458877 (https://phabricator.wikimedia.org/T202708) [19:33:02] (03CR) 10Gehel: [C: 031] "Thanks Mutante for taking care of that!" [puppet] - 10https://gerrit.wikimedia.org/r/458877 (https://phabricator.wikimedia.org/T202708) (owner: 10Dzahn) [19:34:07] (03PS24) 10Mathew.onipe: Elasticsearch module is coming up. [software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) [19:37:53] 10Operations, 10netops, 10Wikimedia-Incident: asw2-a-eqiad FPC5 gets disconnected every 10 minutes - https://phabricator.wikimedia.org/T201145 (10ayounsi) Cabling has been done out of order, but end result is there. (minus the 7m DAC). During the re-cabling, the fabric was very unstable: frequent disconnect... [19:38:23] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [19:40:32] 10Operations: Add which ldap groups can login on netbox login form - https://phabricator.wikimedia.org/T203840 (10Aklapper) "add" to what / where? 
[19:41:19] 10Operations: Add which ldap groups can login on netbox login form - https://phabricator.wikimedia.org/T203840 (10Krenair) Presumably the login form itself. [19:41:53] (03CR) 10Ebjune: [C: 031] admins: create shell user for Mathew Onipe [puppet] - 10https://gerrit.wikimedia.org/r/458877 (https://phabricator.wikimedia.org/T202708) (owner: 10Dzahn) [19:43:26] 10Operations: Add which ldap groups can login on netbox login form - https://phabricator.wikimedia.org/T203840 (10Krenair) It's the ops group btw: https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/netbox/templates/ldap_config.py.erb$39 [19:45:44] 10Operations: Add which ldap groups can login on netbox login form - https://phabricator.wikimedia.org/T203840 (10Dzahn) login form itself and also should be on https://wikitech.wikimedia.org/wiki/LDAP/Groups#Specific_groups [19:47:22] 10Operations: Add which ldap groups can login on netbox login form - https://phabricator.wikimedia.org/T203840 (10Krenair) >>! In T203840#4567938, @Dzahn wrote: > also should be on https://wikitech.wikimedia.org/wiki/LDAP/Groups#Specific_groups well that bit's easy, done [19:50:33] 10Operations, 10SRE-Access-Requests, 10Discovery-Search (Current work), 10Patch-For-Review: Onboarding Mathew Onipe - https://phabricator.wikimedia.org/T202708 (10Gehel) Thanks @Dzahn to move this forward! I was stalling this for too long. For the next steps: * I'll propose adding @Mathew.onipe to elasti... [19:53:42] 10Operations, 10netops, 10Wikimedia-Incident: asw2-a-eqiad FPC5 gets disconnected every 10 minutes - https://phabricator.wikimedia.org/T201145 (10BBlack) I think, it's hard to evaluate the stability of the intended, supported VCF design while in an intermediate state. It's also probably not reasonable to ex... 
[19:53:43] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [19:54:24] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is inactive [19:57:44] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is OK: OK - nfs-exportd is active [19:58:04] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [19:59:23] PROBLEM - Disk space on elastic1030 is CRITICAL: DISK CRITICAL - free space: /srv 52609 MB (10% inode=99%) [20:14:43] PROBLEM - Disk space on elastic1030 is CRITICAL: DISK CRITICAL - free space: /srv 49681 MB (10% inode=99%) [20:25:36] 10Puppet, 10Toolforge, 10Documentation, 10User-srodlund: Document our GridEngine set up - https://phabricator.wikimedia.org/T88733 (10srodlund) [20:28:54] RECOVERY - Disk space on elastic1030 is OK: DISK OK [20:30:44] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [20:31:21] (03CR) 10Gehel: "Looking good! We're getting close!" (0315 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe) [20:31:34] (03CR) 10Mathew.onipe: Elasticsearch module is coming up. 
(0312 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe) [20:32:54] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [20:35:29] (03CR) 10Gehel: Elasticsearch module is coming up. (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe) [20:41:08] (03PS1) 10Mathew.onipe: elasticsearch shard size check * Checks shard size and sends alert if more than 30gb. [puppet] - 10https://gerrit.wikimedia.org/r/458891 [20:41:49] (03CR) 10jenkins-bot: [V: 04-1] elasticsearch shard size check * Checks shard size and sends alert if more than 30gb. [puppet] - 10https://gerrit.wikimedia.org/r/458891 (owner: 10Mathew.onipe) [20:48:00] (03CR) 10jenkins-bot: Generate documentation with Sphinx [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/398462 (owner: 10Hashar) [20:48:28] (03CR) 10Hashar: "The doc is now on https://doc.wikimedia.org/docker-pkg/ :]" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/398462 (owner: 10Hashar) [20:56:54] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [20:59:04] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [21:01:39] (03CR) 10C. Scott Ananian: "> I note this will cause $wgUseTidy to be false rather than true on" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443645 (https://phabricator.wikimedia.org/T175706) (owner: 10C.
Scott Ananian) [21:04:21] (03PS5) 10C. Scott Ananian: Remove $wgUseTidy and $wgTidyConfig from configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443645 (https://phabricator.wikimedia.org/T175706) [21:05:39] 10Operations, 10SRE-Access-Requests: Requesting access to researchers for kharlan - https://phabricator.wikimedia.org/T203847 (10kostajh) [21:05:49] (03PS9) 10Dzahn: Gerrit: Move all logging to /var/log/gerrit [puppet] - 10https://gerrit.wikimedia.org/r/423794 (owner: 10Chad) [21:05:51] (03CR) 10C. Scott Ananian: [C: 04-2] "Self C-2 until Ie30c9174e6e3b60bce5a692296a9de1e30192e2c is merged; I think that's the only dependency on $wgUseTidy=true which is relevan" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443645 (https://phabricator.wikimedia.org/T175706) (owner: 10C. Scott Ananian) [21:05:57] 10Operations, 10SRE-Access-Requests: Requesting access to researchers for kharlan - https://phabricator.wikimedia.org/T203847 (10kostajh) [21:07:54] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [21:07:54] (03CR) 10Dzahn: [C: 031] "> Before merging this change you copy the logs from "/var/lib/gerrit2/review_site/logs" to "/var/log/gerrit"" [puppet] - 10https://gerrit.wikimedia.org/r/423794 (owner: 10Chad) [21:10:04] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [21:16:30] jouncebot: next [21:16:31] In 61 hour(s) and 13 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180910T1030) [21:17:08] a quick gerrit restart is coming soon [21:18:44] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the 
critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [21:18:58] (03CR) 10Dzahn: [C: 032] Gerrit: Move all logging to /var/log/gerrit [puppet] - 10https://gerrit.wikimedia.org/r/423794 (owner: 10Chad) [21:19:31] :) [21:20:50] paladox: http://www.peteimbesi.com/html/CYOA/Sad_Trombone.mp3 [21:21:08] mutante lol, did it fail? [21:21:13] Error while evaluating a Resource Statement, Duplicate declaration: File[/var/lib/gerrit2/review_site/logs] [21:21:21] oh [21:21:42] i forgot to remove that [21:21:43] * paladox does the patch [21:21:59] thanks:) cobalt isnt affected, i tested on gerrit2001 first [21:22:07] (03PS1) 10Paladox: Gerrit: Remove duplicate /var/lib/gerrit2/review_site/logs [puppet] - 10https://gerrit.wikimedia.org/r/458904 [21:22:34] (03PS2) 10Paladox: Gerrit: Remove duplicate /var/lib/gerrit2/review_site/logs [puppet] - 10https://gerrit.wikimedia.org/r/458904 [21:22:40] mutante ^^ [21:23:17] thanks, *nod* [21:23:25] (03CR) 10Dzahn: [C: 032] Gerrit: Remove duplicate /var/lib/gerrit2/review_site/logs [puppet] - 10https://gerrit.wikimedia.org/r/458904 (owner: 10Paladox) [21:24:41] (03PS8) 10Zhuyifei1999: quarry: Move the install into a venv and upgrade to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/451698 (https://phabricator.wikimedia.org/T192698) [21:25:04] paladox: some more issues. 
the first one not very surprising and can be fixed by hand: [21:25:11] change from directory to link failed: Could not remove existing file [21:25:17] yep [21:25:22] need to follow the instructions :) [21:25:24] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [21:25:27] (on change) [21:25:32] but also: Gerrit::Proxy/File[/etc/apache2/ports.conf]: Dependency File[/var/lib/gerrit2/review_site/logs] [21:25:43] there are more dependencies on the old file [21:25:44] need to manually create /var/log/gerrit (shouldn't need to do that if gerrit was to ever be reinstalled) [21:25:49] then copy logs over [21:25:49] i did [21:25:53] that part [21:26:03] and then rm -rf /var/lib/gerrit2/review_site/logs [21:26:05] and re run gerrit [21:26:13] *puppet [21:26:27] that's ok, yep. but still a separate puppet code issue [21:26:40] i guess it needs force = true? [21:26:43] PROBLEM - puppet last run on gerrit2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/var/lib/gerrit2/review_site/logs] [21:26:50] CSI:Wikimedia-Puppet [21:26:57] the new series [21:28:28] ACKNOWLEDGEMENT - puppet last run on gerrit2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures.
Failed resources (up to 3 shown): File[/var/lib/gerrit2/review_site/logs] daniel_zahn moving log location [21:28:38] lol [21:30:35] (03PS1) 10Bearloga: Updates to Product Analytics profiles and roles [puppet] - 10https://gerrit.wikimedia.org/r/458907 [21:31:44] RECOVERY - puppet last run on gerrit2001 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [21:31:53] !log gerrit2001, moving gerrit logfiles to /var/log/gerrit, removing old gerrit logdir, letting puppet re-create it as symlink [21:31:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:34] a hollywood production [21:52:47] (03CR) 10Smalyshev: Enable dailies everywhere [puppet] - 10https://gerrit.wikimedia.org/r/456170 (https://phabricator.wikimedia.org/T201217) (owner: 10Smalyshev) [21:57:55] (03PS1) 10Paladox: Gerrit: Add owner and group to "/var/lib/gerrit2/review_site/logs" [puppet] - 10https://gerrit.wikimedia.org/r/458920 [21:58:29] (03PS2) 10Paladox: Gerrit: Add owner and group to "/var/lib/gerrit2/review_site/logs" [puppet] - 10https://gerrit.wikimedia.org/r/458920 [21:58:49] (03PS3) 10Paladox: Gerrit: Add owner and group to "/var/lib/gerrit2/review_site/logs" [puppet] - 10https://gerrit.wikimedia.org/r/458920 [21:58:49] mutante ^^ [21:59:58] paladox: i had memories of https://projects.puppetlabs.com/issues/15371 but that's old [22:00:08] i do see content in error_log now [22:01:10] 10Operations: Trying to install updated versions of "linux-meta linux-meta-4.9" fails - https://phabricator.wikimedia.org/T203851 (10Paladox) [22:01:16] mutante :) [22:02:57] 10Operations: Trying to install updated versions of "linux-meta linux-meta-4.9" fails - https://phabricator.wikimedia.org/T203851 (10Dzahn) 17:53 < mutante> paladox: linux-image-4.9.0-0.bpo.7 is available, but .8 is not. not saying i know why exactly, but i checked apt.wikimedia.org 17:53 < mutante> maybe there... 
[22:03:33] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 20 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [22:05:05] (03CR) 10Dzahn: [C: 032] "not really needed but looks nicer" [puppet] - 10https://gerrit.wikimedia.org/r/458920 (owner: 10Paladox) [22:05:43] (03CR) 10Dzahn: [C: 032] "WIP?" [puppet] - 10https://gerrit.wikimedia.org/r/458920 (owner: 10Paladox) [22:05:45] paladox: did you publish the draft? [22:05:54] mutante yep i unmarked it as wip [22:06:08] and the button turned blue :) [22:06:15] the new pg screen with status badge would have made me press that unwip button quicker :) [22:06:23] wasn't obvious in the old ui [22:06:27] heh [22:07:14] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [22:08:34] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 19 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [22:12:23] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [22:13:33] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [22:15:43] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [22:16:11] !log - cobalt (gerrit) - applying change to move log file location, manually moved logs to /var/log/gerrit, remove old log dir, let puppet re-create it, like on gerrit2001 [22:16:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:17:19] !log gerrit
- restarting for config change to move log files to /var/log/gerrit/ [22:17:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:19:32] mutante: oh man thanks for that (/var/log/gerrit) [22:19:59] hashar: :) it always bugged me too [22:20:11] i did it now when fewer people are uploading stuff [22:20:27] thanks to paladox for taking over chad's change [22:20:33] :) [22:20:55] you even left a symlink behind [22:20:56] great [22:20:58] i was surprised how fast the restart was [22:21:24] i can confirm new logs are written in new place [22:21:42] i guess it's faster due to less db constraint. [22:21:48] as long as it can reach the db it won't interfere with gerrit. [22:22:00] it felt much faster than i was used to from the past [22:22:14] yep [22:22:57] mutante: and I think we have (or had) gerrit logs sent to logstash somehow [22:23:05] yep [22:23:13] though gerrit has flogger now in the master branch [22:23:21] hashar: but "not all" afaict [22:23:30] and has some kind of thingy that allows log4j [22:23:43] just pasted an error from log4j to paladox [22:23:47] that was unrelated to this change [22:24:24] log4j:WARN Detected problem with connection: java.net.SocketException: Broken pipe (Write failed) [22:24:45] that was popping up in 'systemctl status gerrit' [22:24:53] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [22:25:14] anyway, I am off for the week and some rest [22:25:25] good week-end [22:25:33] good weekend too hashar, cya [22:29:54] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [22:39:11] (03PS31) 10Alex Monk: [WIP] Central certificates service [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) [22:39:35] (03PS9) 10Zhuyifei1999: quarry: Move the install
into a venv and upgrade to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/451698 (https://phabricator.wikimedia.org/T192698) [22:46:22] (03PS32) 10Alex Monk: [WIP] Central certificates service [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) [22:46:45] (03PS6) 10Alex Monk: Debian packaging [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/458554 [22:47:20] 10Operations, 10Analytics, 10procurement: rack/setup/install stat1007.eqiad.wmnet (stat1005 user replacement) - https://phabricator.wikimedia.org/T203852 (10RobH) p:05Triage>03Normal [22:47:32] 10Operations, 10ops-eqiad, 10Analytics: rack/setup/install stat1007.eqiad.wmnet (stat1005 user replacement) - https://phabricator.wikimedia.org/T203852 (10RobH) [22:48:43] 10Operations, 10ops-eqiad, 10Analytics: rack/setup/install stat1007.eqiad.wmnet (stat1005 user replacement) - https://phabricator.wikimedia.org/T203852 (10RobH) So my racking proposal may be overly restrictive. I'm not sure what services will share between stat boxes, so to be safe I simply stated not to ra... 
[23:23:28] !log ms-be1041 - repairing xfs per https://wikitech.wikimedia.org/wiki/Swift/How_To#Repair_xfs_free_blocks_counter_corruption (T199198) [23:23:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:23:34] T199198: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198 [23:25:01] !log ms-be2041 - repairing /dev/sdh1 (T199198) [23:25:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:26:35] !log ms-be2042 - repairing /dev/sdj1 (T199198) [23:26:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:35:24] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [23:40:33] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [23:46:13] PROBLEM - Disk space on elastic1024 is CRITICAL: DISK CRITICAL - free space: /srv 51504 MB (10% inode=99%) [23:51:40] (03PS1) 10Dzahn: tor_relay: stop service on radium, start on torrelay1001 [puppet] - 10https://gerrit.wikimedia.org/r/458932 (https://phabricator.wikimedia.org/T196701) [23:52:44] PROBLEM - Disk space on elastic1024 is CRITICAL: DISK CRITICAL - free space: /srv 50643 MB (10% inode=99%) [23:54:14] (03PS33) 10Alex Monk: [WIP] Central certificates service [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) [23:55:59] (03PS7) 10Alex Monk: Debian packaging [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/458554 [23:57:43] RECOVERY - Filesystem available is greater than filesystem size on ms-be2041 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2041&var-datasource=codfw%2520prometheus%252Fops [23:58:14] (03CR) 10Dzahn: [C: 032] tor_relay: stop service on radium, start on torrelay1001 [puppet] - 10https://gerrit.wikimedia.org/r/458932 (https://phabricator.wikimedia.org/T196701) (owner: 10Dzahn)