[00:04:06] code sync'ed everywhere [00:04:47] it doesn't work properly [00:05:22] https://cs.wiktionary.org/wiki/Speci%C3%A1ln%C3%AD:Co_odkazuje_na?target=Modul%3AQuote%2Ftools&namespace=10&title=Speci%C3%A1ln%C3%AD%3ACo_odkazuje_na [00:05:52] "skrýt" (hide) isn't a link which toggles to "zobrazit" (show) [00:07:16] !log dereckson@tin Finished scap: Revert "Convert Special:WhatLinksHere from XML form to OOUI form" ([[Gerrit:289772]], T135773) (duration: 34m 12s) [00:07:17] T135773: [Regression] Special:WhatLinksHere is unusable - https://phabricator.wikimedia.org/T135773 [00:07:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:09:07] Danny_B: and now? [00:10:07] still nothing - can messages be reverted as well? [00:10:23] Dereckson: hmm, seems that the message rebuild there failed or something... [00:10:31] (and whoever made the patch be taught that repurposing messages is evil?) [00:10:31] * thedj not sure [00:10:38] we only reverted the english translations right...? [00:10:45] oh... [00:10:56] good point [00:11:00] 00:43:29 < MatmaRex> i wonder how many translations were "updated" in the meantime [00:11:05] 00:52:37 < Danny_B> MatmaRex: at least 28 langs [00:11:05] 00:52:56 < Danny_B> (based on last 5000 edits on twn) [00:11:13] so all the translations will be dumped [00:11:17] ffs [00:11:59] oh hell [00:11:59] well it's 2:11 am for me and i wouldn't even know how to do that either :/ [00:12:17] just revert a bunch of localisation updates in core [00:12:22] "just" [00:12:34] all the TWN ppl seem to be in sleeping TZs too [00:13:00] but i think we have LocalisationUpdate? which pulls messages from core or TWN, can't remember [00:13:22] so this should fix itself entirely in a couple hours/days, in the worst case, when people update the translations again [00:13:48] (or am i misremembering how this works?) [00:14:19] We could include in the next tech news that this update is urgent? [00:14:34] So people will check on TranslateWiki if it has been translated for them? [00:14:56] Danny_B: Ajayrahulp [00:15:17] MatmaRex: l10nupdate runs around 03:00z each day and copies in all the messages from master [00:16:02] i could probably revert all the changes on translatewiki, and then manually pull in translated messages into core [00:16:17] (i wrote a little script for the latter just a few days ago) [00:16:42] You want to do that now? [00:16:44] if it'll be updated from master, like bd808 says, i think we could skip deploying it manually [00:21:50] anyway, yeah, i'll try doing that [00:22:07] k [00:25:39] man, it's pretty amazing how quickly this got so many translations [00:26:33] for future reference, i'm looking at https://translatewiki.net/w/i.php?title=Special%3ATranslations&message=whatlinkshere-hidetrans&namespace=8 and reverting all the changes made since the last English message update (light blue background) [00:31:32] that should have created new message keys instead of re-using the old ones -.- [00:32:02] yep [00:33:48] MatmaRex: it's a highly used page so no doubt it was quick [00:34:02] also it is three or four messages to be reverted [00:38:26] I've notified the patch author @ https://phabricator.wikimedia.org/T135773#2311312 [00:38:44] (03PS2) 10Dzahn: RT: do not ensure=>latest,install perldoc [puppet] - 10https://gerrit.wikimedia.org/r/289796 (https://phabricator.wikimedia.org/T119112) [00:39:28] Dereckson: thanks! [00:40:15] Dereckson: could that task be actually closed?
[00:41:06] and the notification put rather to T117754 [00:41:06] T117754: Convert Special:WhatLinksHere to OOUI - https://phabricator.wikimedia.org/T117754 [00:41:36] perhaps MatmaRex wants to attach the l10n revert to T135773 too [00:41:36] T135773: [Regression] Special:WhatLinksHere is unusable - https://phabricator.wikimedia.org/T135773 [00:43:34] (i'm about 70% done with reverting, then i'll have to import them) [00:44:07] i've copied the notification comment [00:44:44] can you folks take a screenshot of the current broken state, for future reference? [00:44:58] (so that i can bash people over the head with it if anyone tries to do this again ;) ) [00:45:01] 06Operations, 10Wikimedia-Mailing-lists: Reset Mailman List Creator password - https://phabricator.wikimedia.org/T135776#2311323 (10Dzahn) I reset the list creator password on the server, fermium. docs: There is the **mmsitepass** command (/usr/sbin/ linked to /var/lib/mailman/bin/) **mmsitepass -c** to r... [00:47:41] MatmaRex: https://s3.amazonaws.com/upload.screenshot.co/93ac7370e0 [00:48:04] MatmaRex: https://cs.wiktionary.org/w/index.php?title=Speci%C3%A1ln%C3%AD%3ACo+odkazuje+na&target=Hlavn%C3%AD+strana&namespace=10&uselang=cs [00:48:28] (03PS3) 10Dzahn: RT: do not ensure=>latest,install perldoc [puppet] - 10https://gerrit.wikimedia.org/r/289796 (https://phabricator.wikimedia.org/T119112) [00:48:58] MatmaRex: {F4032805} [00:49:04] (03PS4) 10Dzahn: RT: do not ensure=>latest,install perldoc [puppet] - 10https://gerrit.wikimedia.org/r/289796 (https://phabricator.wikimedia.org/T119112) [00:49:09] (I've uploaded it on Phab) [00:49:40] F4032805 [00:50:03] oh, stashbot, where art thou? [00:51:18] Danny_B: https://phabricator.wikimedia.org/F4032805 [00:51:32] "{F4032805}" is what to use to have it displayed in a task [00:51:41] ok, i'm done [00:51:48] oh wait, someone reverted four of my reverts. lol [00:51:53] xD [00:52:07] !log stashbot test (T122690) [00:52:08] i know... stashbot links tasks, should link other phab stuff as well [00:52:08] T122690: Move stashbot tool to k8s - https://phabricator.wikimedia.org/T122690 [00:52:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:53:16] (03CR) 10Dzahn: [C: 032] RT: do not ensure=>latest,install perldoc [puppet] - 10https://gerrit.wikimedia.org/r/289796 (https://phabricator.wikimedia.org/T119112) (owner: 10Dzahn) [00:54:07] stashbot: because you think everybody will `wget -O cswiktionary-msg-repurposed-l10n-issue.png https://s3.amazonaws.com/upload.screenshot.co/93ac7370e0`, `arc upload cswiktionary-msg-repurposed-l10n-issue.png` and then say "{F...}" on the channel? [00:54:12] Danny_B: ^ [00:55:10] Bots should handle most general cases, not every corner case used once per year. [00:55:38] that's an incorrect case [00:57:05] when saying F123456 i expect stashbot to return "F123456: Image name - https://phabricator.wikimedia.org/F123456" - similar to what it does for tasks [00:57:35] yes, but are people saying F123456? [00:58:42] this came up before, afair we agreed some are nice to have, like P(astebins) but probably not F's [00:58:43] sometimes when discussing some screenshots? same with pastes i guess. it isn't harmful and it is user friendly since it saves work... [00:59:12] P666 [00:59:13] P666 salt wtf? - https://phabricator.wikimedia.org/P666 [00:59:18] there you go [00:59:21] It does DMPT now -- https://github.com/bd808/tools-stashbot/blob/master/stashbot/bot.py#L33 [00:59:47] :) [00:59:48] pull requests welcome :) [01:00:41] and i wonder ..
why does this server have both Apache modules, mod-fastcgi AND mod-fcgid [01:00:59] out of curiosity: why don't we host such bot tools on our vcs? (gerrit/diffusion/...) [01:01:51] because I wrote it, I run it, and I didn't bother to go through the hassle of requesting gerrit hosting [01:02:19] Danny_B: are you asking about bots in general, or this specific bot? [01:02:26] I'm working on making diffusion hosting of tools easier right now though [01:03:03] it was rather a general question since a bunch of other tools/bots are somewhere in outer space, but we do bugtracking for them in phab (formerly in bz) [01:03:22] bd808: <3 for that! [01:03:58] Danny_B: you can follow along at T133252 [01:03:58] T133252: Create application to manage Diffusion repositories for a Tool Labs project - https://phabricator.wikimedia.org/T133252 [01:05:35] The choice of hosting is a nuanced thing. Almost everyone wants it to be easy and fast to get a repo [01:05:54] some also want to build a "resume" of sorts on github or bitbucket [01:06:42] we require hosting or at least mirroring in gerrit/diffusion for production deploys [01:06:57] gerrit projects get cloned over to github. i think for diffusion that is in the works? [01:07:06] for things in Tool Labs I'm just really happy if source is published *anywhere* [01:07:08] I just have all my stuff on GitHub out of habit [01:07:21] and my stuff is split between GitHub Pages and Tool Labs [01:09:20] okayyyy. i think i've got this [01:10:03] bd808: Danny_B avoid F5 please :p [01:10:23] github is a very centralized thing while git itself is all about not being centralized [01:11:12] github has been a pretty amazing replacement for prior attempts like sourceforge [01:11:26] Dereckson: ??? [01:11:28] in a way it's good when things are split across different tools and not all in the same place [01:11:44] but I don't like the current assumption by many that it is the *only* place to find FLOSS software [01:12:08] (nobody who says FLOSS believes that but ...) [01:12:51] Danny_B: try to press F5 https:///F5 [01:13:17] I read a disturbing essay recently that postulated that we are in a post-open source world where licensing doesn't matter [01:13:25] hmm. interesting [01:13:30] Dereckson: bd808: Danny_B: https://gerrit.wikimedia.org/r/289802 [01:13:49] a few languages apparently have massive diffs in my patch, because… their JSON files were indented with spaces rather than tabs? [01:14:17] i think these might not have been exported from translatewiki for a long time, probably intentionally? [01:14:28] we should focus on making it easy to request new repos on our own tools, i don't know about diffusion yet [01:14:40] bd808: heh, wanna close the tickets about licensing with that? :) [01:14:54] mutante: lol. hell no [01:16:29] MatmaRex: use /paste for scripts, you gain syntax highlighting [01:17:21] MatmaRex: https://phabricator.wikimedia.org/P2868 if you wish a command to do arc-paste-file load-translations2.rb [01:20:07] MatmaRex: this one looks funky -- https://gerrit.wikimedia.org/r/#/c/289802/1/languages/i18n/kk-cyrl.json [01:21:40] bd808: hmmm [01:21:47] (03PS2) 10Dzahn: RT: loading mod_fastcgi wasnt puppetized [puppet] - 10https://gerrit.wikimedia.org/r/289795 (https://phabricator.wikimedia.org/T119112) [01:22:03] bd808: that language has LanguageConverter… blergh [01:26:14] bd808: yeah, something went weird in my script / the API… [01:26:34] i have had enough of this, i think.
that one language change should probably be undone in the commit (it's okay on translatewiki) [01:27:02] bd808: Dereckson: feel free to tweak that commit and +2, or ignore it - the next l10n-bot run will include these fixes anyway [01:27:38] i'm going to go sleep. see you :) [01:27:43] I suggest we remove the problematic language files and wait for the bot. [01:27:48] Good night. [01:28:22] PROBLEM - MariaDB Slave Lag: s4 on db1040 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 377.86 seconds [01:31:15] (strange, for other *-* that works) [01:32:49] let me know if you need some tests of that message revert [01:38:41] (03PS3) 10Dzahn: RT: loading mod_fastcgi wasnt puppetized [puppet] - 10https://gerrit.wikimedia.org/r/289795 (https://phabricator.wikimedia.org/T119112) [01:43:04] bd808: finally I removed gan-hant, kk-cyrl, kk-latn and zh-hant [01:44:09] Dereckson: +2'd [01:44:43] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0] [01:44:50] Okay, so do we wait until 3:00 UTC for l10nupdate, or do we sync the change now? [01:44:53] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0] [01:45:18] I think we can just wait for l10nupdate to fix things [01:45:52] If it's still messed up tomorrow we can backport and scap again [01:46:00] s/we/someone/ [01:46:02] :) [01:47:47] That's fine with me. [01:57:05] (03CR) 10Dzahn: [C: 032] RT: loading mod_fastcgi wasnt puppetized [puppet] - 10https://gerrit.wikimedia.org/r/289795 (https://phabricator.wikimedia.org/T119112) (owner: 10Dzahn) [01:57:13] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [01:57:23] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [02:00:23] Dereckson: if you'll be waiting for l10n update, then make sure that none of matmarex's changes got reverted on twn [02:06:16] Danny_B: hmmm? [02:06:58] Danny_B > according to https://wikitech.wikimedia.org/wiki/LocalisationUpdate "translatewiki.net staff commit translations to trunk" [02:07:57] so there won't be a commit with new translations to MediaWiki core in the MediaWiki master branch [02:12:14] Danny_B: what the job does is create a commit to the current wmf branches to backport the master l10n file changes [02:12:37] so we'll get 5bd7b72 [02:13:09] (well we'll get a change with the difference between the wmf branch and core branch for l10n files) [02:13:27] s/core/master [02:22:31] 06Operations, 13Patch-For-Review: move RT off of magnesium - https://phabricator.wikimedia.org/T119112#2311397 (10Dzahn) command for db schema upgrades: rt-setup-database-4 --dba rt --action upgrade --upgrade-from 4.0.4 --upgrade-to 4.2.8 it will ask for credentials to the db [02:27:13] 06Operations, 13Patch-For-Review: move RT off of magnesium - https://phabricator.wikimedia.org/T119112#2311402 (10Dzahn) there is also upgrade-mysql-schema.pl in /usr/share/request-tracker4/etc/upgrade/ which i tried before the former command and it created queries that i executed on the db, but basically jus... [03:19:00] 06Operations, 10Traffic, 07HTTPS: Secure connection failed when attempting to send POST request - https://phabricator.wikimedia.org/T134869#2311448 (10Thibaut120094) >>! In T134869#2281369, @BBlack wrote: > As a random experiment, perhaps some of those reporting could try this in FF 46.0.1? > > 1. Type 'abo...
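For reference, the manual import MatmaRex describes above (revert the bad edits on translatewiki.net, then pull the surviving translations back into core's languages/i18n/*.json files) can be approximated with the standard MediaWiki action API. The sketch below only illustrates that idea and is not the load-translations2.rb script pasted at P2868; the language list, the file handling and the merge step are assumptions, and a real import would also need to preserve the @metadata block and key order of the i18n files, which is where the tab-versus-space indentation problems mentioned above come from.

    # Hypothetical sketch: fetch the current translations of one repurposed
    # message from translatewiki.net and merge them into core's i18n files.
    # The message key is the one from the incident; the languages are examples.
    import json
    import pathlib
    import requests

    API = "https://translatewiki.net/w/api.php"
    KEY = "whatlinkshere-hidetrans"
    LANGS = ["cs", "de", "fr"]  # illustrative subset, not the full 28+ languages

    for lang in LANGS:
        # On translatewiki.net the wiki-side text lives at e.g.
        # "MediaWiki:Whatlinkshere-hidetrans/cs" (namespace 8).
        r = requests.get(API, params={
            "action": "query", "prop": "revisions", "rvprop": "content",
            "rvslots": "main", "titles": "MediaWiki:%s/%s" % (KEY, lang),
            "format": "json", "formatversion": "2",
        }, timeout=30)
        page = r.json()["query"]["pages"][0]
        if page.get("missing"):
            continue
        text = page["revisions"][0]["slots"]["main"]["content"]

        path = pathlib.Path("languages/i18n/%s.json" % lang)
        data = json.loads(path.read_text(encoding="utf-8"))
        data[KEY] = text
        # Core's i18n files are normally tab-indented; a few (gan-hant, kk-*,
        # zh-hant) were space-indented, which is why rewriting them produced
        # the huge diffs discussed above.
        path.write_text(json.dumps(data, ensure_ascii=False, indent="\t") + "\n",
                        encoding="utf-8")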
[04:20:22] PROBLEM - HHVM jobrunner on mw1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:21:57] (03PS1) 10Muehlenhoff: Update to 4.4.11 [debs/linux44] - 10https://gerrit.wikimedia.org/r/289813 [04:24:12] RECOVERY - HHVM jobrunner on mw1014 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.007 second response time [04:24:33] !log restarted hhvm on mw1014 (got stuck, output of hhvm-dump-debug available) [04:24:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:42:47] (03PS2) 10Muehlenhoff: Update to 4.4.11 [debs/linux44] - 10https://gerrit.wikimedia.org/r/289813 [04:50:24] (03PS3) 10Muehlenhoff: Update to 4.4.11 [debs/linux44] - 10https://gerrit.wikimedia.org/r/289813 [04:53:18] (03CR) 10Muehlenhoff: [C: 032 V: 032] Update to 4.4.11 [debs/linux44] - 10https://gerrit.wikimedia.org/r/289813 (owner: 10Muehlenhoff) [05:24:49] (03PS1) 10Muehlenhoff: Add missing CVE ID to changelog [debs/linux44] - 10https://gerrit.wikimedia.org/r/289816 [05:25:27] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add missing CVE ID to changelog [debs/linux44] - 10https://gerrit.wikimedia.org/r/289816 (owner: 10Muehlenhoff) [05:46:54] 06Operations: rebase librsvg security fixes - https://phabricator.wikimedia.org/T135804#2311589 (10MoritzMuehlenhoff) [05:50:13] PROBLEM - Host cp3032 is DOWN: PING CRITICAL - Packet loss = 100% [05:50:13] PROBLEM - Host cp3035 is DOWN: PING CRITICAL - Packet loss = 100% [05:50:13] PROBLEM - Host cp3036 is DOWN: PING CRITICAL - Packet loss = 100% [05:50:22] cannot access meta [05:50:33] PROBLEM - Host cp3033 is DOWN: PING CRITICAL - Packet loss = 100% [05:50:33] PROBLEM - Host cp3038 is DOWN: PING CRITICAL - Packet loss = 100% [05:50:33] PROBLEM - Host cp3039 is DOWN: PING CRITICAL - Packet loss = 100% [05:50:43] PROBLEM - Host cp3040 is DOWN: PING CRITICAL - Packet loss = 100% [05:50:43] PROBLEM - Host cp3048 is DOWN: PING CRITICAL - Packet loss = 100% [05:50:43] PROBLEM - Host cp3006 is DOWN: PING CRITICAL - Packet loss = 100% [05:50:44] PROBLEM - Host cp3045 is DOWN: PING CRITICAL - Packet loss = 100% [05:50:44] PROBLEM - Host cp3008 is DOWN: PING CRITICAL - Packet loss = 100% [05:50:44] PROBLEM - Host cp3004 is DOWN: PING CRITICAL - Packet loss = 100% [05:50:44] PROBLEM - Host cp3049 is DOWN: PING CRITICAL - Packet loss = 100% [05:51:02] PROBLEM - Host 2620:0:862:1:91:198:174:122 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:1:91:198:174:122 [05:51:03] PROBLEM - Host 2620:0:862:1:91:198:174:106 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:1:91:198:174:106 [05:51:03] PROBLEM - Host cp3009 is DOWN: PING CRITICAL - Packet loss = 100% [05:51:03] PROBLEM - Host cp3031 is DOWN: PING CRITICAL - Packet loss = 100% [05:51:03] PROBLEM - Host cp3007 is DOWN: PING CRITICAL - Packet loss = 100% [05:51:14] PROBLEM - Host 91.198.174.106 is DOWN: PING CRITICAL - Packet loss = 100% [05:51:16] PROBLEM - Host misc-web-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [05:51:22] PROBLEM - Host cp3043 is DOWN: PING CRITICAL - Packet loss = 100% [05:51:22] PROBLEM - Host ns2-v6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::e [05:51:22] PROBLEM - Host cp3010 is DOWN: PING CRITICAL - Packet loss = 100% [05:51:22] PROBLEM - Host cp3037 is DOWN: PING CRITICAL - Packet loss = 100% [05:51:22] PROBLEM - Host cp3046 is DOWN: PING CRITICAL - Packet loss = 100% [05:51:22] PROBLEM - Host cp3047 is DOWN: PING CRITICAL - Packet loss = 100% [05:51:22] PROBLEM - Host cp3003 is DOWN: PING CRITICAL - Packet loss = 100% [05:51:23] PROBLEM - Host 
cp3034 is DOWN: PING CRITICAL - Packet loss = 100% [05:51:34] PROBLEM - Host cp3044 is DOWN: PING CRITICAL - Packet loss = 100% [05:51:42] PROBLEM - Host ms-be3003 is DOWN: PING CRITICAL - Packet loss = 100% [05:51:42] PROBLEM - Host lvs3001 is DOWN: PING CRITICAL - Packet loss = 100% [05:51:42] PROBLEM - Host ms-be3002 is DOWN: PING CRITICAL - Packet loss = 100% [05:51:42] PROBLEM - Host ms-be3001 is DOWN: PING CRITICAL - Packet loss = 100% [05:51:42] PROBLEM - Host maerlant is DOWN: PING CRITICAL - Packet loss = 100% [05:51:42] PROBLEM - Host lvs3004 is DOWN: PING CRITICAL - Packet loss = 100% [05:51:42] PROBLEM - Host ns2-v4 is DOWN: PING CRITICAL - Packet loss = 100% [05:51:43] PROBLEM - Host lvs3002 is DOWN: PING CRITICAL - Packet loss = 100% [05:51:43] PROBLEM - Host cp3041 is DOWN: PING CRITICAL - Packet loss = 100% [05:51:56] PROBLEM - Host text-lb.esams.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::1 [05:51:57] PROBLEM - Host bast3001 is DOWN: PING CRITICAL - Packet loss = 100% [05:51:57] PROBLEM - Host asw-esams.mgmt.esams.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [05:51:57] PROBLEM - Host ms-fe3001 is DOWN: PING CRITICAL - Packet loss = 100% [05:52:02] cannot access phabricator, but can access gerrit :o [05:52:02] PROBLEM - Host cp3042 is DOWN: PING CRITICAL - Packet loss = 100% [05:52:03] PROBLEM - Host nescio is DOWN: PING CRITICAL - Packet loss = 100% [05:52:16] PROBLEM - Host upload-lb.esams.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::2:b [05:52:17] PROBLEM - Host cp3030 is DOWN: PING CRITICAL - Packet loss = 100% [05:52:17] PROBLEM - Host wikidata is DOWN: PING CRITICAL - Packet loss = 100% [05:52:17] PROBLEM - Host cp3005 is DOWN: PING CRITICAL - Packet loss = 100% [05:52:22] PROBLEM - Host cp3021 is DOWN: PING CRITICAL - Packet loss = 100% [05:52:22] PROBLEM - Host cp3016 is DOWN: PING CRITICAL - Packet loss = 100% [05:52:22] PROBLEM - Host cp3018 is DOWN: PING CRITICAL - Packet loss = 100% [05:52:22] PROBLEM - Host cp3015 is DOWN: PING CRITICAL - Packet loss = 100% [05:52:22] PROBLEM - Host cp3013 is DOWN: PING CRITICAL - Packet loss = 100% [05:52:22] PROBLEM - Host cp3022 is DOWN: PING CRITICAL - Packet loss = 100% [05:52:22] PROBLEM - Host cp3020 is DOWN: PING CRITICAL - Packet loss = 100% [05:52:23] PROBLEM - Host cp3017 is DOWN: PING CRITICAL - Packet loss = 100% [05:52:23] PROBLEM - Host cr2-knams is DOWN: PING CRITICAL - Packet loss = 100% [05:52:24] PROBLEM - Host cr1-esams is DOWN: PING CRITICAL - Packet loss = 100% [05:52:24] PROBLEM - Host lvs3003 is DOWN: PING CRITICAL - Packet loss = 100% [05:52:25] PROBLEM - Host eeden is DOWN: PING CRITICAL - Packet loss = 100% [05:52:45] PROBLEM - Host upload-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [05:52:49] PROBLEM - Host misc-web-lb.esams.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::3:d [05:53:03] PROBLEM - Host mr1-esams is DOWN: PING CRITICAL - Packet loss = 100% [05:53:29] Nikerabbit: phabricator is up for me. [05:53:52] PROBLEM - Host 91.198.174.122 is DOWN: PING CRITICAL - Packet loss = 100% [05:53:53] PROBLEM - Host csw2-esams.mgmt.esams.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [05:55:26] I'm here [05:55:33] PROBLEM - Host mr1-esams IPv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ffff::1 [05:55:35] <_joe_> bblack: I'm taking out esams ASAP [05:55:37] me too. 
I guess joe is [05:55:49] (03PS1) 10BBlack: Depool esams [dns] - 10https://gerrit.wikimedia.org/r/289817 [05:55:49] working on it, I was gonna say but he said it himself [05:55:51] <_joe_> shit gerrit doesn't work [05:56:03] <_joe_> bblack: thanks, I am out of gerrit [05:56:09] (03CR) 10BBlack: [C: 032 V: 032] Depool esams [dns] - 10https://gerrit.wikimedia.org/r/289817 (owner: 10BBlack) [05:56:09] !log Killed transaction 3262258 on db1040 (alter table stuck in "Waiting for table metadata lock" blocking the replica) T130692 [05:56:10] T130692: Add new indexes from eec016ece6d2b30addcdf3d3efcc2ba59b10e858 to production databases - https://phabricator.wikimedia.org/T130692 [05:56:12] PROBLEM - Host cr2-knams IPv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ffff::4 [05:56:12] PROBLEM - Host cr1-esams IPv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ffff::5 [05:56:12] PROBLEM - Host cr2-esams IPv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ffff::3 [05:56:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [05:57:26] I'm trying to get into esams direct now, to kill dns there too [05:57:54] <_joe_> bblack: I can get in I guess [05:57:56] <_joe_> let me try [05:58:13] I'm in now [05:58:36] !log gdnsd stopped on eeden.esams, puppet disabled [05:58:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [05:58:50] <_joe_> bblack: lol I did the same things :P [06:01:38] (03CR) 10Giuseppe Lavagetto: "Hey... who do you think you're talking to, people? :)" [puppet] - 10https://gerrit.wikimedia.org/r/289683 (https://phabricator.wikimedia.org/T135749) (owner: 10Giuseppe Lavagetto) [06:02:13] PROBLEM - IPsec on cp1047 is CRITICAL: Strongswan CRITICAL - ok: 16 not-conn: cp3003_v4, cp3003_v6, cp3004_v4, cp3004_v6, cp3005_v4, cp3005_v6, cp3006_v4, cp3006_v6 [06:02:13] PROBLEM - IPsec on cp1051 is CRITICAL: Strongswan CRITICAL - ok: 16 not-conn: cp3007_v4, cp3007_v6, cp3008_v4, cp3008_v6, cp3009_v4, cp3009_v6, cp3010_v4, cp3010_v6 [06:02:13] PROBLEM - IPsec on cp1052 is CRITICAL: Strongswan CRITICAL - ok: 28 not-conn: cp3030_v4, cp3030_v6, cp3031_v4, cp3031_v6, cp3032_v4, cp3032_v6, cp3033_v4, cp3033_v6, cp3040_v4, cp3040_v6, cp3041_v4, cp3041_v6, cp3042_v4, cp3042_v6, cp3043_v4, cp3043_v6 [06:02:14] PROBLEM - IPsec on cp1045 is CRITICAL: Strongswan CRITICAL - ok: 16 not-conn: cp3007_v4, cp3007_v6, cp3008_v4, cp3008_v6, cp3009_v4, cp3009_v6, cp3010_v4, cp3010_v6 [06:02:14] PROBLEM - IPsec on cp2008 is CRITICAL: Strongswan CRITICAL - ok: 46 not-conn: cp3034_v4, cp3034_v6, cp3035_v4, cp3035_v6, cp3036_v4, cp3036_v6, cp3037_v4, cp3037_v6, cp3038_v4, cp3038_v6, cp3039_v4, cp3039_v6, cp3044_v4, cp3044_v6, cp3045_v4, cp3045_v6, cp3046_v4, cp3046_v6, cp3047_v4, cp3047_v6, cp3048_v4, cp3048_v6, cp3049_v4, cp3049_v6 [06:02:22] PROBLEM - IPsec on cp2001 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp3030_v4, cp3030_v6, cp3031_v4, cp3031_v6, cp3032_v4, cp3032_v6, cp3033_v4, cp3033_v6, cp3040_v4, cp3040_v6, cp3041_v4, cp3041_v6, cp3042_v4, cp3042_v6, cp3043_v4, cp3043_v6 [06:02:22] PROBLEM - IPsec on cp2023 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp3030_v4, cp3030_v6, cp3031_v4, cp3031_v6, cp3032_v4, cp3032_v6, cp3033_v4, cp3033_v6, cp3040_v4, cp3040_v6, cp3041_v4, cp3041_v6, cp3042_v4, cp3042_v6, cp3043_v4, cp3043_v6 [06:02:23] PROBLEM - IPsec on kafka1022 is CRITICAL: Strongswan CRITICAL - ok: 92 connecting: cp3003_v4, cp3003_v6, cp3004_v4, cp3004_v6, cp3005_v4, cp3005_v6, cp3006_v4, cp3006_v6, cp3007_v4, cp3007_v6, 
cp3008_v4, cp3008_v6, cp3009_v4, cp3009_v6, cp3010_v4, cp3010_v6, cp3030_v4, cp3030_v6, cp3031_v4, cp3031_v6, cp3032_v4, cp3032_v6, cp3033_v4, cp3033_v6, cp3034_v4, cp3034_v6, cp3035_v4, cp3035_v6, cp3036_v4, cp3036_v6, cp3037_v4, cp3037_v6, cp3 [06:02:33] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 32 not-conn: cp3034_v4, cp3034_v6, cp3035_v4, cp3035_v6, cp3036_v4, cp3036_v6, cp3037_v4, cp3037_v6, cp3038_v4, cp3038_v6, cp3039_v4, cp3039_v6, cp3044_v4, cp3044_v6, cp3045_v4, cp3045_v6, cp3046_v4, cp3046_v6, cp3047_v4, cp3047_v6, cp3048_v4, cp3048_v6, cp3049_v4, cp3049_v6 [06:02:34] PROBLEM - IPsec on cp1055 is CRITICAL: Strongswan CRITICAL - ok: 28 not-conn: cp3030_v4, cp3030_v6, cp3031_v4, cp3031_v6, cp3032_v4, cp3032_v6, cp3033_v4, cp3033_v6, cp3040_v4, cp3040_v6, cp3041_v4, cp3041_v6, cp3042_v4, cp3042_v6, cp3043_v4, cp3043_v6 [06:02:45] and now we get the related ipsec spam :) [06:02:52] PROBLEM - IPsec on cp1099 is CRITICAL: Strongswan CRITICAL - ok: 32 not-conn: cp3034_v4, cp3034_v6, cp3035_v4, cp3035_v6, cp3036_v4, cp3036_v6, cp3037_v4, cp3037_v6, cp3038_v4, cp3038_v6, cp3039_v4, cp3039_v6, cp3044_v4, cp3044_v6, cp3045_v4, cp3045_v6, cp3046_v4, cp3046_v6, cp3047_v4, cp3047_v6, cp3048_v4, cp3048_v6, cp3049_v4, cp3049_v6 [06:02:52] PROBLEM - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 32 not-conn: cp3034_v4, cp3034_v6, cp3035_v4, cp3035_v6, cp3036_v4, cp3036_v6, cp3037_v4, cp3037_v6, cp3038_v4, cp3038_v6, cp3039_v4, cp3039_v6, cp3044_v4, cp3044_v6, cp3045_v4, cp3045_v6, cp3046_v4, cp3046_v6, cp3047_v4, cp3047_v6, cp3048_v4, cp3048_v6, cp3049_v4, cp3049_v6 [06:02:53] PROBLEM - IPsec on kafka1020 is CRITICAL: Strongswan CRITICAL - ok: 92 connecting: cp3003_v4, cp3003_v6, cp3004_v4, cp3004_v6, cp3005_v4, cp3005_v6, cp3006_v4, cp3006_v6, cp3007_v4, cp3007_v6, cp3008_v4, cp3008_v6, cp3009_v4, cp3009_v6, cp3010_v4, cp3010_v6, cp3030_v4, cp3030_v6, cp3031_v4, cp3031_v6, cp3032_v4, cp3032_v6, cp3033_v4, cp3033_v6, cp3034_v4, cp3034_v6, cp3035_v4, cp3035_v6, cp3036_v4, cp3036_v6, cp3037_v4, cp3037_v6, cp3 [06:02:54] odd to see it and not be my fault. 
[06:03:01] PROBLEM - IPsec on cp2006 is CRITICAL: Strongswan CRITICAL - ok: 28 not-conn: cp3007_v4, cp3007_v6, cp3008_v4, cp3008_v6, cp3009_v4, cp3009_v6, cp3010_v4, cp3010_v6 [06:03:02] PROBLEM - IPsec on cp2017 is CRITICAL: Strongswan CRITICAL - ok: 46 not-conn: cp3034_v4, cp3034_v6, cp3035_v4, cp3035_v6, cp3036_v4, cp3036_v6, cp3037_v4, cp3037_v6, cp3038_v4, cp3038_v6, cp3039_v4, cp3039_v6, cp3044_v4, cp3044_v6, cp3045_v4, cp3045_v6, cp3046_v4, cp3046_v6, cp3047_v4, cp3047_v6, cp3048_v4, cp3048_v6, cp3049_v4, cp3049_v6 [06:03:02] PROBLEM - IPsec on cp2003 is CRITICAL: Strongswan CRITICAL - ok: 28 not-conn: cp3003_v4, cp3003_v6, cp3004_v4, cp3004_v6, cp3005_v4, cp3005_v6, cp3006_v4, cp3006_v6 [06:03:02] PROBLEM - IPsec on cp2021 is CRITICAL: Strongswan CRITICAL - ok: 28 not-conn: cp3003_v4, cp3003_v6, cp3004_v4, cp3004_v6, cp3005_v4, cp3005_v6, cp3006_v4, cp3006_v6 [06:03:03] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 32 not-conn: cp3034_v4, cp3034_v6, cp3035_v4, cp3035_v6, cp3036_v4, cp3036_v6, cp3037_v4, cp3037_v6, cp3038_v4, cp3038_v6, cp3039_v4, cp3039_v6, cp3044_v4, cp3044_v6, cp3045_v4, cp3045_v6, cp3046_v4, cp3046_v6, cp3047_v4, cp3047_v6, cp3048_v4, cp3048_v6, cp3049_v4, cp3049_v6 [06:03:11] PROBLEM - IPsec on kafka1012 is CRITICAL: Strongswan CRITICAL - ok: 92 connecting: cp3003_v4, cp3003_v6, cp3004_v4, cp3004_v6, cp3005_v4, cp3005_v6, cp3006_v4, cp3006_v6, cp3007_v4, cp3007_v6, cp3008_v4, cp3008_v6, cp3009_v4, cp3009_v6, cp3010_v4, cp3010_v6, cp3030_v4, cp3030_v6, cp3031_v4, cp3031_v6, cp3032_v4, cp3032_v6, cp3033_v4, cp3033_v6, cp3034_v4, cp3034_v6, cp3035_v4, cp3035_v6, cp3036_v4, cp3036_v6, cp3037_v4, cp3037_v6, cp3 [06:03:11] PROBLEM - IPsec on cp2016 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp3030_v4, cp3030_v6, cp3031_v4, cp3031_v6, cp3032_v4, cp3032_v6, cp3033_v4, cp3033_v6, cp3040_v4, cp3040_v6, cp3041_v4, cp3041_v6, cp3042_v4, cp3042_v6, cp3043_v4, cp3043_v6 [06:03:12] PROBLEM - IPsec on cp1062 is CRITICAL: Strongswan CRITICAL - ok: 32 not-conn: cp3034_v4, cp3034_v6, cp3035_v4, cp3035_v6, cp3036_v4, cp3036_v6, cp3037_v4, cp3037_v6, cp3038_v4, cp3038_v6, cp3039_v4, cp3039_v6, cp3044_v4, cp3044_v6, cp3045_v4, cp3045_v6, cp3046_v4, cp3046_v6, cp3047_v4, cp3047_v6, cp3048_v4, cp3048_v6, cp3049_v4, cp3049_v6 [06:03:21] PROBLEM - IPsec on kafka1018 is CRITICAL: Strongswan CRITICAL - ok: 92 connecting: cp3003_v4, cp3003_v6, cp3004_v4, cp3004_v6, cp3005_v4, cp3005_v6, cp3006_v4, cp3006_v6, cp3007_v4, cp3007_v6, cp3008_v4, cp3008_v6, cp3009_v4, cp3009_v6, cp3010_v4, cp3010_v6, cp3030_v4, cp3030_v6, cp3031_v4, cp3031_v6, cp3032_v4, cp3032_v6, cp3033_v4, cp3033_v6, cp3034_v4, cp3034_v6, cp3035_v4, cp3035_v6, cp3036_v4, cp3036_v6, cp3037_v4, cp3037_v6, cp3 [06:03:21] PROBLEM - IPsec on cp1060 is CRITICAL: Strongswan CRITICAL - ok: 16 not-conn: cp3003_v4, cp3003_v6, cp3004_v4, cp3004_v6, cp3005_v4, cp3005_v6, cp3006_v4, cp3006_v6 [06:03:23] PROBLEM - IPsec on cp2018 is CRITICAL: Strongswan CRITICAL - ok: 28 not-conn: cp3007_v4, cp3007_v6, cp3008_v4, cp3008_v6, cp3009_v4, cp3009_v6, cp3010_v4, cp3010_v6 [06:03:31] PROBLEM - IPsec on cp1064 is CRITICAL: Strongswan CRITICAL - ok: 32 not-conn: cp3034_v4, cp3034_v6, cp3035_v4, cp3035_v6, cp3036_v4, cp3036_v6, cp3037_v4, cp3037_v6, cp3038_v4, cp3038_v6, cp3039_v4, cp3039_v6, cp3044_v4, cp3044_v6, cp3045_v4, cp3045_v6, cp3046_v4, cp3046_v6, cp3047_v4, cp3047_v6, cp3048_v4, cp3048_v6, cp3049_v4, cp3049_v6 [06:03:31] PROBLEM - IPsec on cp2024 is CRITICAL: Strongswan CRITICAL - ok: 46 not-conn: cp3034_v4, 
cp3034_v6, cp3035_v4, cp3035_v6, cp3036_v4, cp3036_v6, cp3037_v4, cp3037_v6, cp3038_v4, cp3038_v6, cp3039_v4, cp3039_v6, cp3044_v4, cp3044_v6, cp3045_v4, cp3045_v6, cp3046_v4, cp3046_v6, cp3047_v4, cp3047_v6, cp3048_v4, cp3048_v6, cp3049_v4, cp3049_v6 [06:03:42] PROBLEM - IPsec on cp1046 is CRITICAL: Strongswan CRITICAL - ok: 16 not-conn: cp3003_v4, cp3003_v6, cp3004_v4, cp3004_v6, cp3005_v4, cp3005_v6, cp3006_v4, cp3006_v6 [06:03:42] PROBLEM - IPsec on cp1049 is CRITICAL: Strongswan CRITICAL - ok: 32 not-conn: cp3034_v4, cp3034_v6, cp3035_v4, cp3035_v6, cp3036_v4, cp3036_v6, cp3037_v4, cp3037_v6, cp3038_v4, cp3038_v6, cp3039_v4, cp3039_v6, cp3044_v4, cp3044_v6, cp3045_v4, cp3045_v6, cp3046_v4, cp3046_v6, cp3047_v4, cp3047_v6, cp3048_v4, cp3048_v6, cp3049_v4, cp3049_v6 [06:03:42] PROBLEM - IPsec on cp1063 is CRITICAL: Strongswan CRITICAL - ok: 32 not-conn: cp3034_v4, cp3034_v6, cp3035_v4, cp3035_v6, cp3036_v4, cp3036_v6, cp3037_v4, cp3037_v6, cp3038_v4, cp3038_v6, cp3039_v4, cp3039_v6, cp3044_v4, cp3044_v6, cp3045_v4, cp3045_v6, cp3046_v4, cp3046_v6, cp3047_v4, cp3047_v6, cp3048_v4, cp3048_v6, cp3049_v4, cp3049_v6 [06:03:42] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 32 not-conn: cp3034_v4, cp3034_v6, cp3035_v4, cp3035_v6, cp3036_v4, cp3036_v6, cp3037_v4, cp3037_v6, cp3038_v4, cp3038_v6, cp3039_v4, cp3039_v6, cp3044_v4, cp3044_v6, cp3045_v4, cp3045_v6, cp3046_v4, cp3046_v6, cp3047_v4, cp3047_v6, cp3048_v4, cp3048_v6, cp3049_v4, cp3049_v6 [06:03:42] PROBLEM - IPsec on cp2009 is CRITICAL: Strongswan CRITICAL - ok: 28 not-conn: cp3003_v4, cp3003_v6, cp3004_v4, cp3004_v6, cp3005_v4, cp3005_v6, cp3006_v4, cp3006_v6 [06:03:53] PROBLEM - IPsec on cp1065 is CRITICAL: Strongswan CRITICAL - ok: 28 not-conn: cp3030_v4, cp3030_v6, cp3031_v4, cp3031_v6, cp3032_v4, cp3032_v6, cp3033_v4, cp3033_v6, cp3040_v4, cp3040_v6, cp3041_v4, cp3041_v6, cp3042_v4, cp3042_v6, cp3043_v4, cp3043_v6 [06:03:53] PROBLEM - IPsec on cp2012 is CRITICAL: Strongswan CRITICAL - ok: 28 not-conn: cp3007_v4, cp3007_v6, cp3008_v4, cp3008_v6, cp3009_v4, cp3009_v6, cp3010_v4, cp3010_v6 [06:03:53] PROBLEM - IPsec on cp2019 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp3030_v4, cp3030_v6, cp3031_v4, cp3031_v6, cp3032_v4, cp3032_v6, cp3033_v4, cp3033_v6, cp3040_v4, cp3040_v6, cp3041_v4, cp3041_v6, cp3042_v4, cp3042_v6, cp3043_v4, cp3043_v6 [06:03:53] PROBLEM - IPsec on cp2015 is CRITICAL: Strongswan CRITICAL - ok: 28 not-conn: cp3003_v4, cp3003_v6, cp3004_v4, cp3004_v6, cp3005_v4, cp3005_v6, cp3006_v4, cp3006_v6 [06:04:02] PROBLEM - IPsec on cp1050 is CRITICAL: Strongswan CRITICAL - ok: 32 not-conn: cp3034_v4, cp3034_v6, cp3035_v4, cp3035_v6, cp3036_v4, cp3036_v6, cp3037_v4, cp3037_v6, cp3038_v4, cp3038_v6, cp3039_v4, cp3039_v6, cp3044_v4, cp3044_v6, cp3045_v4, cp3045_v6, cp3046_v4, cp3046_v6, cp3047_v4, cp3047_v6, cp3048_v4, cp3048_v6, cp3049_v4, cp3049_v6 [06:04:02] PROBLEM - IPsec on cp2004 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp3030_v4, cp3030_v6, cp3031_v4, cp3031_v6, cp3032_v4, cp3032_v6, cp3033_v4, cp3033_v6, cp3040_v4, cp3040_v6, cp3041_v4, cp3041_v6, cp3042_v4, cp3042_v6, cp3043_v4, cp3043_v6 [06:04:02] PROBLEM - IPsec on kafka1014 is CRITICAL: Strongswan CRITICAL - ok: 92 connecting: cp3003_v4, cp3003_v6, cp3004_v4, cp3004_v6, cp3005_v4, cp3005_v6, cp3006_v4, cp3006_v6, cp3007_v4, cp3007_v6, cp3008_v4, cp3008_v6, cp3009_v4, cp3009_v6, cp3010_v4, cp3010_v6, cp3030_v4, cp3030_v6, cp3031_v4, cp3031_v6, cp3032_v4, cp3032_v6, cp3033_v4, cp3033_v6, cp3034_v4, cp3034_v6, cp3035_v4, 
cp3035_v6, cp3036_v4, cp3036_v6, cp3037_v4, cp3037_v6, cp3 [06:04:22] PROBLEM - IPsec on cp1066 is CRITICAL: Strongswan CRITICAL - ok: 28 not-conn: cp3030_v4, cp3030_v6, cp3031_v4, cp3031_v6, cp3032_v4, cp3032_v6, cp3033_v4, cp3033_v6, cp3040_v4, cp3040_v6, cp3041_v4, cp3041_v6, cp3042_v4, cp3042_v6, cp3043_v4, cp3043_v6 [06:04:22] PROBLEM - IPsec on cp2007 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp3030_v4, cp3030_v6, cp3031_v4, cp3031_v6, cp3032_v4, cp3032_v6, cp3033_v4, cp3033_v6, cp3040_v4, cp3040_v6, cp3041_v4, cp3041_v6, cp3042_v4, cp3042_v6, cp3043_v4, cp3043_v6 [06:04:22] PROBLEM - IPsec on cp2022 is CRITICAL: Strongswan CRITICAL - ok: 46 not-conn: cp3034_v4, cp3034_v6, cp3035_v4, cp3035_v6, cp3036_v4, cp3036_v6, cp3037_v4, cp3037_v6, cp3038_v4, cp3038_v6, cp3039_v4, cp3039_v6, cp3044_v4, cp3044_v6, cp3045_v4, cp3045_v6, cp3046_v4, cp3046_v6, cp3047_v4, cp3047_v6, cp3048_v4, cp3048_v6, cp3049_v4, cp3049_v6 [06:04:33] PROBLEM - IPsec on cp1053 is CRITICAL: Strongswan CRITICAL - ok: 28 not-conn: cp3030_v4, cp3030_v6, cp3031_v4, cp3031_v6, cp3032_v4, cp3032_v6, cp3033_v4, cp3033_v6, cp3040_v4, cp3040_v6, cp3041_v4, cp3041_v6, cp3042_v4, cp3042_v6, cp3043_v4, cp3043_v6 [06:04:33] PROBLEM - IPsec on cp1067 is CRITICAL: Strongswan CRITICAL - ok: 28 not-conn: cp3030_v4, cp3030_v6, cp3031_v4, cp3031_v6, cp3032_v4, cp3032_v6, cp3033_v4, cp3033_v6, cp3040_v4, cp3040_v6, cp3041_v4, cp3041_v6, cp3042_v4, cp3042_v6, cp3043_v4, cp3043_v6 [06:04:33] PROBLEM - IPsec on cp2010 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp3030_v4, cp3030_v6, cp3031_v4, cp3031_v6, cp3032_v4, cp3032_v6, cp3033_v4, cp3033_v6, cp3040_v4, cp3040_v6, cp3041_v4, cp3041_v6, cp3042_v4, cp3042_v6, cp3043_v4, cp3043_v6 [06:04:33] PROBLEM - IPsec on cp2025 is CRITICAL: Strongswan CRITICAL - ok: 28 not-conn: cp3007_v4, cp3007_v6, cp3008_v4, cp3008_v6, cp3009_v4, cp3009_v6, cp3010_v4, cp3010_v6 [06:04:51] PROBLEM - IPsec on cp1068 is CRITICAL: Strongswan CRITICAL - ok: 28 not-conn: cp3030_v4, cp3030_v6, cp3031_v4, cp3031_v6, cp3032_v4, cp3032_v6, cp3033_v4, cp3033_v6, cp3040_v4, cp3040_v6, cp3041_v4, cp3041_v6, cp3042_v4, cp3042_v6, cp3043_v4, cp3043_v6 [06:04:51] PROBLEM - IPsec on cp2013 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp3030_v4, cp3030_v6, cp3031_v4, cp3031_v6, cp3032_v4, cp3032_v6, cp3033_v4, cp3033_v6, cp3040_v4, cp3040_v6, cp3041_v4, cp3041_v6, cp3042_v4, cp3042_v6, cp3043_v4, cp3043_v6 [06:05:12] PROBLEM - IPsec on kafka1013 is CRITICAL: Strongswan CRITICAL - ok: 92 connecting: cp3003_v4, cp3003_v6, cp3004_v4, cp3004_v6, cp3005_v4, cp3005_v6, cp3006_v4, cp3006_v6, cp3007_v4, cp3007_v6, cp3008_v4, cp3008_v6, cp3009_v4, cp3009_v6, cp3010_v4, cp3010_v6, cp3030_v4, cp3030_v6, cp3031_v4, cp3031_v6, cp3032_v4, cp3032_v6, cp3033_v4, cp3033_v6, cp3034_v4, cp3034_v6, cp3035_v4, cp3035_v6, cp3036_v4, cp3036_v6, cp3037_v4, cp3037_v6, cp3 [06:05:22] PROBLEM - IPsec on cp2002 is CRITICAL: Strongswan CRITICAL - ok: 46 not-conn: cp3034_v4, cp3034_v6, cp3035_v4, cp3035_v6, cp3036_v4, cp3036_v6, cp3037_v4, cp3037_v6, cp3038_v4, cp3038_v6, cp3039_v4, cp3039_v6, cp3044_v4, cp3044_v6, cp3045_v4, cp3045_v6, cp3046_v4, cp3046_v6, cp3047_v4, cp3047_v6, cp3048_v4, cp3048_v6, cp3049_v4, cp3049_v6 [06:05:22] PROBLEM - IPsec on cp2026 is CRITICAL: Strongswan CRITICAL - ok: 46 not-conn: cp3034_v4, cp3034_v6, cp3035_v4, cp3035_v6, cp3036_v4, cp3036_v6, cp3037_v4, cp3037_v6, cp3038_v4, cp3038_v6, cp3039_v4, cp3039_v6, cp3044_v4, cp3044_v6, cp3045_v4, cp3045_v6, cp3046_v4, cp3046_v6, cp3047_v4, cp3047_v6, 
cp3048_v4, cp3048_v6, cp3049_v4, cp3049_v6 [06:05:22] RECOVERY - MariaDB Slave Lag: s4 on db1040 is OK: OK slave_sql_lag Replication lag: 0.73 seconds [06:05:41] PROBLEM - IPsec on cp1058 is CRITICAL: Strongswan CRITICAL - ok: 16 not-conn: cp3007_v4, cp3007_v6, cp3008_v4, cp3008_v6, cp3009_v4, cp3009_v6, cp3010_v4, cp3010_v6 [06:05:41] PROBLEM - IPsec on cp1054 is CRITICAL: Strongswan CRITICAL - ok: 28 not-conn: cp3030_v4, cp3030_v6, cp3031_v4, cp3031_v6, cp3032_v4, cp3032_v6, cp3033_v4, cp3033_v6, cp3040_v4, cp3040_v6, cp3041_v4, cp3041_v6, cp3042_v4, cp3042_v6, cp3043_v4, cp3043_v6 [06:05:41] PROBLEM - IPsec on cp1059 is CRITICAL: Strongswan CRITICAL - ok: 16 not-conn: cp3003_v4, cp3003_v6, cp3004_v4, cp3004_v6, cp3005_v4, cp3005_v6, cp3006_v4, cp3006_v6 [06:05:41] PROBLEM - IPsec on cp2005 is CRITICAL: Strongswan CRITICAL - ok: 46 not-conn: cp3034_v4, cp3034_v6, cp3035_v4, cp3035_v6, cp3036_v4, cp3036_v6, cp3037_v4, cp3037_v6, cp3038_v4, cp3038_v6, cp3039_v4, cp3039_v6, cp3044_v4, cp3044_v6, cp3045_v4, cp3045_v6, cp3046_v4, cp3046_v6, cp3047_v4, cp3047_v6, cp3048_v4, cp3048_v6, cp3049_v4, cp3049_v6 [06:05:41] PROBLEM - IPsec on cp2011 is CRITICAL: Strongswan CRITICAL - ok: 46 not-conn: cp3034_v4, cp3034_v6, cp3035_v4, cp3035_v6, cp3036_v4, cp3036_v6, cp3037_v4, cp3037_v6, cp3038_v4, cp3038_v6, cp3039_v4, cp3039_v6, cp3044_v4, cp3044_v6, cp3045_v4, cp3045_v6, cp3046_v4, cp3046_v6, cp3047_v4, cp3047_v6, cp3048_v4, cp3048_v6, cp3049_v4, cp3049_v6 [06:05:41] PROBLEM - IPsec on cp2020 is CRITICAL: Strongswan CRITICAL - ok: 46 not-conn: cp3034_v4, cp3034_v6, cp3035_v4, cp3035_v6, cp3036_v4, cp3036_v6, cp3037_v4, cp3037_v6, cp3038_v4, cp3038_v6, cp3039_v4, cp3039_v6, cp3044_v4, cp3044_v6, cp3045_v4, cp3045_v6, cp3046_v4, cp3046_v6, cp3047_v4, cp3047_v6, cp3048_v4, cp3048_v6, cp3049_v4, cp3049_v6 [06:06:11] PROBLEM - IPsec on cp1061 is CRITICAL: Strongswan CRITICAL - ok: 16 not-conn: cp3007_v4, cp3007_v6, cp3008_v4, cp3008_v6, cp3009_v4, cp3009_v6, cp3010_v4, cp3010_v6 [06:06:11] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 32 not-conn: cp3034_v4, cp3034_v6, cp3035_v4, cp3035_v6, cp3036_v4, cp3036_v6, cp3037_v4, cp3037_v6, cp3038_v4, cp3038_v6, cp3039_v4, cp3039_v6, cp3044_v4, cp3044_v6, cp3045_v4, cp3045_v6, cp3046_v4, cp3046_v6, cp3047_v4, cp3047_v6, cp3048_v4, cp3048_v6, cp3049_v4, cp3049_v6 [06:06:11] PROBLEM - IPsec on cp2014 is CRITICAL: Strongswan CRITICAL - ok: 46 not-conn: cp3034_v4, cp3034_v6, cp3035_v4, cp3035_v6, cp3036_v4, cp3036_v6, cp3037_v4, cp3037_v6, cp3038_v4, cp3038_v6, cp3039_v4, cp3039_v6, cp3044_v4, cp3044_v6, cp3045_v4, cp3045_v6, cp3046_v4, cp3046_v6, cp3047_v4, cp3047_v6, cp3048_v4, cp3048_v6, cp3049_v4, cp3049_v6 [06:06:53] ACKNOWLEDGEMENT - IPsec on cp1045 is CRITICAL: Strongswan CRITICAL - ok: 16 not-conn: cp3007_v4, cp3007_v6, cp3008_v4, cp3008_v6, cp3009_v4, cp3009_v6, cp3010_v4, cp3010_v6 Brandon Black esams link down [06:06:53] ACKNOWLEDGEMENT - IPsec on cp1046 is CRITICAL: Strongswan CRITICAL - ok: 16 not-conn: cp3003_v4, cp3003_v6, cp3004_v4, cp3004_v6, cp3005_v4, cp3005_v6, cp3006_v4, cp3006_v6 Brandon Black esams link down [06:06:53] ACKNOWLEDGEMENT - IPsec on cp1047 is CRITICAL: Strongswan CRITICAL - ok: 16 not-conn: cp3003_v4, cp3003_v6, cp3004_v4, cp3004_v6, cp3005_v4, cp3005_v6, cp3006_v4, cp3006_v6 Brandon Black esams link down [06:06:53] ACKNOWLEDGEMENT - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 32 not-conn: cp3034_v4, cp3034_v6, cp3035_v4, cp3035_v6, cp3036_v4, cp3036_v6, cp3037_v4, cp3037_v6, cp3038_v4, cp3038_v6, 
cp3039_v4, cp3039_v6, cp3044_v4, cp3044_v6, cp3045_v4, cp3045_v6, cp3046_v4, cp3046_v6, cp3047_v4, cp3047_v6, cp3048_v4, cp3048_v6, cp3049_v4, cp3049_v6 Brandon Black esams link down [06:06:53] ACKNOWLEDGEMENT - IPsec on cp1049 is CRITICAL: Strongswan CRITICAL - ok: 32 not-conn: cp3034_v4, cp3034_v6, cp3035_v4, cp3035_v6, cp3036_v4, cp3036_v6, cp3037_v4, cp3037_v6, cp3038_v4, cp3038_v6, cp3039_v4, cp3039_v6, cp3044_v4, cp3044_v6, cp3045_v4, cp3045_v6, cp3046_v4, cp3046_v6, cp3047_v4, cp3047_v6, cp3048_v4, cp3048_v6, cp3049_v4, cp3049_v6 Brandon Black esams link down [06:08:06] RECOVERY - Host cp3010 is UP: PING WARNING - Packet loss = 80%, RTA = 86.18 ms [06:08:06] RECOVERY - Host cp3038 is UP: PING WARNING - Packet loss = 80%, RTA = 83.94 ms [06:08:06] RECOVERY - Host cp3034 is UP: PING OK - Packet loss = 16%, RTA = 84.17 ms [06:08:06] RECOVERY - Host cp3041 is UP: PING OK - Packet loss = 16%, RTA = 83.73 ms [06:08:06] RECOVERY - IPsec on cp1099 is OK: Strongswan OK - 56 ESP OK [06:08:11] RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 56 ESP OK [06:08:11] RECOVERY - Host cp3017 is UP: PING OK - Packet loss = 0%, RTA = 84.83 ms [06:08:12] RECOVERY - Host cp3033 is UP: PING OK - Packet loss = 0%, RTA = 83.17 ms [06:08:12] RECOVERY - Host cp3008 is UP: PING OK - Packet loss = 0%, RTA = 83.67 ms [06:08:12] RECOVERY - Host cp3003 is UP: PING OK - Packet loss = 0%, RTA = 84.31 ms [06:08:12] RECOVERY - Host cp3009 is UP: PING OK - Packet loss = 0%, RTA = 84.56 ms [06:08:26] heh [06:08:36] lol [06:09:12] RECOVERY - IPsec on cp1051 is OK: Strongswan OK - 24 ESP OK [06:09:12] RECOVERY - IPsec on cp1052 is OK: Strongswan OK - 44 ESP OK [06:09:13] RECOVERY - IPsec on cp1045 is OK: Strongswan OK - 24 ESP OK [06:09:13] RECOVERY - Host mr1-esams is UP: PING OK - Packet loss = 0%, RTA = 83.67 ms [06:09:14] RECOVERY - IPsec on cp2018 is OK: Strongswan OK - 36 ESP OK [06:09:14] RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 70 ESP OK [06:09:15] RECOVERY - IPsec on cp2001 is OK: Strongswan OK - 56 ESP OK [06:09:15] RECOVERY - IPsec on cp2023 is OK: Strongswan OK - 56 ESP OK [06:09:16] RECOVERY - Host 91.198.174.122 is UP: PING OK - Packet loss = 0%, RTA = 83.35 ms [06:09:18] RECOVERY - Host upload-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 86.70 ms [06:09:22] RECOVERY - Host text-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 83.96 ms [06:09:22] RECOVERY - IPsec on cp1064 is OK: Strongswan OK - 56 ESP OK [06:09:22] RECOVERY - IPsec on cp2024 is OK: Strongswan OK - 70 ESP OK [06:09:22] RECOVERY - Host 2620:0:862:1:91:198:174:106 is UP: PING OK - Packet loss = 0%, RTA = 84.42 ms [06:09:23] RECOVERY - Host wikidata is UP: PING OK - Packet loss = 0%, RTA = 83.17 ms [06:09:32] RECOVERY - Host asw-esams.mgmt.esams.wmnet is UP: PING OK - Packet loss = 0%, RTA = 89.42 ms [06:09:33] RECOVERY - IPsec on kafka1013 is OK: Strongswan OK - 148 ESP OK [06:09:33] RECOVERY - Host 91.198.174.106 is UP: PING OK - Packet loss = 0%, RTA = 83.60 ms [06:09:44] RECOVERY - Host csw2-esams.mgmt.esams.wmnet is UP: PING OK - Packet loss = 0%, RTA = 85.21 ms [06:09:44] RECOVERY - IPsec on kafka1022 is OK: Strongswan OK - 148 ESP OK [06:09:44] RECOVERY - IPsec on cp1046 is OK: Strongswan OK - 24 ESP OK [06:09:44] RECOVERY - IPsec on cp1063 is OK: Strongswan OK - 56 ESP OK [06:09:44] RECOVERY - IPsec on cp1049 is OK: Strongswan OK - 56 ESP OK [06:09:44] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 56 ESP OK [06:09:45] RECOVERY - IPsec on cp2009 is OK: Strongswan OK - 36 ESP OK 
[06:09:52] RECOVERY - IPsec on cp2002 is OK: Strongswan OK - 70 ESP OK [06:09:52] RECOVERY - IPsec on cp2026 is OK: Strongswan OK - 70 ESP OK [06:10:03] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 56 ESP OK [06:10:07] RECOVERY - Host misc-web-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 83.39 ms [06:10:07] RECOVERY - IPsec on cp1055 is OK: Strongswan OK - 44 ESP OK [06:10:07] RECOVERY - IPsec on cp1058 is OK: Strongswan OK - 24 ESP OK [06:10:07] RECOVERY - IPsec on cp1054 is OK: Strongswan OK - 44 ESP OK [06:10:07] RECOVERY - IPsec on cp1059 is OK: Strongswan OK - 24 ESP OK [06:10:07] RECOVERY - IPsec on cp2005 is OK: Strongswan OK - 70 ESP OK [06:10:08] RECOVERY - IPsec on cp2011 is OK: Strongswan OK - 70 ESP OK [06:10:08] RECOVERY - IPsec on cp2020 is OK: Strongswan OK - 70 ESP OK [06:10:09] RECOVERY - IPsec on cp1065 is OK: Strongswan OK - 44 ESP OK [06:10:09] RECOVERY - IPsec on cp2012 is OK: Strongswan OK - 36 ESP OK [06:10:10] RECOVERY - IPsec on cp2015 is OK: Strongswan OK - 36 ESP OK [06:10:10] RECOVERY - IPsec on cp2019 is OK: Strongswan OK - 56 ESP OK [06:10:12] RECOVERY - Host ns2-v4 is UP: PING OK - Packet loss = 0%, RTA = 84.42 ms [06:10:23] RECOVERY - IPsec on cp1050 is OK: Strongswan OK - 56 ESP OK [06:10:23] RECOVERY - IPsec on cp2004 is OK: Strongswan OK - 56 ESP OK [06:10:23] RECOVERY - IPsec on kafka1014 is OK: Strongswan OK - 148 ESP OK [06:10:24] <_joe_> Wikimedia Platform operations, serious stuff | Status: partial outage in Amsterdam DC | Log: https://bit.ly/wikitech | Channel logs: http://ur1.ca/edq22 | Ops Clinic Duty: _joe_ [06:10:32] RECOVERY - Host ns2-v6 is UP: PING OK - Packet loss = 0%, RTA = 83.67 ms [06:11:36] !log restarted gdnsd on eeden.esams (with new config, esams marked down) [06:11:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:12:52] PROBLEM - puppet last run on cp3032 is CRITICAL: CRITICAL: puppet fail [06:12:52] PROBLEM - puppet last run on cp3015 is CRITICAL: CRITICAL: puppet fail [06:12:52] PROBLEM - puppet last run on cp3019 is CRITICAL: CRITICAL: puppet fail [06:12:53] PROBLEM - puppet last run on cp3017 is CRITICAL: CRITICAL: puppet fail [06:12:53] PROBLEM - puppet last run on cp3046 is CRITICAL: CRITICAL: puppet fail [06:12:53] PROBLEM - puppet last run on eeden is CRITICAL: CRITICAL: Puppet has 3 failures [06:12:53] PROBLEM - puppet last run on ms-be3004 is CRITICAL: CRITICAL: puppet fail [06:13:02] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] [06:13:03] PROBLEM - puppet last run on cp3009 is CRITICAL: CRITICAL: puppet fail [06:13:03] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: puppet fail [06:13:03] PROBLEM - puppet last run on cp3013 is CRITICAL: CRITICAL: puppet fail [06:13:13] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: puppet fail [06:13:33] PROBLEM - puppet last run on cp3038 is CRITICAL: CRITICAL: puppet fail [06:13:33] PROBLEM - puppet last run on cp3041 is CRITICAL: CRITICAL: puppet fail [06:13:42] PROBLEM - puppet last run on cp3004 is CRITICAL: CRITICAL: puppet fail [06:13:43] PROBLEM - puppet last run on cp3036 is CRITICAL: CRITICAL: puppet fail [06:13:43] PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: puppet fail [06:13:43] PROBLEM - puppet last run on cp3043 is CRITICAL: CRITICAL: Puppet has 3 failures [06:13:43] PROBLEM - puppet last run on cp3049 is CRITICAL: CRITICAL: puppet fail [06:13:52] PROBLEM - puppet last run on ms-fe3001 is 
CRITICAL: CRITICAL: puppet fail [06:13:52] PROBLEM - puppet last run on cp3003 is CRITICAL: CRITICAL: puppet fail [06:13:52] PROBLEM - puppet last run on cp3031 is CRITICAL: CRITICAL: puppet fail [06:13:52] PROBLEM - puppet last run on cp3044 is CRITICAL: CRITICAL: Puppet has 3 failures [06:13:53] PROBLEM - puppet last run on cp3012 is CRITICAL: CRITICAL: puppet fail [06:14:03] PROBLEM - puppet last run on cp3045 is CRITICAL: CRITICAL: puppet fail [06:14:03] RECOVERY - Host mr1-esams IPv6 is UP: PING OK - Packet loss = 0%, RTA = 85.49 ms [06:14:03] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [06:14:14] PROBLEM - puppet last run on lvs3001 is CRITICAL: CRITICAL: Puppet has 3 failures [06:14:14] PROBLEM - puppet last run on maerlant is CRITICAL: CRITICAL: puppet fail [06:14:24] PROBLEM - puppet last run on cp3020 is CRITICAL: CRITICAL: Puppet has 3 failures [06:14:24] PROBLEM - puppet last run on cp3037 is CRITICAL: CRITICAL: puppet fail [06:14:25] the puppetfails are just stacked up from the outage, they'll recover on their own [06:14:32] <_joe_> yes [06:14:34] PROBLEM - puppet last run on cp3016 is CRITICAL: CRITICAL: puppet fail [06:14:42] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: puppet fail [06:14:42] PROBLEM - puppet last run on cp3039 is CRITICAL: CRITICAL: puppet fail [06:14:48] <_joe_> it's pretty obvious they happen given we don't have a local puppetmaster [06:15:00] <_joe_> (which, IMO, we should have) [06:15:22] PROBLEM - puppet last run on cp3033 is CRITICAL: CRITICAL: Puppet has 3 failures [06:15:34] PROBLEM - puppet last run on cp3005 is CRITICAL: CRITICAL: Puppet has 3 failures [06:15:52] PROBLEM - puppet last run on nescio is CRITICAL: CRITICAL: Puppet has 3 failures [06:19:22] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [06:19:23] RECOVERY - puppet last run on cp3009 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [06:20:23] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [06:21:13] RECOVERY - puppet last run on cp3046 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [06:21:32] RECOVERY - puppet last run on cp3013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:22:13] RECOVERY - puppet last run on cp3031 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:22:43] RECOVERY - puppet last run on maerlant is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [06:24:22] RECOVERY - puppet last run on ms-fe3001 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [06:24:32] RECOVERY - puppet last run on cp3012 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:24:42] RECOVERY - puppet last run on cp3045 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:25:03] RECOVERY - puppet last run on cp3016 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [06:25:04] RECOVERY - puppet last run on cp3039 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:25:32] RECOVERY - puppet last run on cp3019 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:26:53] RECOVERY - puppet last run on lvs3001 is OK: OK: Puppet is currently enabled, last run 1 minute ago 
with 0 failures [06:27:09] 06Operations, 10Ops-Access-Requests, 06Services: Expand sc-admins to provide sufficient coverage for sc* clusters - https://phabricator.wikimedia.org/T135548#2311616 (10Joe) a:05Joe>03None [06:27:44] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [06:28:24] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:29:44] RECOVERY - puppet last run on cp3017 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [06:30:34] RECOVERY - puppet last run on cp3036 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:30:43] RECOVERY - puppet last run on cp3003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:31:22] PROBLEM - puppet last run on mw1260 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:33] PROBLEM - puppet last run on nobelium is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:53] RECOVERY - puppet last run on cp3032 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:31:54] RECOVERY - puppet last run on eeden is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [06:32:44] RECOVERY - puppet last run on nescio is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [06:32:52] RECOVERY - puppet last run on cp3049 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [06:33:20] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] Add systemd support for the jessie build [debs/nutcracker] - 10https://gerrit.wikimedia.org/r/289603 (owner: 10Giuseppe Lavagetto) [06:34:01] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:02] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:53] RECOVERY - puppet last run on cp3015 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:35:12] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures [06:35:13] 06Operations, 10Beta-Cluster-Infrastructure, 06Labs, 10Traffic: deployment-cache-upload04 (m1.medium) / is almost full - https://phabricator.wikimedia.org/T135700#2311660 (10Joe) @hashar that was exactly my plan [06:36:12] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 1 failures [06:36:22] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 1 failures [06:36:23] RECOVERY - puppet last run on cp3037 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [06:36:52] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:37:31] RECOVERY - puppet last run on cp3038 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:38:22] RECOVERY - puppet last run on ms-be3004 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [06:38:23] RECOVERY - puppet last run on cp3020 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [06:38:32] RECOVERY - puppet last run on cp3004 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [06:39:02] RECOVERY - puppet last run on cp3041 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:39:03] RECOVERY - puppet last run on cp3005 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 
failures [06:39:11] RECOVERY - puppet last run on cp3043 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [06:40:01] RECOVERY - puppet last run on cp3044 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:40:42] RECOVERY - puppet last run on cp3033 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:56:12] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [06:57:12] RECOVERY - puppet last run on nobelium is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [06:57:22] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:32] RECOVERY - puppet last run on mw1260 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:52] RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:02] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:12] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:59:12] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:59:22] PROBLEM - puppet last run on lvs3001 is CRITICAL: CRITICAL: puppet fail [07:18:34] (03CR) 10Alexandros Kosiaris: [C: 031] "Liking this, it's exactly what we discussed yesterday." [puppet] - 10https://gerrit.wikimedia.org/r/289683 (https://phabricator.wikimedia.org/T135749) (owner: 10Giuseppe Lavagetto) [07:24:47] RECOVERY - puppet last run on lvs3001 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [07:27:33] (03PS1) 10Jcrespo: Reduce max table lock to identify metadata locks and abort [software] - 10https://gerrit.wikimedia.org/r/289820 (https://phabricator.wikimedia.org/T135809) [07:31:29] (03CR) 10Jcrespo: [C: 032] "I am going to be bold and break a host to test this." 
[software] - 10https://gerrit.wikimedia.org/r/289820 (https://phabricator.wikimedia.org/T135809) (owner: 10Jcrespo) [07:36:38] !log testing medata lock detectiom on db1069 [07:36:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:43:09] the test was successful [07:59:54] (03PS1) 10Jcrespo: Revert "mariadb: set is_critical to false for checks" [puppet] - 10https://gerrit.wikimedia.org/r/289821 [08:00:14] (03PS1) 10Jcrespo: Revert "mariadb: set replication check's contact_group to admins" [puppet] - 10https://gerrit.wikimedia.org/r/289822 [08:00:16] (03CR) 10jenkins-bot: [V: 04-1] Revert "mariadb: set is_critical to false for checks" [puppet] - 10https://gerrit.wikimedia.org/r/289821 (owner: 10Jcrespo) [08:00:28] (03PS2) 10Jcrespo: Revert "mariadb: set replication check's contact_group to admins" [puppet] - 10https://gerrit.wikimedia.org/r/289822 [08:02:14] (03PS3) 10Jcrespo: Revert "mariadb: set replication check's contact_group to admins" [puppet] - 10https://gerrit.wikimedia.org/r/289822 (https://phabricator.wikimedia.org/T112473) [08:02:31] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: set replication check's contact_group to admins" [puppet] - 10https://gerrit.wikimedia.org/r/289822 (https://phabricator.wikimedia.org/T112473) (owner: 10Jcrespo) [08:02:39] (03CR) 10Jcrespo: [V: 032] Revert "mariadb: set replication check's contact_group to admins" [puppet] - 10https://gerrit.wikimedia.org/r/289822 (https://phabricator.wikimedia.org/T112473) (owner: 10Jcrespo) [08:03:00] (03PS2) 10Jcrespo: Revert "mariadb: set is_critical to false for checks" [puppet] - 10https://gerrit.wikimedia.org/r/289821 (https://phabricator.wikimedia.org/T112473) [08:03:10] (03PS3) 10Jcrespo: Revert "mariadb: set is_critical to false for checks" [puppet] - 10https://gerrit.wikimedia.org/r/289821 (https://phabricator.wikimedia.org/T112473) [08:07:18] (03PS1) 10Muehlenhoff: Add a new backup set to backup openldap databases [puppet] - 10https://gerrit.wikimedia.org/r/289824 (https://phabricator.wikimedia.org/T120919) [08:08:04] !log mathoid deploying 243a530 [08:08:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:08:30] (03CR) 10jenkins-bot: [V: 04-1] Add a new backup set to backup openldap databases [puppet] - 10https://gerrit.wikimedia.org/r/289824 (https://phabricator.wikimedia.org/T120919) (owner: 10Muehlenhoff) [08:09:54] (03PS2) 10Muehlenhoff: Add a new backup set to backup openldap databases [puppet] - 10https://gerrit.wikimedia.org/r/289824 (https://phabricator.wikimedia.org/T120919) [08:10:55] 06Operations, 13Patch-For-Review: Add openldap/labs servers to backup - https://phabricator.wikimedia.org/T120919#2311788 (10MoritzMuehlenhoff) [08:10:57] 06Operations, 06Labs, 10Labs-Infrastructure: investigate slapd memory leak - https://phabricator.wikimedia.org/T130593#2311787 (10MoritzMuehlenhoff) [08:11:44] !log upgrading cassandra from 2.1.12 to 2.1.13 on aqs1002.eqiad.mwnet [08:11:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:12:40] (03PS1) 10Jcrespo: Increase retries to 10 to avoid small bumps to alert [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/289825 (https://phabricator.wikimedia.org/T112473) [08:13:17] (03CR) 10Jcrespo: [C: 032] Increase retries to 10 to avoid small bumps to alert [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/289825 (https://phabricator.wikimedia.org/T112473) (owner: 10Jcrespo) [08:14:39] (03PS4) 10Jcrespo: Revert "mariadb: set is_critical to false for 
checks" [puppet] - 10https://gerrit.wikimedia.org/r/289821 (https://phabricator.wikimedia.org/T112473) [08:16:43] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: set is_critical to false for checks" [puppet] - 10https://gerrit.wikimedia.org/r/289821 (https://phabricator.wikimedia.org/T112473) (owner: 10Jcrespo) [08:17:21] PROBLEM - cassandra CQL 10.64.32.175:9042 on aqs1002 is CRITICAL: Connection refused [08:17:35] this is me, should resolve in a bit [08:18:30] PROBLEM - Analytics Cassanda CQL query interface on aqs1002 is CRITICAL: Connection refused [08:23:00] RECOVERY - Analytics Cassanda CQL query interface on aqs1002 is OK: TCP OK - 0.004 second response time on port 9042 [08:23:18] gooood [08:23:22] one more to go [08:23:26] (in a bit) [08:24:00] RECOVERY - cassandra CQL 10.64.32.175:9042 on aqs1002 is OK: TCP OK - 0.000 second response time on port 9042 [08:50:35] !log upgrading cassandra from 2.1.12 to 2.1.13 on aqs1003.eqiad.mwnet [08:50:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:55:11] 06Operations, 07Graphite: investigate carbon-c-relay stalls/drops towards graphite2002 - https://phabricator.wikimedia.org/T135385#2311841 (10fgiunchedi) it looks like metric sending from `cassandra-metrics-collector` keep piling up as they get stalled and never time out the sending, myself and @Eevans have be... [08:57:19] PROBLEM - cassandra CQL 10.64.48.117:9042 on aqs1003 is CRITICAL: Connection refused [08:58:03] node is already up and running, will clear in a sec [08:59:29] RECOVERY - cassandra CQL 10.64.48.117:9042 on aqs1003 is OK: TCP OK - 0.002 second response time on port 9042 [09:00:19] !log altering db1040 commonswiki.categorylinks [09:00:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:10:12] PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 686 [09:11:15] 06Operations, 06Performance-Team, 10Thumbor: Package and backport Thumbor dependencies in Debian - https://phabricator.wikimedia.org/T134485#2311859 (10fgiunchedi) +1 to what @faidon said, thanks @Gilles indeed for all the effort you've put in this! I'll start with python-statsd as the lowest hanging fruit.... [09:20:12] RECOVERY - check_mysql on lutetium is OK: Uptime: 821598 Threads: 1 Questions: 15018862 Slow queries: 14258 Opens: 92966 Flush tables: 2 Open tables: 64 Queries per second avg: 18.280 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [09:20:44] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0] [09:21:23] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0] [09:22:03] PROBLEM - puppet last run on mw2200 is CRITICAL: CRITICAL: puppet fail [09:30:22] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [09:32:02] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [09:43:05] (03PS1) 10Gehel: Maps - make redis server configureable [puppet] - 10https://gerrit.wikimedia.org/r/289829 (https://phabricator.wikimedia.org/T134901) [09:45:54] 07Blocked-on-Operations, 06Operations, 10RESTBase, 10RESTBase-Cassandra, and 2 others: Finish conversion to multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#2311913 (10elukey) @Eevans aqs100[123] upgraded to 2.1.13 today! 
[09:50:23] RECOVERY - puppet last run on mw2200 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:53:37] (03PS1) 10Muehlenhoff: Allow CQL access for multi-instance AQS Cassandra setup [puppet] - 10https://gerrit.wikimedia.org/r/289830 [09:58:31] (03PS2) 10Muehlenhoff: Allow CQL access for multi-instance AQS Cassandra setup [puppet] - 10https://gerrit.wikimedia.org/r/289830 [10:01:17] 06Operations, 06Performance-Team, 13Patch-For-Review: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963#2311944 (10elukey) Reporting a conversation with dormando on the #memcached Freenode channel: https://phabricator.wikimedia.org/P3153 Comments are about the la... [10:06:58] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0] [10:08:25] (03CR) 10Gehel: [C: 032] Maps - make redis server configureable [puppet] - 10https://gerrit.wikimedia.org/r/289829 (https://phabricator.wikimedia.org/T134901) (owner: 10Gehel) [10:08:44] (03CR) 10Elukey: [C: 031] "http://puppet-compiler.wmflabs.org/2856" [puppet] - 10https://gerrit.wikimedia.org/r/289830 (owner: 10Muehlenhoff) [10:10:39] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0] [10:15:33] they are upload errors, cannot see a pattern [10:17:09] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [10:17:49] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [10:36:35] (03PS1) 10Jcrespo: Increase db1029 and db1033 weight back to normal after upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289838 (https://phabricator.wikimedia.org/T112079) [10:36:47] 06Operations, 07HHVM, 07User-notice: Switch HAT appservers to trusty's ICU (or newer) - https://phabricator.wikimedia.org/T86096#2312032 (10Joe) I am now building a package linked to libicu52 for trusty, as the preparation work seems to be done. [10:37:36] (03CR) 10Jcrespo: [C: 032] Increase db1029 and db1033 weight back to normal after upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289838 (https://phabricator.wikimedia.org/T112079) (owner: 10Jcrespo) [10:40:20] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Increase db1029 and db1033 weight back to normal after upgrade (duration: 01m 52s) [10:40:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:26:44] (03CR) 10Mobrovac: "Given that Cass instances do not use 9042 for inter-node communication and the fact that the plan is to retire aqqs100[123] soon(TM), I th" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/289830 (owner: 10Muehlenhoff) [11:27:05] (03PS1) 10Faidon Liambotis: Revert "Depool esams" [dns] - 10https://gerrit.wikimedia.org/r/289844 [11:28:26] (03CR) 10Faidon Liambotis: [C: 032] Revert "Depool esams" [dns] - 10https://gerrit.wikimedia.org/r/289844 (owner: 10Faidon Liambotis) [11:45:05] 06Operations, 06Performance-Team, 13Patch-For-Review: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963#2312240 (10elukey) Compared chunk size vs number of chunks for the hosts under testing to get a visual difference (I tried to combine the graphs but my spreads... 
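The growth-factor comparison on T129963 just above (1.15 on one test host, presumably against the 1.25 default on another; the 1.25 figure is an assumption) is easy to picture: memcached slab class chunk sizes grow geometrically by that factor. A rough model, not memcached's exact sizing, which also rounds classes to 8-byte alignment, derives the smallest chunk from -n plus per-item overhead, and caps the number of classes:

```
def slab_chunk_sizes(growth_factor, min_chunk=96, max_item=1024 * 1024):
    # Rough model of memcached slab classes: each class is the previous
    # one multiplied by the growth factor, up to the maximum item size.
    sizes = []
    size = float(min_chunk)
    while size < max_item:
        sizes.append(int(size))
        size *= growth_factor
    sizes.append(max_item)  # final class for the largest items
    return sizes

for factor in (1.15, 1.25):
    sizes = slab_chunk_sizes(factor)
    print("factor %.2f -> %d classes, first few: %s" % (factor, len(sizes), sizes[:6]))

# A smaller factor (1.15) gives more, finer-grained classes: less memory
# wasted per stored item, but fewer chunks per class to reuse -- the
# chunk-size vs number-of-chunks trade-off being compared above.
```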
[11:49:49] !log rolling restart of nginx in ulsfo to pick up expat update [11:49:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:52:07] !Updated cxserver to 4c5738c [11:54:39] RECOVERY - cassandra-b CQL 10.192.32.144:9042 on restbase2008 is OK: TCP OK - 0.037 second response time on port 9042 [12:02:44] !log rolling restart of nginx in esams to pick up expat update [12:02:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:13:51] 06Operations, 10DBA, 13Patch-For-Review: reimage or decom db servers on precise - https://phabricator.wikimedia.org/T125028#2312310 (10jcrespo) [12:13:53] 06Operations, 10DBA, 13Patch-For-Review: db1033 (old s7 master) needs backup and reimage - https://phabricator.wikimedia.org/T134555#2312309 (10jcrespo) 05Open>03Resolved [12:15:34] 06Operations, 10DBA, 13Patch-For-Review: Upgrade x1 cluster - https://phabricator.wikimedia.org/T112079#2312315 (10jcrespo) 05Open>03Resolved After the increase of weight of the slave, all x1 servers should be on jessie and a recent mariadb version. Only regular maintenance would be needed as usual. [12:16:27] !log restarting cassandra on aqs100[123] for Java upgrades [12:16:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:22:29] 06Operations, 10Traffic, 07HTTPS: Secure connection failed when attempting to send POST request - https://phabricator.wikimedia.org/T134869#2312346 (10Thibaut120094) OTRS members have the same issue with Firefox 46.0.1 on https://ticket.wikimedia.org/ https://lists.wikimedia.org/mailman/private/otrs-fr/2016... [12:22:49] !log rolling restart of nginx in codfw to pick up expat update [12:22:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:27:19] (03PS3) 10Muehlenhoff: Allow CQL access for multi-instance AQS Cassandra setup [puppet] - 10https://gerrit.wikimedia.org/r/289830 [12:30:00] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: puppet fail [12:31:38] !log rolling restart of nginx in eqiad to pick up expat update [12:31:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:34:48] 06Operations, 06Commons, 10MediaWiki-Page-deletion, 10media-storage, and 3 others: Unable to delete file pages on commons: MWException/LocalFileLockError: "Could not acquire lock" - https://phabricator.wikimedia.org/T132921#2312370 (10Steinsplitter) New error when deleting (related to this?): ``` API reque... [12:37:04] 06Operations, 10Traffic, 07Browser-Support-Firefox, 07HTTPS: Secure connection failed when attempting to send POST request - https://phabricator.wikimedia.org/T134869#2312371 (10Danny_B) Adding #browser-support-firefox as it seems to be its issue only ATM (no other browser verified yet - in such case, remo... [12:39:27] (03CR) 10Mobrovac: [C: 031] Allow CQL access for multi-instance AQS Cassandra setup [puppet] - 10https://gerrit.wikimedia.org/r/289830 (owner: 10Muehlenhoff) [12:51:07] 06Operations, 10Analytics, 06Performance-Team, 10Traffic: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2312406 (10phuedx) @BBlack: If Varnish is the part of the stack that this is to be done, have you taken a look at [libvmod-abtest](https://github.com/Destination/libvmod-abtest)... 
[12:51:44] 06Operations, 06Performance-Team, 10Thumbor: Package and backport Thumbor dependencies in Debian - https://phabricator.wikimedia.org/T134485#2312407 (10Gilles) [12:54:02] 06Operations, 06Performance-Team, 10Thumbor: Package and backport Thumbor dependencies in Debian - https://phabricator.wikimedia.org/T134485#2312414 (10Gilles) So far I'm not running into major issues packaging any of the dependencies with their own tests running during the build. The install dependencies a... [12:54:54] (03CR) 10Elukey: [C: 031] "https://puppet-compiler.wmflabs.org/2857/ - Marko docet" [puppet] - 10https://gerrit.wikimedia.org/r/289830 (owner: 10Muehlenhoff) [12:58:12] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:05:50] !log freeing up space on db1038 by defragmenting its tables [13:05:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:06:34] 06Operations, 06Discovery, 10Maps, 03Discovery-Maps-Sprint, 13Patch-For-Review: Install / configure new maps servers in codfw - https://phabricator.wikimedia.org/T134901#2312419 (10Gehel) Log of osm2pgsql run: {F4034592} Import was done according to [[ https://wikitech.wikimedia.org/wiki/Maps#Importing_... [13:07:38] !log Performing acupuncture on cr2-codfw:ae4.2020 (Lowered VRRP priority from 100 to 50, inet/inet6) [13:07:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:12:08] ahahahahahahah [13:14:26] !log Lowering VRRP priority to 50 on all VRRP groups on cr2-codfw to drain FPC0 [13:14:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:15:45] 06Operations, 10ops-eqiad, 06Analytics-Kanban, 13Patch-For-Review: rack/setup/deploy aqs100[456] - https://phabricator.wikimedia.org/T133785#2312454 (10elukey) 05Open>03Resolved [13:24:55] !log Disabling OSPF on all cr2-codfw row subnets to drain FPC0 [13:25:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:31:05] (03PS3) 10Rush: labstore1003 define scratch share [puppet] - 10https://gerrit.wikimedia.org/r/289774 [13:31:21] !log Disabling cr2-codfw et-0/* interfaces [13:31:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:35:10] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 84, down: 4, dormant: 0, excluded: 0, unused: 0BRae1: down - Core: asw-a-codfw:ae2BRae2: down - Core: asw-b-codfw:ae2BRae3: down - Core: asw-c-codfw:ae2BRae4: down - Core: asw-d-codfw:ae2BR [13:35:37] !log changing dbstore1001 to be a direct slave of db1075 [13:35:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:37:00] !log Offlining cr2-codfw FPC 0 [13:37:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:37:48] !log upgraded java on xenon/praseodymium/cerium and restbase2001 to latest openjdk-8 release (along with restarts of Cassandra) [13:37:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:38:03] sleep... 
[13:38:43] !log Bringing cr2-codfw FPC 0 back up [13:38:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:41:49] db1056 lag is flapping (in a lagging- depooled- up to date cycle) [13:41:59] but schema change on s4 finished already [13:45:01] bd808: so it looks like the message changes from https://gerrit.wikimedia.org/r/289802 did not get automagically deployed… filters on e.g. https://pl.wikipedia.org/wiki/Specjalna:Linkujące/A are still broken (no links are generated) [13:45:48] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 [13:45:55] !log Enabled cr2-codfw et-0/* interfaces, reenabling OSPF/OSPF3 [13:46:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:46:55] Dereckson: ^^^^ [13:47:24] Dereckson: looking at https://wikitech.wikimedia.org/wiki/Server_Admin_Log , i don't see any evidence that LocalisationUpdate actually ran yesterday [13:47:57] apparently the last time it happened was on 2016-05-18? " 02:59 logmsgbot: mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.2) (duration: 11m 15s)" [13:48:18] disk is ok [13:50:40] ostriches: hi. Yesterday, we synced a Revert "Convert Special:WhatLinksHere from XML form to OOUI form" change, which broke that special page. As the change repurposed messages, we've merged a l10n change in master too, but it hasn't been picked up by the l10n update task. The non-English messages aren't consistent with this version. We don't have SWAT windows available today. Could I at 8:00 SF do a scap to deploy a cherry pick of https://gerrit.wikimedia.org/r/289802 to wmf.2? [13:51:18] MatmaRex: if I've a green light, I can backport it to wmf/1.28.0-wmf.2 and sync [13:53:07] lots of "Title::invalidateCache" [13:53:59] probably someone doing it from the api [13:55:18] (03PS4) 10Rush: labstore1003 define scratch share [puppet] - 10https://gerrit.wikimedia.org/r/289774 [13:55:35] (03PS5) 10Rush: labstore1003 define scratch share [puppet] - 10https://gerrit.wikimedia.org/r/289774 [13:56:43] 06Operations, 10Analytics, 06Performance-Team, 10Traffic: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2312587 (10BBlack) >>! In T135762#2312406, @phuedx wrote: > @BBlack: If Varnish is the part of the stack that this is to be done, have you taken a look at [libvmod-abtest](https... [14:00:20] 07Blocked-on-Operations, 06Operations, 10RESTBase, 10RESTBase-Cassandra, and 2 others: Finish conversion to multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#2312610 (10fgiunchedi) >>! In T95253#2306337, @Eevans wrote: > Cassandra has now been downgraded to 2.1.13 on... [14:01:33] (03PS2) 10Giuseppe Lavagetto: cassandra: pin cassandra version [puppet] - 10https://gerrit.wikimedia.org/r/289683 (https://phabricator.wikimedia.org/T135749) [14:01:58] <_joe_> elukey, urandom, gehel ^^ [14:02:52] <_joe_> actually, I think some of the things I put there are redundant [14:03:29] (03PS3) 10Giuseppe Lavagetto: cassandra: pin cassandra version [puppet] - 10https://gerrit.wikimedia.org/r/289683 (https://phabricator.wikimedia.org/T135749) [14:03:48] <_joe_> please review :) [14:03:59] ack!
[14:04:23] <_joe_> !log removing libicu48 from trusty archives, kept a copy of the packages in my homedir on carbon [14:04:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:06:31] (03CR) 10Gehel: [C: 031] cassandra: pin cassandra version [puppet] - 10https://gerrit.wikimedia.org/r/289683 (https://phabricator.wikimedia.org/T135749) (owner: 10Giuseppe Lavagetto) [14:06:38] 06Operations, 10ops-codfw, 13Patch-For-Review: rack/setup/deploy restbase200[7-9] - https://phabricator.wikimedia.org/T132976#2312644 (10fgiunchedi) all instances have been bootstrapped, left to do: * deploy restbase on restbase200[789] if not already * add restbase200[789] to conftool and pool them in lvs [14:08:12] (03CR) 10Gehel: cassandra: pin cassandra version (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/289683 (https://phabricator.wikimedia.org/T135749) (owner: 10Giuseppe Lavagetto) [14:12:07] (03CR) 10Giuseppe Lavagetto: cassandra: pin cassandra version (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/289683 (https://phabricator.wikimedia.org/T135749) (owner: 10Giuseppe Lavagetto) [14:20:14] (03CR) 10Elukey: [C: 031] "https://puppet-compiler.wmflabs.org/2859/" [puppet] - 10https://gerrit.wikimedia.org/r/289683 (https://phabricator.wikimedia.org/T135749) (owner: 10Giuseppe Lavagetto) [14:22:31] !log cassandra downgraded on maps2*.codfw.wmnet [14:22:34] _joe_: ^ [14:22:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:23:16] <_joe_> gehel: thanks :) [14:23:50] <_joe_> ok merging this [14:26:01] (03CR) 10Giuseppe Lavagetto: [C: 032] cassandra: pin cassandra version [puppet] - 10https://gerrit.wikimedia.org/r/289683 (https://phabricator.wikimedia.org/T135749) (owner: 10Giuseppe Lavagetto) [14:30:10] 06Operations, 06Performance-Team, 13Patch-For-Review: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963#2312715 (10elukey) Finally the Snapshots: mc1007 - growth factor 1.15 - memcached 1.4.21 [[ https://phabricator.wikimedia.org/P3129 | mc1007_stats_1463649014... [14:36:19] jynus, do we do anything in production to preserve the MySQL AUTO_INCREMENT counter when the database/database server restarts? See https://phabricator.wikimedia.org/T122262#2310857 . [14:37:40] (03PS1) 10Gergő Tisza: Enable AuthManager in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289866 (https://phabricator.wikimedia.org/T135498) [14:37:48] matt_flaschen, no [14:38:00] matt_flaschen, does it affect you? [14:39:33] jynus, in production, probably not, unless we're amazingly unlucky. [14:39:39] your problem is deleting rows, who in full sanity does delete rows or move them around? [14:40:00] the whole "archival" of rows is broken [14:40:09] and should never be done [14:40:37] deleting rows is inefficient if you can just mark them as deleted [14:41:36] is that your model, similar to archive? [14:41:39] jynus, Flow doesn't specifically (we mark stuff as deleted), but as you know core does for page and revision. [14:41:48] I know [14:41:57] and that is a very broken behaviour [14:42:05] that causes slaves to desync [14:42:18] INSERT...SELECT is the worst thing ever [14:42:28] Also, I didn't realize before yesterday the same auto_increment key could be handed out twice in certain scenarios. [14:42:31] That's why I asked.
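To spell out the restart scenario under discussion: in MySQL/MariaDB versions from before the counter was made persistent, InnoDB keeps AUTO_INCREMENT only in memory and re-derives it as MAX(id)+1 at startup, so deleting the newest rows and restarting can hand the same id out twice. A minimal sketch against a scratch local instance, assuming the pymysql client; both the instance and the client choice are assumptions, nothing here is from production.

```
import pymysql  # assumption: any MySQL client library behaves the same here

conn = pymysql.connect(host="127.0.0.1", user="root", password="", database="test")
cur = conn.cursor()

cur.execute("CREATE TABLE IF NOT EXISTS t "
            "(id INT AUTO_INCREMENT PRIMARY KEY, v TEXT) ENGINE=InnoDB")
cur.execute("INSERT INTO t (v) VALUES ('a'), ('b'), ('c')")  # ids 1, 2, 3
cur.execute("DELETE FROM t WHERE id = 3")                    # drop the newest row
conn.commit()

# ... restart mysqld here: InnoDB re-derives the counter as MAX(id) + 1 = 3 ...

cur.execute("INSERT INTO t (v) VALUES ('d')")
cur.execute("SELECT LAST_INSERT_ID()")
print(cur.fetchone())  # (3,) after a restart, (4,) without one -- the id reuse being discussed
conn.commit()
```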
[14:42:55] yeah, a master failover [14:43:59] I mean, we could workaround it [14:44:13] but I would prefer to fix the code to not depend on it [14:45:38] I agree, but I don't have time to fix that in core right now. I'm not asking you to work on it in the operations side either (since I think it's super-unlikely to cause problems in practice), I was just wondering. [14:45:45] I will file a bug just to track it, though. [14:46:08] matt_flaschen, sure, I wasn't suggesting that you did :-) [14:46:27] but it is something that worries me [14:47:01] and can break replication, too [14:48:29] 06Operations, 10cassandra: Downgrade Cassandra on apt.wikimedia.org to 2.1.13 - https://phabricator.wikimedia.org/T135673#2312806 (10Joe) [14:50:32] !log shutting down kartotherian on maps-test2001 (accidental data deletion) [14:50:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:51:59] PROBLEM - cassandra service on maps-test2001 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed [14:52:28] PROBLEM - cassandra CQL 10.192.0.128:9042 on maps-test2001 is CRITICAL: Connection refused [14:54:15] MatmaRex: crud. Checked the logs and found failures. I opened T135849 to investigate [14:54:15] T135849: l10nupdate failing due to sudo rights - https://phabricator.wikimedia.org/T135849 [14:54:50] thcipriani: ^ do you have time to figure out how to fix l10nupdate? [14:55:09] * thcipriani looks [14:55:40] It looks like at least the sudo grant is messed up [14:56:11] also I thought I had fixed it at some point so it would !log on failure but apparently not [14:56:11] bd808: yup. I can take a look. [14:56:34] <3 thcipriani. Shout if you need help [14:56:55] will do (really most likely will do :)) [14:58:17] (03PS5) 1020after4: keyholder key cleanup [puppet] - 10https://gerrit.wikimedia.org/r/289236 (https://phabricator.wikimedia.org/T132747) [14:58:26] 06Operations, 10ops-codfw, 10cassandra, 13Patch-For-Review: rack/setup/deploy restbase200[7-9] - https://phabricator.wikimedia.org/T132976#2312843 (10Eevans) [14:58:31] ah, crap. I think I know what's happening. [14:59:42] urandom: I'm going to roll-restart cmcd to test the theory behind T135385 and see if that makes it recover [14:59:42] T135385: investigate carbon-c-relay stalls/drops towards graphite2002 - https://phabricator.wikimedia.org/T135385 [15:00:00] godog: go for it [15:01:31] (03PS1) 10Alex Monk: Attempt to fix dynamicproxy-api service [puppet] - 10https://gerrit.wikimedia.org/r/289870 [15:02:42] !log uploaded librsvg 2.40.5-1+deb8u2+wmf1 for jessie-wikimedia to carbon (rebase of locally patched package on top of latest security update) [15:02:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:02:53] (03CR) 10Alex Monk: "Yuvi, is proxy-eqiad.wmflabs.org working perhaps because of an unpuppetised version of that file?" 
[puppet] - 10https://gerrit.wikimedia.org/r/289870 (owner: 10Alex Monk) [15:03:05] !log roll-restart cassandra-metrics-collector in codfw for T135385 [15:03:07] T135385: investigate carbon-c-relay stalls/drops towards graphite2002 - https://phabricator.wikimedia.org/T135385 [15:03:11] RECOVERY - cassandra service on maps-test2001 is OK: OK - cassandra is active [15:03:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:05:43] !log starting cluster rejoining for cassandra onmaps-test2001 [15:05:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:05:57] 06Operations, 10MediaWiki-Database: Preserve InnoDB table auto_increment on restart - https://phabricator.wikimedia.org/T135851#2312863 (10Mattflaschen-WMF) [15:06:18] 06Operations, 10MediaWiki-Database: Preserve InnoDB table auto_increment on restart - https://phabricator.wikimedia.org/T135851#2312876 (10Mattflaschen-WMF) p:05Triage>03Low [15:07:31] 06Operations, 10MediaWiki-Database: Preserve InnoDB table auto_increment on restart - https://phabricator.wikimedia.org/T135851#2312863 (10Mattflaschen-WMF) [15:10:40] bd808: :/ [15:10:55] (03PS6) 1020after4: keyholder key cleanup [puppet] - 10https://gerrit.wikimedia.org/r/289236 (https://phabricator.wikimedia.org/T132747) [15:11:32] poor l10nupdate gets broken for weeks at a time too often. [15:11:55] a sign that either a) we don't care or b) the weekly train makes it less useful than it once was [15:12:01] (03PS1) 10Andrew Bogott: Remove labvirt1003 from the scheduler pool [puppet] - 10https://gerrit.wikimedia.org/r/289871 (https://phabricator.wikimedia.org/T135850) [15:12:03] (03PS1) 10Andrew Bogott: Nova: Decrease disk_allocation_ratio [puppet] - 10https://gerrit.wikimedia.org/r/289872 [15:12:05] (03CR) 1020after4: keyholder key cleanup (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/289236 (https://phabricator.wikimedia.org/T132747) (owner: 1020after4) [15:13:20] bd808: if you and thciprian.i manage to fix it today, it'd be good to run it. that's probably a nicer thing to do than backport and deploy the patch on friday. :) [15:13:31] (03PS7) 1020after4: keyholder key cleanup [puppet] - 10https://gerrit.wikimedia.org/r/289236 (https://phabricator.wikimedia.org/T132747) [15:13:48] (03CR) 10Andrew Bogott: [C: 032] Remove labvirt1003 from the scheduler pool [puppet] - 10https://gerrit.wikimedia.org/r/289871 (https://phabricator.wikimedia.org/T135850) (owner: 10Andrew Bogott) [15:14:27] 06Operations, 10MediaWiki-Database: Preserve InnoDB table auto_increment on restart - https://phabricator.wikimedia.org/T135851#2312900 (10Mattflaschen-WMF) [15:16:07] !log roll-restart cassandra-metrics-collector in eqiad for T135385 [15:16:08] T135385: investigate carbon-c-relay stalls/drops towards graphite2002 - https://phabricator.wikimedia.org/T135385 [15:16:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:19:43] (03PS1) 10Thcipriani: Fix suoders permissions for l10nupdate [puppet] - 10https://gerrit.wikimedia.org/r/289874 [15:19:55] ^ bd808 ought to fix it. 
[15:21:15] (03PS2) 10BryanDavis: Fix suoders permissions for l10nupdate [puppet] - 10https://gerrit.wikimedia.org/r/289874 (https://phabricator.wikimedia.org/T135849) (owner: 10Thcipriani) [15:21:30] (03CR) 10BryanDavis: [C: 031] Fix suoders permissions for l10nupdate [puppet] - 10https://gerrit.wikimedia.org/r/289874 (https://phabricator.wikimedia.org/T135849) (owner: 10Thcipriani) [15:22:30] <_joe_> bd808: need me to take a look? [15:22:40] _joe_: that would be awesome [15:22:59] (03CR) 10Giuseppe Lavagetto: [C: 032] Fix suoders permissions for l10nupdate [puppet] - 10https://gerrit.wikimedia.org/r/289874 (https://phabricator.wikimedia.org/T135849) (owner: 10Thcipriani) [15:23:02] I was just trying to decide if it was too late in your day for a ping :) [15:23:34] If you could force a run of that on tin it would be greatly appreciated [15:23:42] <_joe_> it will take ~ half an hour to complete. [15:23:49] *nod* [15:23:54] It seems https://integration.wikimedia.org/zuul/ is backed up [15:24:04] because https://integration.wikimedia.org/ci/job/mediawiki-extensions-qunit/43601/console has frozen [15:24:13] Once completed, is there a way to manually trigger the l10nupdate task? [15:24:39] Dereckson: yeah, any deployer can run the script on tin [15:24:45] Could someone abort https://integration.wikimedia.org/ci/job/mediawiki-extensions-qunit/43601/console please. [15:25:09] <_joe_> paladox: why? [15:25:09] paladox: probably a better conversation for #wikimedia-releng [15:25:19] <_joe_> yeah, that too [15:25:19] _joe_ because it is frozen [15:25:25] godog: verdict? [15:25:37] Ok [15:26:04] 06Operations, 10Analytics, 06Performance-Team, 10Traffic: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2312919 (10Jdlrobson) This all sounds great and I love that its generic and can be reused again! A few clarifications - if I'm understanding correctly experiments would be conf... [15:27:53] 06Operations, 10MediaWiki-Database: Preserve InnoDB table auto_increment on restart - https://phabricator.wikimedia.org/T135851#2312924 (10jcrespo) -1 disagreeing with the solution. This can happen also on master failover-which your solution will not protect against. The right fix is not a complex server-side... [15:28:16] 06Operations, 10DBA, 10MediaWiki-Database: Preserve InnoDB table auto_increment on restart - https://phabricator.wikimedia.org/T135851#2312929 (10jcrespo) [15:32:07] urandom: yeah it looks like stall/drops decrease but likely the pause between stop/start isn't long enough to fully drain the queue, was worth a try! [15:33:02] 06Operations, 10Analytics, 06Performance-Team, 10Traffic: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2312939 (10dr0ptp4kt) Quick question: how does this guarantee bucketing across browser restart? [15:34:09] 06Operations, 07HHVM, 07User-notice: Switch HAT appservers to trusty's ICU (or newer) - https://phabricator.wikimedia.org/T86096#2312962 (10Joe) >>! In T86096#2186298, @Joe wrote: > Since we are at the point where there are no precise machines left running php, we should really build HHVM with libicu52 and m... [15:34:20] 06Operations, 10Analytics, 06Performance-Team, 10Traffic: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2312967 (10Nuria) Non session cookies are kept after browser restarts, with an expiration set of 30 days (like last access cookie) the cookie is available. 
[15:35:40] 06Operations, 10Analytics, 06Performance-Team, 10Traffic: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2312971 (10dr0ptp4kt) I should note in the concrete cases: persistence across browser restart is probably not as important for lazy loaded images, whereas persistence across bro... [15:36:13] godog: seems like it would take a stall to have them start piling up in the first place, though [15:36:41] 06Operations, 10Analytics, 06Performance-Team, 10Traffic: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2312972 (10dr0ptp4kt) @nuria, should we add that to the Description as acceptance criteria? [15:37:10] godog: granted, that will on exacerbate matters, but it sounds like there is still something going on carbon-side [15:38:27] ACKNOWLEDGEMENT - cassandra CQL 10.192.0.128:9042 on maps-test2001 is CRITICAL: Connection refused Gehel data import in progress [15:38:33] s/will on/will only/ [15:38:42] urandom: yeah I'm not 100% convinced either there isn't something else going on, the periodicity is oddly precise [15:39:10] 06Operations, 10Analytics, 06Performance-Team, 10Traffic: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2312984 (10BBlack) >>! In T135762#2312919, @Jdlrobson wrote: > A few clarifications - if I'm understanding correctly experiments would be configured in puppet? That would be th... [15:44:19] 06Operations, 10Analytics, 06Performance-Team, 10Traffic: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2312986 (10BBlack) >>! In T135762#2312939, @dr0ptp4kt wrote: > Quick question: how does this guarantee bucketing across browser restart? >>! In T135762#2312967, @Nuria wrote: >... [15:51:50] 06Operations, 10Analytics, 06Performance-Team, 10Traffic: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2313004 (10Nuria) >Since the binning is done independently of actual experiments (the binning is live all the time for all cookie-enabled agents), this actually is a problem, I... [15:52:54] !log performing schema change on s5 T130692 [15:52:55] T130692: Add new indexes from eec016ece6d2b30addcdf3d3efcc2ba59b10e858 to production databases - https://phabricator.wikimedia.org/T130692 [15:53:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:09:23] (03PS2) 10Andrew Bogott: Nova: Decrease disk_allocation_ratio [puppet] - 10https://gerrit.wikimedia.org/r/289872 [16:11:28] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.23) (duration: 20990m 58s) [16:11:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:11:36] ^ lol [16:11:43] (03CR) 10Andrew Bogott: [C: 032] Nova: Decrease disk_allocation_ratio [puppet] - 10https://gerrit.wikimedia.org/r/289872 (owner: 10Andrew Bogott) [16:11:57] that was a hung ssh for a l10nupdate that I jsut killed on tin [16:12:56] 06Operations, 10Analytics, 06Performance-Team, 10Traffic: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2313016 (10BBlack) Right, but if the user deletes cookies or goes incognito, that's probably a rare event for most, and possibly associated with not re-using browser cache acros... 
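For the cookie-based binning being discussed on T135762: a rough sketch of deterministic bucketing from a long-lived random cookie token, using the 1-100 bucket count that comes up just below, with an experiment claiming a contiguous bucket range. This is illustrative Python, not the proposed Varnish/VCL implementation; the cookie token and the ranges are hypothetical.

```
import hashlib

NUM_BUCKETS = 100  # the 1-100 granularity debated on the task; illustrative only

def bucket_for(token):
    # Map a per-browser random token (e.g. from a 30-day cookie) to a stable bucket.
    digest = hashlib.sha256(token.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_BUCKETS + 1  # 1..100

def in_experiment(token, first_bucket, last_bucket):
    # An experiment claims a contiguous bucket range, independent of the binning itself.
    return first_bucket <= bucket_for(token) <= last_bucket

# e.g. an experiment taking 2% of traffic could claim buckets 1-2; with only
# 100 buckets the smallest possible slice is 1%, which is the bucket-count
# concern raised just below.
token = "2b7e151628aed2a6"  # hypothetical cookie value
print(bucket_for(token), in_experiment(token, 1, 2))
```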
[16:14:31] bd808: that's a lot of hours ago [16:14:43] 1.27.0-wmf.23 is a clue :) [16:15:33] 06Operations, 10Ops-Access-Requests, 10Analytics, 10ContentTranslation-Analytics, 10MediaWiki-extensions-ContentTranslation: Add kartik to analytics-privatedata-users group - https://phabricator.wikimedia.org/T135704#2307853 (10madhuvishy) Noting that analytics-privatedata-users also gives Hadoop access... [16:15:39] 06Operations, 10Analytics, 06Performance-Team, 10Traffic: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2313020 (10Krinkle) I think 1-100 might be a bit small. Especially considering our scale and considering most of our experiments will not have been load tested very much. For i... [16:18:36] !log kicking off manual l10nupdate run on tin [16:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:20:36] (03PS3) 10Nuria: Cloning analytics.wikimedia.org repo [puppet] - 10https://gerrit.wikimedia.org/r/289676 (https://phabricator.wikimedia.org/T134506) [16:22:02] PROBLEM - nova-compute process on labvirt1010 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-compute [16:22:13] (03CR) 10jenkins-bot: [V: 04-1] Cloning analytics.wikimedia.org repo [puppet] - 10https://gerrit.wikimedia.org/r/289676 (https://phabricator.wikimedia.org/T134506) (owner: 10Nuria) [16:27:03] (03CR) 10Ottomata: Cloning analytics.wikimedia.org repo (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/289676 (https://phabricator.wikimedia.org/T134506) (owner: 10Nuria) [16:27:24] 06Operations, 10Analytics, 06Performance-Team, 10Traffic: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2313080 (10BBlack) >>! In T135762#2313020, @Krinkle wrote: > [...] > An experiment could start at 1 bucket (0.01%) and work its way up to 10 (0.1%). And if the experiment no lon... [16:31:14] 06Operations, 10Analytics, 06Performance-Team, 10Traffic: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2313083 (10Nuria) >For comparison, our entire Navigation Timing data used to be based on 0.01% sampling. It is now tuned up to 0.1% (1:1000 sample; >$wgNavigationTimingSamplingF... [16:32:32] PROBLEM - puppet last run on holmium is CRITICAL: CRITICAL: Puppet has 1 failures [16:38:29] 06Operations, 10Analytics, 06Performance-Team, 10Traffic: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2313093 (10BBlack) >>! In T135762#2313083, @Nuria wrote:. > I know you know this but just clarifying that we do not have these restrictions here though, the restrictions come f... [16:38:44] (03PS4) 10Nuria: Cloning analytics.wikimedia.org repo [puppet] - 10https://gerrit.wikimedia.org/r/289676 (https://phabricator.wikimedia.org/T134506) [16:40:47] !log bd808@tin scap sync-l10n completed (1.28.0-wmf.2) (duration: 10m 06s) [16:40:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:41:14] Dereckson: ^ do you know which messages to check? [16:41:43] * Dereckson is looking. [16:41:46] (03CR) 10Thcipriani: "Looking good, moves in the direction I think we'd like to go." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/289236 (https://phabricator.wikimedia.org/T132747) (owner: 1020after4) [16:42:05] Danny_B: ping? [16:42:30] Danny_B: does https://cs.wiktionary.org/wiki/Speci%C3%A1ln%C3%AD:Co_odkazuje_na?target=Modul%3AQuote%2Ftools&namespace=10&title=Speci%C3%A1ln%C3%AD%3ACo_odkazuje_na look good to you? 
[16:45:29] there are at least links there (so the $1 is back in the message) [16:45:34] https://pl.wikipedia.org/wiki/Specjalna:Linkuj%C4%85ce/A has them too [16:46:25] bd808: https://gerrit.wikimedia.org/r/#/c/289802/ is well put into consideration, yes [16:46:51] RECOVERY - nova-compute process on labvirt1010 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [16:46:51] (03PS5) 10Nuria: Cloning analytics.wikimedia.org repo [puppet] - 10https://gerrit.wikimedia.org/r/289676 (https://phabricator.wikimedia.org/T134506) [16:49:41] PROBLEM - Getent speed check on labstore1001 is CRITICAL: CRITICAL: getent group tools.admin failed [16:50:26] !log l10nupdate@tin ResourceLoader cache refresh completed at Fri May 20 16:50:25 UTC 2016 (duration 9m 38s) [16:50:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:53:33] Dereckson: yes [16:54:01] (03PS8) 1020after4: keyholder key cleanup [puppet] - 10https://gerrit.wikimedia.org/r/289236 (https://phabricator.wikimedia.org/T132747) [16:54:18] hello! can anyone here tell me how client connections and varnish return codes are reported to grafana? https://grafana.wikimedia.org/dashboard/db/client-connections [16:54:47] !log Cleaned up /tmp/mw-cache-1.27.0-wmf.2* cache files on tin [16:54:53] and https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes [16:54:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:55:08] are those reported directly from varnish into statsd? [16:56:32] RECOVERY - puppet last run on holmium is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [16:57:18] nuria_: those and several similar stats flow from varnishd -> VSM -> (various python scripts) -> statsd [16:57:52] (well, in the client-connections and tls-ciphers cases, from nginx -> varnishd -> ...) [16:57:59] bblack: aham .. 'vsm' stands for ...? [16:58:16] shared memory log, which varnishd writes to and the various daemons read from [16:58:55] the relatively-new https://grafana-admin.wikimedia.org/dashboard/db/varnish-caching comes from that sort of pipeline, too [17:00:37] bblack: i see, as far as i can see there are no varnish stats per endpoint as in 'restbase apis' versus 'english wikipedia', correct? [17:00:45] (03CR) 10jenkins-bot: [V: 04-1] Cloning analytics.wikimedia.org repo [puppet] - 10https://gerrit.wikimedia.org/r/289676 (https://phabricator.wikimedia.org/T134506) (owner: 10Nuria) [17:00:59] does anyone know about exim mail routers. specifically, in something like "templates/exim/exim4.conf.mx.erb:route_list = * magnesium.wikimedia.org byname", i know that can be a list of hosts and not just one, but can we add it in a way that doesn't just make it a backup or randomizes but makes it always route it to BOTH at the same time? i have looked at docs but it's all about trying one from the list until one works and then stopping it s [17:01:04] nuria_: no, we don't log that sort of detail via these sorts of pipelines [17:01:24] nuria_: we do have some per-backend stats that can filter RB-vs-MW and such, but that's on the backside of the caches, not the front.
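To make the last hop of that pipeline concrete (the per-host python daemons pushing what they read from the shared memory log into statsd): the statsd plaintext protocol is just "name:value|type" sent over UDP. A minimal sketch; the statsd address and metric names here are illustrative assumptions, not the production configuration.

```
import socket

# Assumption: address and metric names are illustrative, not the real setup.
STATSD = ("statsd.eqiad.wmnet", 8125)
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def send_counter(name, value=1):
    # statsd plaintext counter: "name:value|c", fire-and-forget over UDP
    sock.sendto(("%s:%d|c" % (name, value)).encode("ascii"), STATSD)

# e.g. a daemon tailing the varnish shared memory log would bump a counter
# for each client response it classifies:
send_counter("varnish.clients.status.5xx")
```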
[17:01:38] (03PS6) 10Nuria: Cloning analytics.wikimedia.org repo [puppet] - 10https://gerrit.wikimedia.org/r/289676 (https://phabricator.wikimedia.org/T134506) [17:01:43] nuria_: and obviously, upload.wm.o and maps.wm.o have distinct clusters from the primary wikis, so they're inherently separated [17:01:54] bblack: got it thank you [17:02:00] if that cut off, somebody said there is an "unseen" router option that makes it possible to continue with a second router as if the first never happened or so? [17:05:46] !log uploaded librsvg 2.40.2-1+wm2 for trusty-wikimedia to carbon (backported patches from librsvg DSA to our custom trusty build) [17:05:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:06:50] 06Operations: rebase librsvg security fixes - https://phabricator.wikimedia.org/T135804#2313125 (10MoritzMuehlenhoff) 05Open>03Resolved Uploaded librsvg 2.40.2-1+wm2 for trusty-wikimedia to carbon (backported patches from librsvg DSA to our custom trusty build) Uploaded librsvg 2.40.5-1+deb8u2+wmf1 for jessi... [17:09:19] RECOVERY - cassandra CQL 10.192.0.128:9042 on maps-test2001 is OK: TCP OK - 0.037 second response time on port 9042 [17:21:59] (03PS3) 10Dzahn: exim: route mail for RT to ununpentium [puppet] - 10https://gerrit.wikimedia.org/r/288721 (https://phabricator.wikimedia.org/T119112) [17:23:56] (03PS4) 10Dzahn: exim: route mail for RT to ununpentium [puppet] - 10https://gerrit.wikimedia.org/r/288721 (https://phabricator.wikimedia.org/T119112) [17:27:58] (03PS1) 10BryanDavis: l10nupdate: Stop using deprecated refreshCdbJsonFiles script [puppet] - 10https://gerrit.wikimedia.org/r/289886 [17:30:02] (03CR) 10BryanDavis: "I looked and didn't see any explicit sudo grants related to refreshCdbJsonFiles that would also need to be changed. 
The current script wil" [puppet] - 10https://gerrit.wikimedia.org/r/289886 (owner: 10BryanDavis) [17:32:10] (03PS5) 10Dzahn: exim: route mail for RT to ununpentium [puppet] - 10https://gerrit.wikimedia.org/r/288721 (https://phabricator.wikimedia.org/T119112) [17:44:27] (03PS7) 10Ottomata: Cloning analytics.wikimedia.org repo [puppet] - 10https://gerrit.wikimedia.org/r/289676 (https://phabricator.wikimedia.org/T134506) (owner: 10Nuria) [17:48:45] (03CR) 10Rush: [C: 032] labstore1003 define scratch share [puppet] - 10https://gerrit.wikimedia.org/r/289774 (owner: 10Rush) [17:49:28] (03PS8) 10Ottomata: Cloning analytics.wikimedia.org repo [puppet] - 10https://gerrit.wikimedia.org/r/289676 (https://phabricator.wikimedia.org/T134506) (owner: 10Nuria) [17:49:35] (03CR) 10Ottomata: [C: 032 V: 032] Cloning analytics.wikimedia.org repo [puppet] - 10https://gerrit.wikimedia.org/r/289676 (https://phabricator.wikimedia.org/T134506) (owner: 10Nuria) [17:50:51] (03PS3) 10Dzahn: install_server: split out reprepro role [puppet] - 10https://gerrit.wikimedia.org/r/284763 (https://phabricator.wikimedia.org/T132757) [17:51:24] (03CR) 10Dzahn: install_server: split out reprepro role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/284763 (https://phabricator.wikimedia.org/T132757) (owner: 10Dzahn) [17:53:34] (03CR) 10Dzahn: "yes, doing that and changing the module name, just for the role class, if it stays "role foo::bar" instead of "role foo" it can be moved t" [puppet] - 10https://gerrit.wikimedia.org/r/284763 (https://phabricator.wikimedia.org/T132757) (owner: 10Dzahn) [17:59:39] (03PS9) 10Rush: labstore nfs introduce nfs_mount defined type [puppet] - 10https://gerrit.wikimedia.org/r/289727 [18:03:56] (03PS4) 10Dzahn: install_server: split out reprepro to module aptrepo [puppet] - 10https://gerrit.wikimedia.org/r/284763 (https://phabricator.wikimedia.org/T132757) [18:05:38] PROBLEM - Check correctness of the icinga configuration on neon is CRITICAL: Icinga configuration contains errors [18:07:22] (03PS5) 10Dzahn: install_server: split out reprepro to module aptrepo [puppet] - 10https://gerrit.wikimedia.org/r/284763 (https://phabricator.wikimedia.org/T132757) [18:07:47] re: icinga-wm: it cant find host labstore2004 [18:07:55] i would say normal if that is a fresh install, otherwise not [18:08:00] runs puppet again [18:08:08] (03PS10) 10Rush: labstore nfs introduce nfs_mount defined type [puppet] - 10https://gerrit.wikimedia.org/r/289727 [18:09:28] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [18:11:00] (03CR) 10Rush: [C: 032] labstore nfs introduce nfs_mount defined type [puppet] - 10https://gerrit.wikimedia.org/r/289727 (owner: 10Rush) [18:11:58] mutante: that host was installed ...a month or so ago [18:12:03] not sure what the deal is but it's at least not new [18:13:32] chasemp: hmm, ok. i'm looking for changes [18:15:49] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. 
[18:29:14] (03PS1) 10Rush: labstore200[1-4] add standard in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/289891 [18:30:14] mutante: labstore2* were getting standard from a role which is not ideal as role changes can then cause this, I had added to eqiad things to compensate but didn't realize fileserver was already on labstore2 [18:30:37] it shouldn't be (the same role on both, that is) but that's for another day [18:31:31] agreed I'll deal w/ it later [18:31:43] it's all badly arranged [18:32:38] chasemp: gotcha! yea, this all makes sense then. standard adds to icinga. it removed them from icinga. that doesnt happen in a single run.. thats why errors.. [18:32:57] thanks, yep [18:33:57] (03CR) 10Rush: [C: 032] labstore200[1-4] add standard in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/289891 (owner: 10Rush) [18:35:49] (03PS1) 10Andrew Bogott: Don't include labsdb hosts in the 'labs' group [puppet] - 10https://gerrit.wikimedia.org/r/289892 [18:39:42] (03CR) 10Dzahn: [C: 031] ":) yea, that explains what i saw on neon! thanks" [puppet] - 10https://gerrit.wikimedia.org/r/289892 (owner: 10Andrew Bogott) [18:40:29] (03CR) 10Andrew Bogott: [C: 032] Don't include labsdb hosts in the 'labs' group [puppet] - 10https://gerrit.wikimedia.org/r/289892 (owner: 10Andrew Bogott) [18:54:54] (03PS1) 10Rush: labstore::fileserver remove use_ldap option [puppet] - 10https://gerrit.wikimedia.org/r/289898 (https://phabricator.wikimedia.org/T126083) [18:55:38] (03PS2) 10Rush: labstore::fileserver remove use_ldap option [puppet] - 10https://gerrit.wikimedia.org/r/289898 (https://phabricator.wikimedia.org/T126083) [19:03:47] (03PS6) 10Dzahn: install_server: split out reprepro to module aptrepo [puppet] - 10https://gerrit.wikimedia.org/r/284763 (https://phabricator.wikimedia.org/T132757) [19:11:34] (03PS7) 10Dzahn: install_server: split out reprepro to module aptrepo [puppet] - 10https://gerrit.wikimedia.org/r/284763 (https://phabricator.wikimedia.org/T132757) [19:12:42] RECOVERY - Check correctness of the icinga configuration on neon is OK: Icinga configuration is correct [19:13:22] PROBLEM - wikidata.org dispatch lag is higher than 300s on wikidata is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1676 bytes in 0.217 second response time [19:14:30] (03CR) 10Rush: [C: 032] labstore::fileserver remove use_ldap option [puppet] - 10https://gerrit.wikimedia.org/r/289898 (https://phabricator.wikimedia.org/T126083) (owner: 10Rush) [19:19:23] RECOVERY - wikidata.org dispatch lag is higher than 300s on wikidata is OK: HTTP OK: HTTP/1.1 200 OK - 1670 bytes in 0.205 second response time [19:22:17] (03PS1) 10Rush: labs nfs scratch and dumps to soft mode [puppet] - 10https://gerrit.wikimedia.org/r/289903 (https://phabricator.wikimedia.org/T126083) [19:22:32] (03PS2) 10Rush: labs nfs scratch and dumps to soft mode [puppet] - 10https://gerrit.wikimedia.org/r/289903 (https://phabricator.wikimedia.org/T126083) [19:30:46] jenkins on vaca? [19:33:59] (03CR) 10Rush: [C: 032 V: 032] labs nfs scratch and dumps to soft mode [puppet] - 10https://gerrit.wikimedia.org/r/289903 (https://phabricator.wikimedia.org/T126083) (owner: 10Rush) [19:46:57] PROBLEM - puppet last run on cp2021 is CRITICAL: CRITICAL: Puppet has 1 failures [19:50:44] 06Operations, 10DBA, 10MediaWiki-Database: Preserve InnoDB table auto_increment on restart - https://phabricator.wikimedia.org/T135851#2313704 (10Mattflaschen-WMF) Yeah, in principle I support fixing it in core as described. 
Your proposed solution is similar to RevisionDelete (which is already in core), but... [19:52:31] 06Operations, 10DBA, 10MediaWiki-Database: Preserve InnoDB table auto_increment on restart - https://phabricator.wikimedia.org/T135851#2313718 (10Mattflaschen-WMF) [19:52:45] 06Operations, 10DBA, 10MediaWiki-Database: Preserve InnoDB table auto_increment on restart - https://phabricator.wikimedia.org/T135851#2312863 (10Mattflaschen-WMF) [19:59:17] (03CR) 10Thcipriani: [C: 031] "Good catch—thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/289886 (owner: 10BryanDavis) [20:11:36] RECOVERY - puppet last run on cp2021 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [20:15:35] 06Operations, 06Labs, 10Labs-Infrastructure: Some labs instances IP have multiple PTR entries in DNS - https://phabricator.wikimedia.org/T115194#2313733 (10scfc) [20:36:02] !log restart rabbitmq on labcontrol1001 [20:36:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:58:57] (03PS9) 1020after4: keyholder key cleanup [puppet] - 10https://gerrit.wikimedia.org/r/289236 (https://phabricator.wikimedia.org/T132747) [21:00:16] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet has 1 failures [21:05:16] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet has 1 failures [21:10:16] RECOVERY - check_puppetrun on boron is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [21:17:59] (03PS1) 10Eevans: Update collector version (both branches) [software/cassandra-metrics-collector] - 10https://gerrit.wikimedia.org/r/289963 [21:24:57] (03PS1) 10Rush: wip: labstore cleanup and role vs module arrange [puppet] - 10https://gerrit.wikimedia.org/r/289964 [21:28:44] (03PS1) 10Eevans: Updated cassandra-metrics-collector version(s) [puppet] - 10https://gerrit.wikimedia.org/r/289965 [21:31:23] (03PS8) 10Dzahn: install_server: split out reprepro to module aptrepo [puppet] - 10https://gerrit.wikimedia.org/r/284763 (https://phabricator.wikimedia.org/T132757) [21:34:13] (03PS2) 10Rush: wip: labstore cleanup and role vs module arrange [puppet] - 10https://gerrit.wikimedia.org/r/289964 [22:25:23] 06Operations, 10Traffic, 07HTTPS, 05MW-1.27-release-notes, 13Patch-For-Review: Insecure POST traffic - https://phabricator.wikimedia.org/T105794#2314347 (10bd808) @Steinsplitter reported to me on irc that > for protocol relative urls in mwclient, scheme='https' must be set in the config to enable https o... [22:51:31] 06Operations, 10Mail: administrative rights for GLAM@ - https://phabricator.wikimedia.org/T135874#2314374 (10Krenair) I believe ops have a cron set up on the mail servers to send a copy of the (private) @wikimedia.org aliases file to officeit@wikimedia.org every week. [23:02:12] 06Operations, 10Mail: administrative rights for GLAM@ - https://phabricator.wikimedia.org/T135874#2314378 (10eliza) Hmm.....yes - I actually checked the log and an alias glam@wikimedia.org does not exist. What's curious is that I sent a test to glam@wikimedia.org and did not receive a bounce back or reply. I... [23:02:54] 06Operations, 10Mail: administrative rights for GLAM@ - https://phabricator.wikimedia.org/T135874#2313706 (10Dzahn) **glam@wikimedia.org** does not appear in the alias file controlled by ops. the mail server tells me it's controlled by OTRS [mx1001:~] $ sudo exim4 -bt glam@wikimedia.org glam@wikimedia.org... 
[23:04:13] 06Operations, 10Mail: administrative rights for GLAM@ - https://phabricator.wikimedia.org/T135874#2314383 (10Dzahn) @eliza mail to glam@ goes into https://meta.wikimedia.org/wiki/OTRS [23:05:09] (03PS1) 10BryanDavis: toollabs: Replace trusty PHP5 session cleanup script [puppet] - 10https://gerrit.wikimedia.org/r/289973 (https://phabricator.wikimedia.org/T135861) [23:18:19] bd808: why aren't you using the 'cron' resource directly? [23:18:40] ori: ... good question [23:19:10] there is an existing file that php5-common installs I'd need to ensure absent [23:19:23] or I can just overwrite like this [23:21:19] apt will prompt you about a config file conflict every time you upgrade php5 [23:22:11] I think doing it this way is fine, but I think you should add a comment to the Puppet manifest saying you are overwriting a file that ships with the package [23:22:22] *nod* [23:22:53] shellcheck has some recommendations too: https://dpaste.de/MHkJ/raw [23:23:14] I realize this came from upstream, so up to you if you want to apply them in our version, not apply them at all, or submit them upstream, or whatever [23:23:20] happy to +2 [23:23:58] I think the "-- SC2016: Expressions don't expand in single quotes, use double quotes for that." is bogus [23:24:11] it is seeing the '$k => $v' and assuming you wanted those to be expanded by the shell [23:24:48] the rest seem legit [23:24:51] "Prefer [ p ] && [ q ]" is a bash-ism [23:25:06] isn't it? [23:25:18] [[ ]] is a bashism, I don't think && is [23:25:33] Maybe I'm thinking of [[ p && q ]] [23:25:42] that's definitely a bashism yeah [23:26:08] "fixing" Debian's shell seems nit picky :) [23:26:31] yeah maybe it's best to just leave it alone [23:26:34] (03PS2) 10BryanDavis: toollabs: Replace trusty PHP5 session cleanup script [puppet] - 10https://gerrit.wikimedia.org/r/289973 (https://phabricator.wikimedia.org/T135861) [23:26:42] comment added [23:28:11] "lsof check used in the Ubuntu stock script can hang indefinitely" -- in Hebrew "ein sof" (or nsof) means "without end" or forever. [23:28:18] funny coincidence [23:28:39] or deep conspiracy? ;) [23:28:48] (03CR) 10Ori.livneh: [C: 032] toollabs: Replace trusty PHP5 session cleanup script [puppet] - 10https://gerrit.wikimedia.org/r/289973 (https://phabricator.wikimedia.org/T135861) (owner: 10BryanDavis) [23:29:21] merged [23:29:44] cool. I'll for a couple of puppet runs [23:29:49] *force [23:34:34] (03PS9) 10Dzahn: install_server: split out reprepro to module aptrepo [puppet] - 10https://gerrit.wikimedia.org/r/284763 (https://phabricator.wikimedia.org/T132757) [23:35:42] (03CR) 10jenkins-bot: [V: 04-1] install_server: split out reprepro to module aptrepo [puppet] - 10https://gerrit.wikimedia.org/r/284763 (https://phabricator.wikimedia.org/T132757) (owner: 10Dzahn) [23:41:17] (03CR) 10Dzahn: [C: 04-1] "http://puppet-compiler.wmflabs.org/2866/carbon.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/284763 (https://phabricator.wikimedia.org/T132757) (owner: 10Dzahn) [23:47:19] (03PS10) 10Dzahn: install_server: split out reprepro to module aptrepo [puppet] - 10https://gerrit.wikimedia.org/r/284763 (https://phabricator.wikimedia.org/T132757) [23:56:18] (03PS11) 10Dzahn: install_server: split out reprepro to module aptrepo [puppet] - 10https://gerrit.wikimedia.org/r/284763 (https://phabricator.wikimedia.org/T132757)