[00:00:04] twentyafterfour: Respected human, time to deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160915T0000). Please do the needful. [00:00:07] (03CR) 10Paladox: [C: 031] contint: don't use ensure 'latest' with php packages [puppet] - 10https://gerrit.wikimedia.org/r/310704 (https://phabricator.wikimedia.org/T115348) (owner: 10Dzahn) [00:01:15] 06Operations, 10ops-eqiad: Broken disk on copper - https://phabricator.wikimedia.org/T144261#2593788 (10faidon) @Cmjohnson Ping? What's the status of this? [00:06:18] (03Draft1) 10Paladox: ldap: Use require_package instead of package latest [puppet] - 10https://gerrit.wikimedia.org/r/310706 (https://phabricator.wikimedia.org/T115348) [00:07:42] (03PS2) 10Paladox: ldap: Use require_package instead of package latest [puppet] - 10https://gerrit.wikimedia.org/r/310706 (https://phabricator.wikimedia.org/T115348) [00:08:01] 06Operations, 13Patch-For-Review: Audit uses of package=>latest - https://phabricator.wikimedia.org/T115348#2638985 (10Paladox) [00:10:29] (03Draft1) 10Paladox: contint: Use require_package instead of package latest [puppet] - 10https://gerrit.wikimedia.org/r/310708 (https://phabricator.wikimedia.org/T115348) [00:14:57] (03PS2) 10Paladox: contint: Use require_package instead of package latest [puppet] - 10https://gerrit.wikimedia.org/r/310708 (https://phabricator.wikimedia.org/T115348) [00:17:41] RECOVERY - puppet last run on mw2079 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [00:19:05] (03Draft1) 10Paladox: gridengine: use present instead of latest in package [puppet] - 10https://gerrit.wikimedia.org/r/310710 (https://phabricator.wikimedia.org/T115348) [00:19:55] (03PS2) 10Paladox: gridengine: use present instead of latest in package [puppet] - 10https://gerrit.wikimedia.org/r/310710 (https://phabricator.wikimedia.org/T115348) [00:24:09] (03Draft1) 10Paladox: requesttracker: Use require_package instead of package latest 
[puppet] - 10https://gerrit.wikimedia.org/r/310711 (https://phabricator.wikimedia.org/T115348) [00:25:53] (03Abandoned) 10Paladox: requesttracker: Use require_package instead of package latest [puppet] - 10https://gerrit.wikimedia.org/r/310711 (https://phabricator.wikimedia.org/T115348) (owner: 10Paladox) [00:27:23] (03Draft1) 10Paladox: mysql_wmf: Use require_package instead of package latest [puppet] - 10https://gerrit.wikimedia.org/r/310712 (https://phabricator.wikimedia.org/T115348) [00:28:38] (03PS2) 10Paladox: mysql_wmf: Use require_package instead of package latest [puppet] - 10https://gerrit.wikimedia.org/r/310712 (https://phabricator.wikimedia.org/T115348) [00:29:07] (03CR) 10Dzahn: [C: 031] mysql_wmf: Use require_package instead of package latest [puppet] - 10https://gerrit.wikimedia.org/r/310712 (https://phabricator.wikimedia.org/T115348) (owner: 10Paladox) [00:43:59] (03Draft1) 10Paladox: toollabs: Remove unneeded puppet-lint ignore rules [puppet] - 10https://gerrit.wikimedia.org/r/310715 [00:44:01] (03Draft2) 10Paladox: toollabs: Remove unneeded puppet-lint ignore rules [puppet] - 10https://gerrit.wikimedia.org/r/310715 [00:59:40] (03PS1) 10Dzahn: apache: fix 42 x 'class not documented', add doc links [puppet] - 10https://gerrit.wikimedia.org/r/310717 (https://phabricator.wikimedia.org/T127797) [01:07:12] PROBLEM - https://phabricator.wikimedia.org on iridium is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string Wikimedia and MediaWiki not found on https://phabricator.wikimedia.org:443https://phabricator.wikimedia.org/ - 27465 bytes in 0.196 second response time [01:08:01] uhm, [01:08:07] I didn't do it [01:08:21] hmm [01:08:53] takes a look at the actual check command on icinga server [01:09:14] phabricator isn't having any real issue that I can see [01:10:15] twentyafterfour: "Wikimedia and MediaWiki" used to be on the frontpage but now it's not? 
[01:11:23] twentyafterfour: if i change it to just check if "Wikimedia" it's OK again [01:11:38] footer text changed? [01:11:50] it was supposed to be something that (almost) never changes [01:13:40] weird, it's still there "Phabricator is a collaboration platform open to all Wikimedia and MediaWiki contributors." [01:13:43] PROBLEM - https://phabricator.wikimedia.org on phab2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string Wikimedia and MediaWiki not found on https://phabricator.wikimedia.org:443https://phabricator.wikimedia.org/ - 27465 bytes in 0.231 second response time [01:15:06] mutante: and why did it just happen now, I don't think anything changed - I didn't actually deploy an update this week [01:15:12] it does find the string "happens" [01:15:18] it does not find "MediaWiki" [01:15:19] what the .. [01:16:12] RECOVERY - https://phabricator.wikimedia.org on phab2001 is OK: HTTP OK: HTTP/1.1 200 OK - 27401 bytes in 0.238 second response time [01:16:16] lol, what [01:16:23] huh? [01:16:27] weird [01:16:30] ok wtf? [01:16:52] /usr/lib/nagios/plugins/check_http -S -H 'phabricator.wikimedia.org' -I misc-web-lb.wikimedia.org -u 'https://phabricator.wikimedia.org/' -s 'collaboration platform' [01:16:55] HTTP OK: HTTP/1.1 200 OK [01:17:04] (03PS1) 10Andrew Bogott: Puppet Panel: Actually populate the prefix panel with prefixes. [puppet] - 10https://gerrit.wikimedia.org/r/310718 (https://phabricator.wikimedia.org/T91990) [01:17:19] (03PS1) 10Thcipriani: [WIP] Beta: Clean puppetmaster cherry-picks [puppet] - 10https://gerrit.wikimedia.org/r/310719 (https://phabricator.wikimedia.org/T135427) [01:17:22] /usr/lib/nagios/plugins/check_http -S -H 'phabricator.wikimedia.org' -I misc-web-lb.wikimedia.org -u 'https://phabricator.wikimedia.org/' -s 'MediaWiki' [01:17:25] HTTP CRITICAL: HTTP/1.1 200 OK - string 'MediaWiki' not found [01:17:28] and now .. 
it is back to normal [01:18:24] /usr/lib/nagios/plugins/check_http -S -H 'phabricator.wikimedia.org' -I misc-web-lb.wikimedia.org -u 'https://phabricator.wikimedia.org/' -s 'Wikimedia and MediaWiki' [01:18:27] HTTP CRITICAL: HTTP/1.1 200 OK - string 'Wikimedia and MediaWiki' not found [01:18:34] or not [01:20:17] if it was a random issue with the icinga host it would not happen on both iridium and phab2001 and nothing else.. weeeeird [01:20:21] (03PS2) 10Thcipriani: [WIP] Beta: Clean puppetmaster cherry-picks [puppet] - 10https://gerrit.wikimedia.org/r/310719 (https://phabricator.wikimedia.org/T135427) [01:20:49] why is it even running on phab2001? [01:21:07] because it is in the role and the role gets applied on both hosts [01:21:20] I didn't know we had phab2001's httpd configured [01:21:29] and I thought we disabled the checks [01:21:45] 1302 node /^(iridium\.eqiad|phab2001\.codfw)\.wmnet$/ { [01:21:49] ^ everything is the same [01:22:02] so that it has httpd is not a surprise [01:22:04] curl https://phabricator.wikimedia.org -s | grep "Wikimedia and MediaWiki" works for me every time [01:22:38] this has been running for a long time like this .. uhm... [01:22:53] yeah and no code changes happened at all today [01:23:32] PROBLEM - https://phabricator.wikimedia.org on phab2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string Wikimedia and MediaWiki not found on https://phabricator.wikimedia.org:443https://phabricator.wikimedia.org/ - 27465 bytes in 0.213 second response time [01:23:44] ok, so now we are flapping too [01:23:54] you see that url there? [01:23:59] https://phabricator.wikimedia.org:443https://phabricator.wikimedia.org/ [01:24:01] yea, but i already checked that [01:24:04] (03CR) 10Krinkle: [C: 031] Set some database logging groups to log [mediawiki-config] - 10https://gerrit.wikimedia.org/r/310159 (owner: 10Aaron Schulz) [01:24:05] it's not that [01:24:13] the command i pasted above is what it runs [01:24:16] why is the irc alert saying that? 
[01:24:46] maybe there is a duplicate alert set up in icinga? [01:24:53] and you are looking at the good, working one? [01:25:07] but the failing one has a broken url? [01:25:40] that's the only thing I can think of really [01:25:47] * twentyafterfour opens icinga [01:25:51] nope, exactly 2 checks [01:25:54] in the config file [01:26:02] and "443" doesnt even appear in the entire file [01:27:08] the checkcommands.cfg: command_line $USER1$/check_http -S -H 'phabricator.wikimedia.org' -I misc-web-lb.wikimedia.org -u 'https://phabricator.wikimedia.org/' -s 'Wikimedia and MediaWiki' [01:27:15] that is the checkcommand [01:27:24] no port in there, and also unchanged [01:28:25] it works everytime if i just check for "Wikimedia" by itself [01:28:34] and leave everything else the same [01:29:14] WEIRD [01:29:37] just check for wikimedia then? that doesn't make sense to me. Nor does the messed up url in the log message [01:29:43] it never works for "MediaWiki" by itself [01:29:52] wtf [01:29:53] but i see that word in my browser [01:29:57] yeah me too [01:30:13] does it get logged in ? :P [01:30:15] who stole our grep and replaced it with bizaro-grep? [01:30:39] uhm I don't know [01:30:54] lol, "focus on bug" also works [01:31:08] maybe i should use that :) [01:31:37] anything really, as long as it can tell the difference between a stack trace and the phabricator home page [01:32:13] so any of the custom wording on that page, since that stuff comes from the database and requires a working php environment to render it [01:32:57] yea, but such a mystery.. 
this has been working for months [01:33:08] tries to find something in access logs [01:33:11] I can't help but think it has something to do with the :443 part that's getting logged [01:33:41] that's gotta be coming from somewhere - I've seen these alerts before and I don't think they were broken before [01:34:36] https://phabricator.wikimedia.org/?__path__=%2f [01:34:41] that is in the access log [01:34:59] where does the ?_path_ part come from [01:35:05] but anyways, i see the string there too [01:35:34] no I'm wrong, an old alert in my email has the messed up url too [01:35:50] i think that is just a separate bug in the output of check_http [01:36:08] Date/Time: Sun Aug 9 06:08:33 UTC 2015 .... string Wikimedia and MediaWiki not found on https://phabricator.wikimedia.org:443https://phabricator.wikimedia.org/ [01:36:18] it's also always 200 [01:36:22] yeah unrelated [01:36:33] it's not _always_ 200 ;) [01:36:41] HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - string Wikimedia and MediaWiki not found on https://phabricator.wikimedia.org:443https://phabricator.wikimedia.org/ - 1764 bytes in 0.876 second response time [01:36:43] Love, Icinga [01:36:45] it's when you specify hostname and URL and then it concatenates them [01:37:04] hmm ok, and this time we didnt see a single 500 [01:37:11] just not finding the content .. [01:38:23] twentyafterfour: it gets better, "ediaWiki" yes, "MediaWiki" no [01:38:24] I can't think of any reason for it. I mean, maybe the checker script isn't waiting long enough, only downloads partial content and the partial content is cut off before the critical part we are grepping for? [01:38:35] wtf [01:38:52] and it's not even april 1st [01:39:00] -s 'ediaWiki' [01:39:01] HTTP OK: HTTP/1.1 200 OK [01:39:12] -s 'MediaWiki' [01:39:13] HTTP CRITICAL: HTTP/1.1 200 OK [01:40:57] repeated that 10 times (!) always the same [01:41:28] something something about a cached version it gets? [01:41:35] but still doesnt make sense [01:41:41] uhm... 
what is the checker script using to download the page content? [01:41:52] the page shouldn't be cached really, other than php bytecode cache [01:42:26] i was thinking of misc-web [01:42:38] it asks misc-web-lb [01:42:49] if the page is chunked transfer encoding then maybe the chunk boundary is lining up with the M in MediaWiki [01:43:08] it uses check_http which is a binary file [01:43:23] misc-web-lb shouldn't cache unless phabricator says it's a cacheable page though right? I thought it was just being a reverse proxy for https reasons... [01:43:25] comes from standard nagios plugins, used a hundred times [01:43:57] let me check that [01:44:09] Expires: Sat, 01 Jan 2000 00:00:00 GMT [01:44:19] cache control: no-store [01:45:00] x-cache-status: pass [01:47:48] i was looking for "pass" in cache/misc.pp , yea, it changed format though [01:48:20] i'll change the string now.. just to proof how weird it is [01:52:38] (03PS1) 10Dzahn: icinga/phab: adjust checkcommand for phabricator http check [puppet] - 10https://gerrit.wikimedia.org/r/310721 [01:54:56] (03CR) 10Dzahn: [C: 032] "changing string to "focus on bug", because that's what we do :p" [puppet] - 10https://gerrit.wikimedia.org/r/310721 (owner: 10Dzahn) [01:56:40] runs puppet on neon.. [01:58:27] (03CR) 10Dzahn: [C: 032] "yep, this must have been fixed meanwhile. the Verified + 2 from jenkins is proof the ignore's are not needed anymore" [puppet] - 10https://gerrit.wikimedia.org/r/310715 (owner: 10Paladox) [01:58:38] (03PS3) 10Dzahn: toollabs: Remove unneeded puppet-lint ignore rules [puppet] - 10https://gerrit.wikimedia.org/r/310715 (owner: 10Paladox) [01:59:27] check command changed on Icinga server,, expecting recoveries any minute [02:02:02] reschedules next service check [02:04:07] !log ms-be1022 - down per icinga, but also mgmt is not reachable [02:04:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:04:14] icinga, come on ... 
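The chunk-boundary hypothesis raised at 01:42:49 can be illustrated with a minimal Python sketch. It assumes the response body uses HTTP/1.1 chunked transfer coding and that the checker searches the raw, undecoded bytes; neither assumption is confirmed in the log. The helper names `encode_chunked` and `decode_chunked`, the sample page text, and the chosen chunk size are illustrative only and are not part of check_http.

```python
def encode_chunked(body: bytes, chunk_size: int) -> bytes:
    """Encode body with HTTP/1.1 chunked transfer coding."""
    out = bytearray()
    for i in range(0, len(body), chunk_size):
        chunk = body[i:i + chunk_size]
        out += b"%x\r\n" % len(chunk)  # chunk size line, in hexadecimal
        out += chunk + b"\r\n"         # chunk data followed by CRLF
    out += b"0\r\n\r\n"                # zero-size last chunk, no trailers
    return bytes(out)


def decode_chunked(raw: bytes) -> bytes:
    """Reassemble the original body from a chunked-encoded byte string."""
    out = bytearray()
    pos = 0
    while True:
        eol = raw.index(b"\r\n", pos)
        # The size may be followed by ";extension", which is ignored here.
        size = int(raw[pos:eol].split(b";")[0], 16)
        if size == 0:
            break
        start = eol + 2
        out += raw[start:start + size]
        pos = start + size + 2         # skip the data and its trailing CRLF
    return bytes(out)


# Hypothetical page fragment, taken from the footer text quoted in the log.
page = (b"Phabricator is a collaboration platform open to all "
        b"Wikimedia and MediaWiki contributors.")

# Assumption: pick a chunk size so the boundary falls right after the "M".
split = page.index(b"MediaWiki") + 1
raw = encode_chunked(page, split)

for needle in (b"Wikimedia", b"MediaWiki", b"ediaWiki"):
    print(needle.decode(), "| raw:", needle in raw,
          "| decoded:", needle in decode_chunked(raw))
```

With the boundary placed right after the "M", a search of the raw bytes finds "Wikimedia" and "ediaWiki" but not "MediaWiki", while the decoded body contains all three. Note that chunk sizes in this encoding are written in hexadecimal.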
[02:04:31] it's been 5 minutes, wtf [02:05:45] lets it reload config [02:08:10] RECOVERY - https://phabricator.wikimedia.org on phab2001 is OK: HTTP OK: HTTP/1.1 200 OK - 27465 bytes in 0.209 second response time [02:08:17] ACKNOWLEDGEMENT - Host ms-be1022 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T140597 [02:08:33] :-/ [02:08:40] RECOVERY - https://phabricator.wikimedia.org on iridium is OK: HTTP OK: HTTP/1.1 200 OK - 27465 bytes in 0.189 second response time [02:08:50] well, lol [02:09:04] i got a text too [02:09:24] changed nothing, just the string is "focus on bug" now [02:09:34] which somehow fits [02:13:17] yeah no doubt [02:13:22] twentyafterfour: omg, i see something [02:13:27] ? [02:13:27] check_http has -vvv [02:13:35] and then it shows me the content it gets [02:13:46] and the line with "MediaWiki contributors" it looks like this: [02:13:52]
Phabricator is a collaboration platform open to all Wikimedia and M [02:13:56] 1000 [02:13:58] ediaWiki contributors. We focus on bug [02:14:10] M1000ediaWiki ??? [02:14:10] uh [02:14:22] invisible unicode char? [02:14:25] I don't know [02:14:43] it does not look like this when i looked at source in browser [02:14:47] like I said before though - it _could_ be somehow chunking the text and the boundary happens to be there [02:14:57] yeah looks find with curl too [02:15:11] that's why "ediaWiki" works [02:15:11] yep [02:15:24] _some_ kind of boundary is there, not sure wtf it is or why something changed suddenly [02:16:04] looks *fine* with curl, I mean [02:16:55] the "1000" appears in another place too [02:17:02] 1000 [02:17:03] ass="phui-list-item-href"> [02:17:55] are they 1000 bytes apart? [02:19:46] no, it's like 3117 bytes until the first [02:20:27] cant think of anything that changed today .. [02:32:46] twentyafterfour: i gotta go for now, it's not 1000 bytes and i tried a few things but still nothing that explains it [02:33:50] mutante: it's a mystery [02:34:07] thanks for debugging it with me though. I can't think of much else to look at [02:39:47] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.18) (duration: 17m 49s) [02:39:54] PROBLEM - puppet last run on mw2078 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [02:39:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:40:59] twentyafterfour: yw, https://github.com/nagios-plugins/nagios-plugins/issues/76 < looks like this or similar [02:41:02] bbl [02:50:40] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0 [03:07:29] RECOVERY - puppet last run on mw2078 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [03:17:05] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.19) (duration: 18m 25s) [03:17:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:23:50] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Sep 15 03:23:49 UTC 2016 (duration 6m 44s) [03:23:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:27:38] (03PS1) 10Alex Monk: puppet::self::gitclone: Get gitdir from puppetmaster::base_repo [puppet] - 10https://gerrit.wikimedia.org/r/310729 [04:52:02] PROBLEM - puppet last run on db2047 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:17:01] RECOVERY - puppet last run on db2047 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:24:26] <_joe_> !log turning off nitrogen for memory reduction, reimage [06:24:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:29:20] !log installing chromium security updates on osmium [06:29:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:30:32] PROBLEM - puppet last run on ms-be2025 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [06:54:51] !log renaming tables before dropping them - T145487 [06:54:52] T145487: Investigate (and if possible drop _counters) - https://phabricator.wikimedia.org/T145487 [06:54:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:55:33] RECOVERY - puppet last run on ms-be2025 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:00:59] !log installing libidn security updates [07:01:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:05:56] (03PS1) 10Giuseppe Lavagetto: puppetdb: absent the debian-shipped config file [puppet] - 10https://gerrit.wikimedia.org/r/310742 [07:06:51] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] puppetdb: absent the debian-shipped config file [puppet] - 10https://gerrit.wikimedia.org/r/310742 (owner: 10Giuseppe Lavagetto) [07:11:52] PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [07:16:29] 06Operations, 06Reading-Infrastructure-Team, 06Services, 06Services-next, 07Security-General: Protect sensitive user-related information with a UserData / auth / session service - https://phabricator.wikimedia.org/T140813#2639273 (10Tgr) The goals state //Even with remote execution on the client, there i... [07:20:14] PROBLEM - puppet last run on cp3021 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [07:20:31] (03CR) 10Marostegui: [C: 031] Drop the malloc wrapper from mysqld_safe [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/310529 (https://phabricator.wikimedia.org/T145378) (owner: 10Muehlenhoff) [07:23:07] (03CR) 10Jcrespo: ""Currently this poses some risks, e.g. 
a broken Ubuntu/Debian update would spread across the cluster automatically."" [puppet] - 10https://gerrit.wikimedia.org/r/310702 (https://phabricator.wikimedia.org/T115348) (owner: 10Paladox) [07:33:58] 06Operations, 10ops-eqiad, 10DBA: db1082 hardware check - https://phabricator.wikimedia.org/T145607#2639280 (10Marostegui) Note: Replication was started and it went well. The host was powered off a bit after for a memtest [07:37:20] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [07:44:07] (03CR) 10Legoktm: [C: 04-1] "This is intentional, CI specifically wants the latest version of these packages." [puppet] - 10https://gerrit.wikimedia.org/r/310704 (https://phabricator.wikimedia.org/T115348) (owner: 10Dzahn) [07:44:31] (03CR) 10Legoktm: [C: 04-1] "This is intentional, we want the latest versions of these packages." [puppet] - 10https://gerrit.wikimedia.org/r/310708 (https://phabricator.wikimedia.org/T115348) (owner: 10Paladox) [07:45:52] RECOVERY - puppet last run on cp3021 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [07:57:52] 06Operations, 10DBA: Drop database table "email_capture" from Wikimedia wikis - https://phabricator.wikimedia.org/T57676#2639339 (10Marostegui) a:03Marostegui [07:58:10] What did that do? 
lol [08:00:11] (03CR) 10Jcrespo: [C: 04-2] "See: https://gerrit.wikimedia.org/r/301076" [puppet] - 10https://gerrit.wikimedia.org/r/310712 (https://phabricator.wikimedia.org/T115348) (owner: 10Paladox) [08:02:09] (03CR) 10Jcrespo: [C: 031] Drop the malloc wrapper from mysqld_safe [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/310529 (https://phabricator.wikimedia.org/T145378) (owner: 10Muehlenhoff) [08:06:41] (03PS1) 10Muehlenhoff: Update list of mailman site languages [puppet] - 10https://gerrit.wikimedia.org/r/310746 (https://phabricator.wikimedia.org/T144933) [08:11:16] !log altering tables in S7 - eqiad hosts - T141951 [08:11:17] T141951: Add local_user_id and global_user_id fields to localuser table in centralauth database - https://phabricator.wikimedia.org/T141951 [08:11:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:11:22] (03CR) 10Hashar: "That is part of an effort to get rid of puppet ensure => latest (T115348), but as Legoktm said it is intentional here." [puppet] - 10https://gerrit.wikimedia.org/r/310708 (https://phabricator.wikimedia.org/T115348) (owner: 10Paladox) [08:12:28] (03CR) 10Hashar: [C: 04-1] "That is intentional until we migrate to use unattended-upgrade (see my comment on https://gerrit.wikimedia.org/r/#/c/310708/ )." 
[puppet] - 10https://gerrit.wikimedia.org/r/310704 (https://phabricator.wikimedia.org/T115348) (owner: 10Dzahn) [08:15:00] (03CR) 10Hashar: "CI has a use case for ensure => latest, I have commented on https://gerrit.wikimedia.org/r/310708 how we can migrate to use unattended upg" [puppet] - 10https://gerrit.wikimedia.org/r/310702 (https://phabricator.wikimedia.org/T115348) (owner: 10Paladox) [08:18:36] (03CR) 10Hashar: [C: 031] "From https://gerrit.wikimedia.org/r/308311" [puppet] - 10https://gerrit.wikimedia.org/r/310692 (https://phabricator.wikimedia.org/T1357) (owner: 10Dzahn) [08:19:11] Error: 503, Backend fetch failed at Thu, 15 Sep 2016 08:18:46 GMT creating a new task on phab [08:19:34] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, 03Discovery-Wikidata-Query-Service-Sprint: some icinga checks on WDQS do not send notifications - https://phabricator.wikimedia.org/T144948#2639472 (10Gehel) So it seems that puppet failures on wdqs1001 are notified on IRC (# wikidata), but... [08:20:41] (03CR) 10Hashar: "It looks like those instances have a broken puppet.conf and can't reach the puppetmaster / point to a wrong fqdn. So yeah unrelated." [puppet] - 10https://gerrit.wikimedia.org/r/308322 (https://phabricator.wikimedia.org/T93645) (owner: 10Hashar) [08:23:19] 06Operations, 10Beta-Cluster-Infrastructure, 07HHVM, 13Patch-For-Review: Move the MW Beta appservers to Debian - https://phabricator.wikimedia.org/T144006#2639499 (10elukey) [08:23:31] 06Operations, 10Beta-Cluster-Infrastructure, 07HHVM, 13Patch-For-Review: Move the MW Beta appservers to Debian - https://phabricator.wikimedia.org/T144006#2586022 (10elukey) The task is currently blocked by T145611 [08:23:48] 06Operations, 10Beta-Cluster-Infrastructure, 07HHVM, 13Patch-For-Review: Move the MW Beta appservers to Debian - https://phabricator.wikimedia.org/T144006#2639519 (10elukey) a:03elukey [08:26:54] PROBLEM - puppet last run on lvs2003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [08:27:59] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, 03Discovery-Wikidata-Query-Service-Sprint: some icinga checks on WDQS do not send notifications - https://phabricator.wikimedia.org/T144948#2639565 (10Gehel) Actually, it seems that only the `WDQS_Lag` check is not reported to the wdqs-admi... [08:31:39] 06Operations, 06Release-Engineering-Team, 07HHVM: Migrate deployment servers (tin/mira) to jessie - https://phabricator.wikimedia.org/T144578#2639586 (10hashar) @mmodell @thcipriani @demon @dduvall can you check mira02 on beta is all fine ? I dont feel confident double checking that is working properly. A... [08:31:46] 06Operations, 06Performance-Team, 10Thumbor: Use intermediary high-quality JPEGs rather than PNGs for PDF thumbnailing - https://phabricator.wikimedia.org/T145637#2639589 (10Gilles) [08:33:10] 06Operations, 06Operations-Software-Development, 07HHVM: Migrate video scalers to jessie - https://phabricator.wikimedia.org/T145742#2639607 (10MoritzMuehlenhoff) [08:35:53] (03PS1) 10Gehel: wdqs - send notifications of WDQS lag also to wdqs-admins group [puppet] - 10https://gerrit.wikimedia.org/r/310748 (https://phabricator.wikimedia.org/T144948) [08:36:52] (03CR) 10jenkins-bot: [V: 04-1] wdqs - send notifications of WDQS lag also to wdqs-admins group [puppet] - 10https://gerrit.wikimedia.org/r/310748 (https://phabricator.wikimedia.org/T144948) (owner: 10Gehel) [08:37:28] (03PS2) 10Gehel: wdqs - send notifications of WDQS lag also to wdqs-admins group [puppet] - 10https://gerrit.wikimedia.org/r/310748 (https://phabricator.wikimedia.org/T144948) [08:38:50] (03PS1) 10Elukey: Remove mediawiki03 from deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/310749 (https://phabricator.wikimedia.org/T144006) [08:39:15] PROBLEM - puppet last run on cp3049 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [08:41:24] 06Operations, 06Multimedia, 06Operations-Software-Development, 10TimedMediaHandler, 07HHVM: Migrate video scalers to jessie - https://phabricator.wikimedia.org/T145742#2639681 (10hashar) On beta one can try upgrading/switching deployment-tmh01.deployment-prep.eqiad.wmflabs (tmh stands for TimedMediaHand... [08:43:33] @time [08:43:33] Time now (various TZ): Sep 15, 2016 8:43AM UTC | 4:43AM EDT |10:43AM CEST | 6:43PM AEST [08:43:50] (03PS2) 10Urbanecm: Fix hewiki logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/310228 (https://phabricator.wikimedia.org/T145017) [08:44:13] moin hashar [08:44:31] (03PS1) 10Muehlenhoff: Add mira02 to dsh groups for labs [puppet] - 10https://gerrit.wikimedia.org/r/310752 (https://phabricator.wikimedia.org/T144578) [08:44:35] addshore: o/ [08:44:41] jouncebot: next [08:44:41] In 4 hour(s) and 15 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160915T1300) [08:44:45] jouncebot: now [08:44:45] No deployments scheduled for the next 4 hour(s) and 15 minute(s) [08:44:48] (03CR) 10Elukey: [C: 032] Remove mediawiki03 from deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/310749 (https://phabricator.wikimedia.org/T144006) (owner: 10Elukey) [08:45:20] hashar: im going to schedule a slot just to push out some i18n updates for RevisionSlider https://gerrit.wikimedia.org/r/#/c/310751 which will need a full scap, rather than taking up most of eu swat.... 
[08:45:22] (03PS1) 10Urbanecm: Fix logos for hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/310753 (https://phabricator.wikimedia.org/T145017) [08:47:15] (03PS2) 10Urbanecm: Fix logos for hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/310753 (https://phabricator.wikimedia.org/T145017) [08:47:45] (03PS3) 10Urbanecm: Fix hewiki logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/310228 (https://phabricator.wikimedia.org/T145017) [08:49:19] hashar: sound okay? [08:53:08] RECOVERY - puppet last run on lvs2003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [08:53:28] addshore: -wmf.19 is only on group0 for now [08:53:54] I have hold the train due to a couple bugs (which should be fixed now [08:54:13] and group1? [08:54:15] ahhhh [08:54:31] but yeah can do a full scap before the eu swat [08:54:35] I am busy this morning / lunch time [08:54:40] so lets say 2pm ? [08:55:07] 2pm what tZ? ;) [08:55:31] noon UTC / 14:00 CEST ? :D [08:55:37] PROBLEM - puppet last run on gallium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:55:43] (03CR) 10Filippo Giunchedi: [C: 032] install_server: use separate /srv for bastions [puppet] - 10https://gerrit.wikimedia.org/r/309995 (owner: 10Filippo Giunchedi) [08:55:47] that puppet error on gallium is not me [08:55:55] okay! I think I am CEST (now in belgium) I'll add it to the calander! [08:57:05] {{done}} [08:57:11] 06Operations, 06Performance-Team, 10Thumbor: VIPS engine should generate JPG when dealing with TIFFs and not have the IM engine read it - https://phabricator.wikimedia.org/T145638#2639821 (10Gilles) [08:57:58] PROBLEM - puppet last run on labsdb1007 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[pg_basebackup-labsdb1006.eqiad.wmnet] [08:58:09] RECOVERY - puppet last run on gallium is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [08:58:31] (03PS3) 10Filippo Giunchedi: site: install prometheus server in esams and ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/309996 (https://phabricator.wikimedia.org/T126785) [08:58:33] (03PS3) 10Filippo Giunchedi: install_server: use separate /srv for bastions [puppet] - 10https://gerrit.wikimedia.org/r/309995 [08:59:32] (03PS1) 10Elukey: Add mediawiki06 to the deployment-prep scap dsh [puppet] - 10https://gerrit.wikimedia.org/r/310756 (https://phabricator.wikimedia.org/T144006) [09:01:08] <_joe_> hashar: the error on gallium was me but it was a noop run [09:01:14] (03CR) 10Filippo Giunchedi: [C: 032] install_server: use separate /srv for bastions [puppet] - 10https://gerrit.wikimedia.org/r/309995 (owner: 10Filippo Giunchedi) [09:01:37] (03CR) 10Elukey: [C: 032] Add mediawiki06 to the deployment-prep scap dsh [puppet] - 10https://gerrit.wikimedia.org/r/310756 (https://phabricator.wikimedia.org/T144006) (owner: 10Elukey) [09:01:39] _joe_: great! thank you :) [09:01:44] (03PS2) 10Elukey: Add mediawiki06 to the deployment-prep scap dsh [puppet] - 10https://gerrit.wikimedia.org/r/310756 (https://phabricator.wikimedia.org/T144006) [09:02:11] * elukey just got rebase-snipered by godog [09:02:31] haha oops elukey [09:02:33] (03CR) 10Elukey: [V: 032] Add mediawiki06 to the deployment-prep scap dsh [puppet] - 10https://gerrit.wikimedia.org/r/310756 (https://phabricator.wikimedia.org/T144006) (owner: 10Elukey) [09:02:48] :D [09:02:49] RECOVERY - puppet last run on cp3049 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [09:03:28] jouncebot: refresh [09:03:31] I refreshed my knowledge about deployments. 
[09:03:34] jouncebot: next [09:03:35] In 2 hour(s) and 56 minute(s): RevisionSlider (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160915T1200) [09:03:58] addshore: thanks. Ping me before the window since I will certainly forget and I am not sure whether jouncebot is going to ping :D [09:07:10] (03PS1) 10Alexandros Kosiaris: osm: Also allow postgres connection from postgres slave [puppet] - 10https://gerrit.wikimedia.org/r/310759 [09:07:58] (03CR) 10Volans: "Although I don't have much context, see my 2 cents inline" (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/310719 (https://phabricator.wikimedia.org/T135427) (owner: 10Thcipriani) [09:13:03] 06Operations, 06Performance-Team, 10Thumbor: pdf failure - https://phabricator.wikimedia.org/T145617#2639859 (10Gilles) [09:13:56] sigh, bast3001 console is borked, or at least not the right settings, seeing only garbage from agetty [09:14:07] not sure I want to reimage it in case it goes south [09:14:25] <_joe_> don't [09:14:48] <_joe_> also our dc technician in AMS is not there right now :) [09:15:25] indeed [09:17:33] hashar: will do [09:17:51] 06Operations, 06Performance-Team, 10Thumbor: djvu failure for very high page number - https://phabricator.wikimedia.org/T145616#2639866 (10Gilles) [09:25:03] 06Operations, 06Performance-Team, 10Thumbor: Extremely noisy ffmpeg errors - https://phabricator.wikimedia.org/T145612#2639871 (10Gilles) For the record, that original plays fine in firefox. It's likely to be an ffmpeg bug fixed in newer versions of ffmpeg. [09:25:14] godog: I had that a few times, is it still garbled when just pressing Enter? 
[09:26:03] AFAICS sometimes there's noise on the output on the immediate connection, but when simply pressing Enter I got a tty console shell [09:26:43] moritzm: yeah still garbled, no reaction to enter and restarting agetty outputs more garbage [09:27:18] ah, hardware sucks :-/ [09:28:42] quite, never fails to disappoint [09:29:09] Operations, Traffic, media-storage, Patch-For-Review: Certain images failing to load in ulsfo - https://phabricator.wikimedia.org/T144257#2639881 (ema) @JasperStPierre reported another occurrence of this issue on IRC (2016-09-14 22:18 UTC): https://upload.wikimedia.org/wikipedia/commons/thumb/... [09:30:28] bah, same settings as nescio, where the console works [09:35:10] volans: I'll try wmf-auto-reimage another time :( [09:35:15] !log reimaging mw1250 to jessie [09:35:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:35:27] godog: yeah, no problem [09:36:29] (PS2) Jcrespo: Copy over mysql_wmf::mylvmbackup to mariadb [puppet/mariadb] - https://gerrit.wikimedia.org/r/308785 (owner: Paladox) [09:36:55] Operations, ops-esams: bast3001 garbled console - https://phabricator.wikimedia.org/T145756#2639892 (fgiunchedi) [09:39:04] (CR) Jcrespo: "Ok with the change except I do not understand the logrotate change. Maybe the change was applied after it was proposed?" (1 comment) [puppet/mariadb] - https://gerrit.wikimedia.org/r/308785 (owner: Paladox) [09:40:25] (PS2) Jcrespo: Drop the malloc wrapper from mysqld_safe [puppet/mariadb] - https://gerrit.wikimedia.org/r/310529 (https://phabricator.wikimedia.org/T145378) (owner: Muehlenhoff) [09:42:12] Operations, DBA: Drop database table "email_capture" from Wikimedia wikis - https://phabricator.wikimedia.org/T57676#2639920 (Marostegui) This table currently exists at: S1 (enwiki - ie: db1053) S3 (testwiki - ie: db1044) As it has been said, it has not been written in a long time. ``` root@db1053:/sr... 
[09:42:25] (CR) Filippo Giunchedi: "merged but reimage of bast3001 postponed -- https://phabricator.wikimedia.org/T145756" [puppet] - https://gerrit.wikimedia.org/r/309995 (owner: Filippo Giunchedi) [09:43:10] (PS2) Filippo Giunchedi: prometheus: add aggregation rules for ops [puppet] - https://gerrit.wikimedia.org/r/307310 [09:44:38] (CR) Filippo Giunchedi: [C: 2] prometheus: add aggregation rules for ops [puppet] - https://gerrit.wikimedia.org/r/307310 (owner: Filippo Giunchedi) [09:45:00] (CR) Jcrespo: [C: 1] "I am unsure about this- if we change mysqld_safe for our setup, I would hardcode it to only use /etc/my.cnf or defaults-file / defaults-ex" [puppet/mariadb] - https://gerrit.wikimedia.org/r/310530 (https://phabricator.wikimedia.org/T145378) (owner: Muehlenhoff) [09:45:51] (PS1) Ema: cache_upload fe: do not set do_stream=true on Varnish 4 [puppet] - https://gerrit.wikimedia.org/r/310767 (https://phabricator.wikimedia.org/T144257) [09:49:18] (CR) Muehlenhoff: "Ack, this patch only fixes the additional vector mentioned in the advisory. We can additionally hardcode in a followup patch." [puppet/mariadb] - https://gerrit.wikimedia.org/r/310530 (https://phabricator.wikimedia.org/T145378) (owner: Muehlenhoff) [09:54:16] (CR) Paladox: "@Jcrespo I copied it and didn't change that file." [puppet/mariadb] - https://gerrit.wikimedia.org/r/308785 (owner: Paladox) [09:55:58] (PS3) Paladox: Copy over mysql_wmf::mylvmbackup to mariadb [puppet/mariadb] - https://gerrit.wikimedia.org/r/308785 [09:56:21] (CR) Paladox: Copy over mysql_wmf::mylvmbackup to mariadb (1 comment) [puppet/mariadb] - https://gerrit.wikimedia.org/r/308785 (owner: Paladox) [09:56:51] (PS1) Volans: Auto-reimage: increase timeout for Icinga command [puppet] - https://gerrit.wikimedia.org/r/310768 (https://phabricator.wikimedia.org/T143536) [09:57:07] (CR) Paladox: "And yep looks like it was done after I uploaded this patch. 
https://phabricator.wikimedia.org/rOPUP385cddbe7ab7ccaf1d4a2ee68b1fa2b5cc4e76b" [puppet/mariadb] - https://gerrit.wikimedia.org/r/308785 (owner: Paladox) [09:58:16] (CR) Volans: [C: 2] Auto-reimage: increase timeout for Icinga command [puppet] - https://gerrit.wikimedia.org/r/310768 (https://phabricator.wikimedia.org/T143536) (owner: Volans) [09:59:38] (PS4) Filippo Giunchedi: monitoring: add check_prometheus define [puppet] - https://gerrit.wikimedia.org/r/307269 [10:00:50] PROBLEM - puppet last run on nihal is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 1 minute ago with 2 failures. Failed resources (up to 3 shown): Exec[create_postgres_db_puppetdb],Service[postgresql@9.4-main] [10:01:32] (CR) Marostegui: [C: 1] Don't source a custom mysql configuration from /srv/sqldata/my.conf [puppet/mariadb] - https://gerrit.wikimedia.org/r/310530 (https://phabricator.wikimedia.org/T145378) (owner: Muehlenhoff) [10:02:31] (CR) Jcrespo: [C: 2] Copy over mysql_wmf::mylvmbackup to mariadb [puppet/mariadb] - https://gerrit.wikimedia.org/r/308785 (owner: Paladox) [10:02:34] (CR) Filippo Giunchedi: [C: 2] monitoring: add check_prometheus define [puppet] - https://gerrit.wikimedia.org/r/307269 (owner: Filippo Giunchedi) [10:02:53] (PS1) Alexandros Kosiaris: postgres: Have postgres::db require the server class [puppet] - https://gerrit.wikimedia.org/r/310770 [10:03:04] (PS2) Jcrespo: Don't source a custom mysql configuration from /srv/sqldata/my.conf [puppet/mariadb] - https://gerrit.wikimedia.org/r/310530 (https://phabricator.wikimedia.org/T145378) (owner: Muehlenhoff) [10:03:42] (CR) Volans: monitoring: add check_prometheus define (1 comment) [puppet] - https://gerrit.wikimedia.org/r/307269 (owner: Filippo Giunchedi) [10:04:16] (CR) Jcrespo: [C: 2] Don't source a custom mysql configuration from /srv/sqldata/my.conf [puppet/mariadb] - https://gerrit.wikimedia.org/r/310530 
(https://phabricator.wikimedia.org/T145378) (owner: Muehlenhoff) [10:04:21] (CR) Filippo Giunchedi: "recheck" [puppet] - https://gerrit.wikimedia.org/r/310557 (https://phabricator.wikimedia.org/T145659) (owner: Filippo Giunchedi) [10:06:04] (CR) Paladox: "Thanks." [puppet/mariadb] - https://gerrit.wikimedia.org/r/308785 (owner: Paladox) [10:07:11] (PS1) Alexandros Kosiaris: check_postgres_replication_lag.py: Rewrite parts of it [puppet] - https://gerrit.wikimedia.org/r/310772 [10:07:32] volans: thanks, I'll followup with another gerrit review re: check_prometheus [10:07:36] (PS2) Alexandros Kosiaris: postgres: Have postgres::db require the server class [puppet] - https://gerrit.wikimedia.org/r/310770 [10:07:40] (CR) Alexandros Kosiaris: [C: 2 V: 2] postgres: Have postgres::db require the server class [puppet] - https://gerrit.wikimedia.org/r/310770 (owner: Alexandros Kosiaris) [10:08:00] godog: just a suggestion, completely optional ;) [10:09:02] yeah I think it makes sense the more errors we can catch with PCC the better [10:09:11] (CR) jenkins-bot: [V: -1] check_postgres_replication_lag.py: Rewrite parts of it [puppet] - https://gerrit.wikimedia.org/r/310772 (owner: Alexandros Kosiaris) [10:09:31] !log reimaging mw1251 to jessie [10:09:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:09:53] Operations, Discovery, Maps, Traffic, Interactive-Sprint: Maps - move traffic to eqiad instead of codfw - https://phabricator.wikimedia.org/T145758#2639951 (Gehel) [10:10:05] (PS2) Ema: cache_upload fe: do not set do_stream=true on Varnish 4 [puppet] - https://gerrit.wikimedia.org/r/310767 (https://phabricator.wikimedia.org/T144257) [10:10:11] (CR) Ema: [C: 2 V: 2] cache_upload fe: do not set do_stream=true on Varnish 4 [puppet] - https://gerrit.wikimedia.org/r/310767 (https://phabricator.wikimedia.org/T144257) (owner: Ema) [10:11:08] Operations, 
Discovery, Maps, Traffic, Interactive-Sprint: Maps - move traffic to eqiad instead of codfw - https://phabricator.wikimedia.org/T145758#2639971 (Gehel) [10:11:25] Operations, Performance-Team, Thumbor: Extremely noisy ffmpeg errors - https://phabricator.wikimedia.org/T145612#2639972 (fgiunchedi) quite possible, I see stretch has got ffmpeg 3.1 so we could update it in jessie-backports too. I couldn't get the full command line used from the logs though to test... [10:11:40] (PS1) Elukey: Set mediawiki06 (replacement of mediawiki03) as security audit target [puppet] - https://gerrit.wikimedia.org/r/310773 (https://phabricator.wikimedia.org/T144006) [10:12:14] (PS2) Elukey: Set mediawiki06 (replacement of mediawiki03) as security audit target [puppet] - https://gerrit.wikimedia.org/r/310773 (https://phabricator.wikimedia.org/T144006) [10:15:12] (CR) Hashar: [C: 1] Set mediawiki06 (replacement of mediawiki03) as security audit target [puppet] - https://gerrit.wikimedia.org/r/310773 (https://phabricator.wikimedia.org/T144006) (owner: Elukey) [10:16:14] (CR) Elukey: [C: 2] Set mediawiki06 (replacement of mediawiki03) as security audit target [puppet] - https://gerrit.wikimedia.org/r/310773 (https://phabricator.wikimedia.org/T144006) (owner: Elukey) [10:16:25] Operations, Performance-Team, Thumbor: Extremely noisy ffmpeg errors - https://phabricator.wikimedia.org/T145612#2635959 (MoritzMuehlenhoff) Firefox uses an internal copy of libvpx for decoding vp9 files and only dlopens gstreamer-ffmpeg for a few selected codecs. Updating ffmpeg is still a good idea... [10:16:42] (PS2) Alexandros Kosiaris: check_postgres_replication_lag.py: Rewrite parts of it [puppet] - https://gerrit.wikimedia.org/r/310772 [10:17:51] <_joe_> a large batch is incoming... 
[10:17:54] (PS1) Giuseppe Lavagetto: varnish: move templates to module [puppet] - https://gerrit.wikimedia.org/r/310774 [10:17:56] (PS1) Giuseppe Lavagetto: role::memcached: move template to module [puppet] - https://gerrit.wikimedia.org/r/310775 [10:17:58] (PS1) Giuseppe Lavagetto: backups: move templates to module [puppet] - https://gerrit.wikimedia.org/r/310776 [10:18:00] (PS1) Giuseppe Lavagetto: installserver: move squid template to modules [puppet] - https://gerrit.wikimedia.org/r/310777 [10:18:02] (PS1) Giuseppe Lavagetto: graphite: move templates to role [puppet] - https://gerrit.wikimedia.org/r/310778 [10:18:04] (PS1) Giuseppe Lavagetto: role::logging: move templates to role [puppet] - https://gerrit.wikimedia.org/r/310779 [10:18:06] (PS1) Giuseppe Lavagetto: kibana: move templates to role [puppet] - https://gerrit.wikimedia.org/r/310780 [10:18:08] (PS1) Giuseppe Lavagetto: toollabs: move templates to role [puppet] - https://gerrit.wikimedia.org/r/310781 [10:18:10] (PS1) Giuseppe Lavagetto: standard::mail::sender: move template to module [puppet] - https://gerrit.wikimedia.org/r/310782 [10:18:12] (PS1) Giuseppe Lavagetto: role::mha: move templates to module [puppet] - https://gerrit.wikimedia.org/r/310783 [10:18:14] (PS1) Giuseppe Lavagetto: templates: clean up templates/misc [puppet] - https://gerrit.wikimedia.org/r/310784 [10:18:23] <_joe_> goodbye grrrit-wm [10:18:29] <_joe_> it was 16 changes, for the record [10:24:00] (PS1) Muehlenhoff: Also handle systemd in keyholder script [puppet] - https://gerrit.wikimedia.org/r/310793 (https://phabricator.wikimedia.org/T144578) [10:26:13] lol [10:26:17] (PS1) Hashar: admin: update hashar gdbinit script [puppet] - https://gerrit.wikimedia.org/r/310794 [10:27:26] Operations, Performance-Team, Thumbor: Extremely noisy ffmpeg errors - https://phabricator.wikimedia.org/T145612#2640055 (Gilles) So would updating libvpx, then, if 
that's what ffmpeg uses too? [10:27:44] (PS1) Alexandros Kosiaris: postgres: Provide new packages class for others to rely on [puppet] - https://gerrit.wikimedia.org/r/310795 [10:29:23] Operations, Performance-Team, Thumbor: Extremely noisy ffmpeg errors - https://phabricator.wikimedia.org/T145612#2640058 (Gilles) [10:29:29] RECOVERY - puppet last run on nihal is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:29:29] (Abandoned) Urbanecm: Fix logos for hewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/310753 (https://phabricator.wikimedia.org/T145017) (owner: Urbanecm) [10:29:42] (Abandoned) Urbanecm: Fix hewiki logos [mediawiki-config] - https://gerrit.wikimedia.org/r/310228 (https://phabricator.wikimedia.org/T145017) (owner: Urbanecm) [10:29:44] (CR) Filippo Giunchedi: [C: 1] delete ircyall module and role [puppet] - https://gerrit.wikimedia.org/r/310692 (https://phabricator.wikimedia.org/T1357) (owner: Dzahn) [10:31:06] Operations, Performance-Team, Thumbor: 'NoneType' object has no attribute 'lstrip' - https://phabricator.wikimedia.org/T145505#2640062 (Gilles) Found the answer: firefox was modifying the svg contents when opening the original. This should have been fixed by my tweaking of the SVG detection code. 
[10:31:26] (PS1) Elukey: Remove mediawiki02 from deployment prep [puppet] - https://gerrit.wikimedia.org/r/310796 (https://phabricator.wikimedia.org/T144006) [10:33:41] (PS2) Elukey: Remove mediawiki02 from deployment prep [puppet] - https://gerrit.wikimedia.org/r/310796 (https://phabricator.wikimedia.org/T144006) [10:37:14] (PS1) Alexandros Kosiaris: postgres: Allow replication user to connect to all databases [puppet] - https://gerrit.wikimedia.org/r/310797 [10:37:36] (PS1) Urbanecm: Fix hewiki logos [mediawiki-config] - https://gerrit.wikimedia.org/r/310798 (https://phabricator.wikimedia.org/T145017) [10:38:22] Operations, Performance-Team, Thumbor: djvu failure for very high page number - https://phabricator.wikimedia.org/T145616#2640091 (Gilles) This is really odd. When I run the test directly with the following command it passes: ``` nosetests tests/integration/test_types.py:WikimediaTest.test_huge_djvu... [10:40:05] (PS3) Alexandros Kosiaris: check_postgres_replication_lag.py: Rewrite parts of it [puppet] - https://gerrit.wikimedia.org/r/310772 [10:40:27] Operations, Performance-Team, Thumbor: Extremely noisy ffmpeg errors - https://phabricator.wikimedia.org/T145612#2640094 (MoritzMuehlenhoff) ffmpeg generally strives to use natively implemented encoders/decoders (since that allows them to use common, optimised code shared across codecs) and only uses... 
[10:41:10] (CR) Giuseppe Lavagetto: "https://puppet-compiler.wmflabs.org/4086/" [puppet] - https://gerrit.wikimedia.org/r/310774 (owner: Giuseppe Lavagetto) [10:41:24] (CR) Hashar: [C: 1] Remove mediawiki02 from deployment prep [puppet] - https://gerrit.wikimedia.org/r/310796 (https://phabricator.wikimedia.org/T144006) (owner: Elukey) [10:41:34] (CR) Giuseppe Lavagetto: [C: 2] varnish: move templates to module [puppet] - https://gerrit.wikimedia.org/r/310774 (owner: Giuseppe Lavagetto) [10:42:05] (CR) Elukey: [C: 1] "LGTM, https://puppet-compiler.wmflabs.org/4087/" [puppet] - https://gerrit.wikimedia.org/r/310775 (owner: Giuseppe Lavagetto) [10:44:25] (CR) Elukey: [C: -1] "In modules/backup/manifests/mysqlset.pp there seems to be a reference that has not been addressed:" [puppet] - https://gerrit.wikimedia.org/r/310776 (owner: Giuseppe Lavagetto) [10:44:37] <_joe_> thanks elukey [10:45:48] _joe_ I hope to get things right :) [10:46:10] (CR) Giuseppe Lavagetto: [C: 2] role::memcached: move template to module [puppet] - https://gerrit.wikimedia.org/r/310775 (owner: Giuseppe Lavagetto) [10:46:35] (CR) Hashar: "Neat, let's apply it on beta :}" [puppet] - https://gerrit.wikimedia.org/r/310793 (https://phabricator.wikimedia.org/T144578) (owner: Muehlenhoff) [10:46:37] (PS2) Alexandros Kosiaris: postgres: Provide new packages class for others to rely on [puppet] - https://gerrit.wikimedia.org/r/310795 [10:46:40] (CR) Alexandros Kosiaris: [C: 2 V: 2] postgres: Provide new packages class for others to rely on [puppet] - https://gerrit.wikimedia.org/r/310795 (owner: Alexandros Kosiaris) [10:46:50] (PS2) Alexandros Kosiaris: postgres: Allow replication user to connect to all databases [puppet] - https://gerrit.wikimedia.org/r/310797 [10:46:54] (CR) Alexandros Kosiaris: [C: 2 V: 2] postgres: Allow replication user to connect to all databases [puppet] - 
https://gerrit.wikimedia.org/r/310797 (owner: Alexandros Kosiaris) [10:47:46] (CR) Giuseppe Lavagetto: [C: 2] "...and that will keep working." [puppet] - https://gerrit.wikimedia.org/r/310776 (owner: Giuseppe Lavagetto) [10:53:15] (CR) jenkins-bot: [V: -1] udp2log: move templates to the proper position [puppet] - https://gerrit.wikimedia.org/r/310787 (owner: Giuseppe Lavagetto) [10:53:21] (CR) Elukey: "Not sure if relevant but should we update also:" [puppet] - https://gerrit.wikimedia.org/r/310777 (owner: Giuseppe Lavagetto) [10:53:47] (CR) Giuseppe Lavagetto: "Oh, I see, s/backups/backup/, sorry, will amend." [puppet] - https://gerrit.wikimedia.org/r/310776 (owner: Giuseppe Lavagetto) [10:54:34] (CR) Elukey: "It seems contained in https://gerrit.wikimedia.org/r/#/c/310778/1" [puppet] - https://gerrit.wikimedia.org/r/310777 (owner: Giuseppe Lavagetto) [10:55:21] (PS2) Giuseppe Lavagetto: backups: move templates to module [puppet] - https://gerrit.wikimedia.org/r/310776 [10:56:27] (CR) Elukey: "Looks good but manifests/role/installserver.pp might probably be better in 310777?" [puppet] - https://gerrit.wikimedia.org/r/310778 (owner: Giuseppe Lavagetto) [10:56:49] <_joe_> elukey: nah, I left those all together for a reason [10:57:00] <_joe_> I wanted to go dir by dir [10:57:10] ah okok [10:57:22] (CR) Giuseppe Lavagetto: [C: 2 V: 2] backups: move templates to module [puppet] - https://gerrit.wikimedia.org/r/310776 (owner: Giuseppe Lavagetto) [10:57:23] well I am just reporting everything that I see to double check :) [10:58:22] (CR) Hashar: [C: -1] "cherry picked on beta cluster puppetmaster." 
[puppet] - https://gerrit.wikimedia.org/r/310793 (https://phabricator.wikimedia.org/T144578) (owner: Muehlenhoff) [10:58:25] <_joe_> elukey: I'm tempted to merge https://gerrit.wikimedia.org/r/#/c/310789/ first [10:58:37] Operations, Performance-Team, Thumbor: djvu failure for very high page number - https://phabricator.wikimedia.org/T145616#2640143 (Gilles) I'll switch back to using the command line ddjvu tool. [10:59:01] ahahahaha [10:59:30] (CR) jenkins-bot: [V: -1] url_downloader: move templates to role module [puppet] - https://gerrit.wikimedia.org/r/310788 (owner: Giuseppe Lavagetto) [11:00:23] <_joe_> akosiaris: puppet lint errors are yours I guess [11:00:33] <_joe_> 10:53:13 ./modules/postgresql/manifests/server.pp:54 WARNING indentation of => is not properly aligned (arrow_alignment) [11:01:12] (CR) Elukey: [C: 1] kibana: move templates to role [puppet] - https://gerrit.wikimedia.org/r/310780 (owner: Giuseppe Lavagetto) [11:01:48] (CR) Elukey: [C: 1] toollabs: move templates to role [puppet] - https://gerrit.wikimedia.org/r/310781 (owner: Giuseppe Lavagetto) [11:02:19] (CR) Elukey: [C: 1] standard::mail::sender: move template to module [puppet] - https://gerrit.wikimedia.org/r/310782 (owner: Giuseppe Lavagetto) [11:02:28] hmm [11:03:19] (CR) Elukey: [C: 1] role::mha: move templates to module [puppet] - https://gerrit.wikimedia.org/r/310783 (owner: Giuseppe Lavagetto) [11:03:51] warning... [11:03:53] sigh [11:03:57] damn puppet-lint [11:05:01] (PS1) Alexandros Kosiaris: postgres: Fix puppet-lint error [puppet] - https://gerrit.wikimedia.org/r/310801 [11:06:00] (CR) Elukey: [C: 1] "Checked all the deleted files, nothing references them in puppet. 
LGTM" [puppet] - https://gerrit.wikimedia.org/r/310784 (owner: Giuseppe Lavagetto) [11:06:31] (CR) Elukey: [C: 1] "Not referenced anymore, LGTM" [puppet] - https://gerrit.wikimedia.org/r/310785 (owner: Giuseppe Lavagetto) [11:06:59] (CR) jenkins-bot: [V: -1] templates: add large warning sign not to add anything anymore [puppet] - https://gerrit.wikimedia.org/r/310789 (owner: Giuseppe Lavagetto) [11:07:38] (CR) jenkins-bot: [V: -1] Also handle systemd in keyholder script [puppet] - https://gerrit.wikimedia.org/r/310793 (https://phabricator.wikimedia.org/T144578) (owner: Muehlenhoff) [11:07:50] (CR) Elukey: [C: 1] templates: move squid3 to role module [puppet] - https://gerrit.wikimedia.org/r/310786 (owner: Giuseppe Lavagetto) [11:07:52] (CR) jenkins-bot: [V: -1] admin: update hashar gdbinit script [puppet] - https://gerrit.wikimedia.org/r/310794 (owner: Hashar) [11:08:23] (CR) Elukey: [C: 1] url_downloader: move templates to role module [puppet] - https://gerrit.wikimedia.org/r/310788 (owner: Giuseppe Lavagetto) [11:08:31] !log CI / Jenkins is starved. 
Investigating [11:08:33] PROBLEM - Apache HTTP on mw1251 is CRITICAL: Connection refused [11:08:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:08:45] PROBLEM - Disk space on labsdb1007 is CRITICAL: DISK CRITICAL - free space: /srv/postgres 0 MB (0% inode=99%) [11:08:52] (CR) jenkins-bot: [V: -1] Remove mediawiki02 from deployment prep [puppet] - https://gerrit.wikimedia.org/r/310796 (https://phabricator.wikimedia.org/T144006) (owner: Elukey) [11:09:44] (CR) jenkins-bot: [V: -1] check_postgres_replication_lag.py: Rewrite parts of it [puppet] - https://gerrit.wikimedia.org/r/310772 (owner: Alexandros Kosiaris) [11:09:54] (CR) Alexandros Kosiaris: [C: 2 V: 2] postgres: Fix puppet-lint error [puppet] - https://gerrit.wikimedia.org/r/310801 (owner: Alexandros Kosiaris) [11:10:38] mw1251 is a reimaged host [11:11:04] !log CI is catching up. It is starved processing a long series of dependent changes in Gerrit [11:11:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:11:52] (PS4) Alexandros Kosiaris: check_postgres_replication_lag.py: Rewrite parts of it [puppet] - https://gerrit.wikimedia.org/r/310772 [11:12:52] akosiaris: when we can start apertium jessie testing? [11:13:46] RECOVERY - Apache HTTP on mw1251 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 8.815 second response time [11:14:28] (PS2) Giuseppe Lavagetto: installserver: move squid template to modules [puppet] - https://gerrit.wikimedia.org/r/310777 [11:14:56] kart_: maybe tomorrow ? [11:15:27] (CR) Giuseppe Lavagetto: "yes, of course, but I'll just these two together." 
[puppet] - https://gerrit.wikimedia.org/r/310778 (owner: Giuseppe Lavagetto) [11:16:08] Operations, Performance-Team, Thumbor: djvu failure for very high page number - https://phabricator.wikimedia.org/T145616#2640150 (Gilles) While the djvu issue is real, I've just discovered that the 504 error above is a different issue, that happens before thumbor even tries to process the image. It... [11:16:55] PROBLEM - puppet last run on mw2139 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:17:25] akosiaris: sounds good. Thanks. [11:17:51] (CR) Giuseppe Lavagetto: [C: 2] installserver: move squid template to modules [puppet] - https://gerrit.wikimedia.org/r/310777 (owner: Giuseppe Lavagetto) [11:18:07] (PS2) Giuseppe Lavagetto: graphite: move templates to role [puppet] - https://gerrit.wikimedia.org/r/310778 [11:18:29] (CR) Giuseppe Lavagetto: [C: 2 V: 2] graphite: move templates to role [puppet] - https://gerrit.wikimedia.org/r/310778 (owner: Giuseppe Lavagetto) [11:18:54] RECOVERY - Disk space on labsdb1007 is OK: DISK OK [11:20:05] (Draft2) MarcoAurelio: Enable $ExtraSignatureNamespaces for all namespaces for dewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/310802 (https://phabricator.wikimedia.org/T145619) [11:20:26] (PS2) Giuseppe Lavagetto: role::logging: move templates to role [puppet] - https://gerrit.wikimedia.org/r/310779 [11:20:38] (CR) jenkins-bot: [V: -1] role::logging: move templates to role [puppet] - https://gerrit.wikimedia.org/r/310779 (owner: Giuseppe Lavagetto) [11:20:44] (CR) MarcoAurelio: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/310802 (https://phabricator.wikimedia.org/T145619) (owner: MarcoAurelio) [11:20:53] _joe_: that failure is me [11:20:59] (CR) Hashar: "recheck" [puppet] - https://gerrit.wikimedia.org/r/310779 (owner: Giuseppe Lavagetto) [11:21:32] PROBLEM - puppet last run on 
labsdb1007 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 41 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[pg_basebackup-labsdb1006.eqiad.wmnet] [11:21:35] <_joe_> yeah I imagined that [11:21:46] <_joe_> akosiaris: ^^ [11:21:50] <_joe_> that you? [11:21:53] yup [11:22:30] (CR) Giuseppe Lavagetto: [C: 2] role::logging: move templates to role [puppet] - https://gerrit.wikimedia.org/r/310779 (owner: Giuseppe Lavagetto) [11:25:23] (CR) Giuseppe Lavagetto: [C: 2] kibana: move templates to role [puppet] - https://gerrit.wikimedia.org/r/310780 (owner: Giuseppe Lavagetto) [11:25:38] (PS2) Giuseppe Lavagetto: kibana: move templates to role [puppet] - https://gerrit.wikimedia.org/r/310780 [11:26:15] (CR) Giuseppe Lavagetto: [V: 2] kibana: move templates to role [puppet] - https://gerrit.wikimedia.org/r/310780 (owner: Giuseppe Lavagetto) [11:27:27] Operations, Performance-Team, Thumbor: Thumbor can't load source files bigger than 100MB - https://phabricator.wikimedia.org/T145768#2640194 (Gilles) [11:27:54] (CR) Alexandros Kosiaris: [C: 2] check_postgres_replication_lag.py: Rewrite parts of it [puppet] - https://gerrit.wikimedia.org/r/310772 (owner: Alexandros Kosiaris) [11:27:58] (PS5) Alexandros Kosiaris: check_postgres_replication_lag.py: Rewrite parts of it [puppet] - https://gerrit.wikimedia.org/r/310772 [11:28:00] (CR) Alexandros Kosiaris: [V: 2] check_postgres_replication_lag.py: Rewrite parts of it [puppet] - https://gerrit.wikimedia.org/r/310772 (owner: Alexandros Kosiaris) [11:28:13] (PS2) Giuseppe Lavagetto: toollabs: move templates to role [puppet] - https://gerrit.wikimedia.org/r/310781 [11:28:34] elukey, volans: reimage of mw1251 worked fine and it's totally awesome, thanks Riccardo! 
[11:29:21] glad it helps :) [11:29:27] (CR) Giuseppe Lavagetto: [C: 2 V: 2] toollabs: move templates to role [puppet] - https://gerrit.wikimedia.org/r/310781 (owner: Giuseppe Lavagetto) [11:29:34] there is still a lot of room for improvements in the long term [11:31:55] Operations, Multimedia, TimedMediaHandler, HHVM: Migrate video scalers to jessie - https://phabricator.wikimedia.org/T145742#2640217 (MoritzMuehlenhoff) [11:34:37] PROBLEM - Postgres Replication Lag on nihal is CRITICAL: CRITICAL: Master reports slave not active [11:34:45] (PS2) Giuseppe Lavagetto: standard::mail::sender: move template to module [puppet] - https://gerrit.wikimedia.org/r/310782 [11:34:55] <_joe_> akosiaris: \0/ [11:34:56] PROBLEM - puppet last run on puppetmaster1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:35:02] <_joe_> uhm [11:35:24] <_joe_> that's ^^ not true [11:35:32] ACKNOWLEDGEMENT - Postgres Replication Lag on nihal is CRITICAL: CRITICAL: Master reports slave not active alexandros kosiaris still working on it [11:37:36] (CR) Giuseppe Lavagetto: [C: 2] standard::mail::sender: move template to module [puppet] - https://gerrit.wikimedia.org/r/310782 (owner: Giuseppe Lavagetto) [11:39:33] _joe_: actually it is... [11:39:43] Operations, Performance-Team, Thumbor: djvu failure for very high page number - https://phabricator.wikimedia.org/T145616#2640237 (Gilles) [11:40:22] <_joe_> akosiaris: the puppetmaster failure? [11:40:41] Operations, Performance-Team, Thumbor: Thumbor can't load source files bigger than 100MB - https://phabricator.wikimedia.org/T145768#2640239 (Gilles) Turning HTTP_LOADER_CURL_ASYNC_HTTP_CLIENT on resulted in catastrophic failure after a short time: ``` Sep 15 11:39:56 thumbor1001 thumbor@8814[6424... [11:40:43] the replication issue on nihal [11:40:58] the puppetmaster failure... 
hmm [11:41:33] message: "Caught INT; exiting" [11:41:41] that's what that failure is [11:41:43] and it's me [11:41:45] heh... [11:41:51] Operations, Performance-Team, Thumbor: Thumbor can't load source files bigger than 100MB - https://phabricator.wikimedia.org/T145768#2640242 (Gilles) Actually, I see now that it's a single instance stuck in that state. Maybe not related to that config change I'm trying? Although it wasn't happening b... [11:42:20] (PS3) Elukey: Remove mediawiki02 from deployment prep [puppet] - https://gerrit.wikimedia.org/r/310796 (https://phabricator.wikimedia.org/T144006) [11:43:13] RECOVERY - puppet last run on mw2139 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:43:14] RECOVERY - puppet last run on puppetmaster1001 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [11:43:35] (CR) Giuseppe Lavagetto: [C: 2] role::mha: move templates to module [puppet] - https://gerrit.wikimedia.org/r/310783 (owner: Giuseppe Lavagetto) [11:43:38] Operations, DBA, MediaWiki-Page-deletion, Performance: Cannot delete two pages with large histories even having the appropriate permissions to do so - https://phabricator.wikimedia.org/T145630#2640244 (MarcoAurelio) Thank you. Is it possible to grant more limits to stewards when performing bigdel... 
[11:43:40] (PS2) Giuseppe Lavagetto: role::mha: move templates to module [puppet] - https://gerrit.wikimedia.org/r/310783 [11:43:55] (CR) Elukey: [C: 2] Remove mediawiki02 from deployment prep [puppet] - https://gerrit.wikimedia.org/r/310796 (https://phabricator.wikimedia.org/T144006) (owner: Elukey) [11:44:50] (PS2) Muehlenhoff: Also handle systemd in keyholder script [puppet] - https://gerrit.wikimedia.org/r/310793 (https://phabricator.wikimedia.org/T144578) [11:45:21] Operations, Performance-Team, Thumbor: Thumbor can't load source files bigger than 100MB - https://phabricator.wikimedia.org/T145768#2640258 (Gilles) I'm restarting the thumbor instances and waiting to see if that crazy error reappears. [11:45:43] (PS1) Alexandros Kosiaris: postgres: Fix typo in replication lag check [puppet] - https://gerrit.wikimedia.org/r/310808 [11:46:14] (CR) Muehlenhoff: "The "rsa w/o comment" is a regression in openssh 6.7, which got fixed in 6.8, I've added a comment to the commit msg. The agent status is " [puppet] - https://gerrit.wikimedia.org/r/310793 (https://phabricator.wikimedia.org/T144578) (owner: Muehlenhoff) [11:46:35] PROBLEM - puppet last run on oresrdb1002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [11:46:52] Operations, Performance-Team, Thumbor: Archive file thumbs not working - https://phabricator.wikimedia.org/T145769#2640259 (Gilles) [11:48:03] Operations, Performance-Team, Thumbor: Archive file thumbs not working - https://phabricator.wikimedia.org/T145769#2640274 (Gilles) [11:48:35] <_joe_> a few criticals might happen because of race conditions during catalog compilation, but they should go away [11:48:43] Operations, Performance-Team, Thumbor: Archive file thumbs not working - https://phabricator.wikimedia.org/T145769#2640259 (Gilles) [11:49:00] (CR) Alexandros Kosiaris: [C: 2] postgres: Fix typo in replication lag check [puppet] - https://gerrit.wikimedia.org/r/310808 (owner: Alexandros Kosiaris) [11:49:13] RECOVERY - puppet last run on oresrdb1002 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [11:49:34] (PS3) Giuseppe Lavagetto: role::mha: move templates to module [puppet] - https://gerrit.wikimedia.org/r/310783 [11:49:37] (CR) Giuseppe Lavagetto: [V: 2] role::mha: move templates to module [puppet] - https://gerrit.wikimedia.org/r/310783 (owner: Giuseppe Lavagetto) [11:51:49] Operations, Performance-Team, Thumbor: Thumbor can't load source files bigger than 100MB - https://phabricator.wikimedia.org/T145768#2640280 (Gilles) So far so good regarding the crazy error, it's not happening anymore. Maybe it had to do with restarting the thumbor instances? 
[11:52:59] (03CR) 10Giuseppe Lavagetto: [C: 032] templates: clean up templates/misc [puppet] - 10https://gerrit.wikimedia.org/r/310784 (owner: 10Giuseppe Lavagetto) [11:53:04] (03PS2) 10Giuseppe Lavagetto: templates: clean up templates/misc [puppet] - 10https://gerrit.wikimedia.org/r/310784 [11:53:40] (03CR) 10Giuseppe Lavagetto: [V: 032] templates: clean up templates/misc [puppet] - 10https://gerrit.wikimedia.org/r/310784 (owner: 10Giuseppe Lavagetto) [11:54:35] (03PS2) 10Giuseppe Lavagetto: templates: add large warning sign not to add anything anymore [puppet] - 10https://gerrit.wikimedia.org/r/310789 [11:54:37] (03CR) 10Hashar: "Cherry picked on puppetmaster. mira02 shows:" [puppet] - 10https://gerrit.wikimedia.org/r/310793 (https://phabricator.wikimedia.org/T144578) (owner: 10Muehlenhoff) [11:56:22] hashar: I'm gonna +2 it now for CI :) [11:58:04] (03PS1) 10Alexandros Kosiaris: postgres: Fix recovery.conf dependency [puppet] - 10https://gerrit.wikimedia.org/r/310811 [11:58:34] (03CR) 10Alexandros Kosiaris: [C: 032] postgres: Fix recovery.conf dependency [puppet] - 10https://gerrit.wikimedia.org/r/310811 (owner: 10Alexandros Kosiaris) [11:58:38] (03PS2) 10Alexandros Kosiaris: postgres: Fix recovery.conf dependency [puppet] - 10https://gerrit.wikimedia.org/r/310811 [11:58:41] (03CR) 10Alexandros Kosiaris: [V: 032] postgres: Fix recovery.conf dependency [puppet] - 10https://gerrit.wikimedia.org/r/310811 (owner: 10Alexandros Kosiaris) [12:00:04] addshore and hashar: Respected human, time to deploy RevisionSlider (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160915T1200). Please do the needful. [12:00:04] Addshore: A patch you scheduled for RevisionSlider is about to be deployed. Please be available during the process. 
[12:00:52] o/ [12:08:43] (03CR) 10Hashar: [C: 031] "Moritz pointed to the openssh patch https://github.com/openssh/openssh-portable/commit/1195f4cb07ef4b0405c839293c38600b3e9bdb46 which is t" [puppet] - 10https://gerrit.wikimedia.org/r/310793 (https://phabricator.wikimedia.org/T144578) (owner: 10Muehlenhoff) [12:08:48] damn dependencies [12:13:14] hashar: syncing now [12:13:19] !log addshore@tin Started scap: [[gerrit:310751|Update RevisionSlider i18n]] [12:13:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:13:37] addshore: go go go [12:18:37] 06Operations, 06Performance-Team, 10Thumbor: Extremely noisy ffmpeg errors - https://phabricator.wikimedia.org/T145612#2640339 (10Gilles) If you want to give it a try, the file is https://upload.wikimedia.org/wikipedia/commons/a/ab/Borsch_01.webm And the command is: ``` ffmpeg -ss 82 -i https://upload.wiki... [12:23:34] addshore: I am out for a few minutes. Need to grab a snack nearby [12:23:46] kk! [12:25:22] 06Operations, 06Performance-Team, 10Thumbor: Extremely noisy ffmpeg errors - https://phabricator.wikimedia.org/T145612#2640344 (10Gilles) And without surprise, specifically asking for the first frame works in production: https://upload.wikimedia.org/wikipedia/commons/thumb/a/ab/Borsch_01.webm/320px-seek=0-Bo... [12:27:58] addshore: how is the whole scap going on ? [12:28:10] on canaries now! [12:28:36] PROBLEM - puppet last run on mw2231 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [12:33:21] 06Operations, 06Performance-Team, 10Thumbor: Extremely noisy ffmpeg errors - https://phabricator.wikimedia.org/T145612#2640348 (10Gilles) [12:35:50] 06Operations, 07Wikimedia-log-errors: Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out - https://phabricator.wikimedia.org/T125735#2640353 (10hashar) [12:37:02] (03PS1) 10Alexandros Kosiaris: postgres: Introduce postgresql::dirs [puppet] - 10https://gerrit.wikimedia.org/r/310812 [12:38:41] (03PS1) 10Gilles: Enable Tornado async curl loader for Thumbor [puppet] - 10https://gerrit.wikimedia.org/r/310813 [12:41:58] 06Operations, 06Performance-Team, 10Thumbor: Thumbor can't load source files bigger than 100MB - https://phabricator.wikimedia.org/T145768#2640357 (10Gilles) [12:43:45] !log addshore@tin Finished scap: [[gerrit:310751|Update RevisionSlider i18n]] (duration: 30m 26s) [12:43:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:43:56] hashar: donme [12:44:50] just in time [12:46:33] (03PS1) 10Alexandros Kosiaris: openldap: Add a retry syncrepl parameter [puppet] - 10https://gerrit.wikimedia.org/r/310815 [12:47:00] (03CR) 10Alexandros Kosiaris: [C: 032] postgres: Introduce postgresql::dirs [puppet] - 10https://gerrit.wikimedia.org/r/310812 (owner: 10Alexandros Kosiaris) [12:47:54] hashar: im surrounded by people speaking french! [12:48:53] https://usercontent.irccloud-cdn.com/file/knWIiZOA/IMG_20160915_142137.jpg [12:50:12] 06Operations, 10DBA: Drop database table "email_capture" from Wikimedia wikis - https://phabricator.wikimedia.org/T57676#2640375 (10Marostegui) ``` db1053.eqiad.wmnet MariaDB PRODUCTION s1 localhost enwiki > rename table email_capture to TO_DROP_email_capture; Query OK, 0 rows affected (0.23 sec) db1044.eqia... 
[12:53:39] addshore: what a beautiful view :} [12:54:41] jouncebot: neilpquinn [12:54:44] jouncebot: next [12:54:44] In 0 hour(s) and 5 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160915T1300) [12:55:00] zeljkof: wanna handle the SWAT? [12:55:31] (03CR) 10Hashar: [C: 031] Update outdated comment for Wikibase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306161 (owner: 10Nikerabbit) [12:55:34] my patches are rather trivial (then again, usually bugs are hidden in changes thought to be trivial) [12:55:34] (03PS1) 10Alexandros Kosiaris: Postgres: Remove unneeded dependency [puppet] - 10https://gerrit.wikimedia.org/r/310817 [12:55:39] hashar: sure, let me check what is there [12:55:52] (03CR) 10Hashar: [C: 031] Remove $wgTranslateEC [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306178 (owner: 10Nikerabbit) [12:55:54] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Postgres: Remove unneeded dependency [puppet] - 10https://gerrit.wikimedia.org/r/310817 (owner: 10Alexandros Kosiaris) [12:56:04] zeljkof: couple straightforward changes for Nikerabbit :} [12:56:17] akosiaris: Remove unneeded dependency [puppet] [12:56:30] akosiaris: that grrrit-wm notification message ends up being confusing :} [12:57:27] hashar, Nikerabbit: ok, looking at commits in gerrit... [12:58:02] RECOVERY - Postgres Replication Lag on nihal is OK: OK - Rep Delay is: 0.0 Seconds [12:58:18] hashar ? [12:58:28] Nikerabbit: ready? [12:58:38] I even specified the puppet module ... [12:58:53] _joe_: finally nihal is up and running and the check is working way better than before [12:59:16] 06Operations, 06Release-Engineering-Team, 07HHVM: Migrate deployment servers (tin/mira) to jessie - https://phabricator.wikimedia.org/T144578#2640395 (10hashar) >>! In T144578#2637243, @AlexMonk-WMF wrote: >>>! In T144578#2637063, @MoritzMuehlenhoff wrote: >> I've added a new deployment server mira02 > > I... 
[13:00:04] hashar, Dereckson, addshore, and aude: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160915T1300). [13:00:04] Nikerabbit: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [13:00:28] 06Operations, 10ops-eqiad, 10DBA: db1082 hardware check - https://phabricator.wikimedia.org/T145607#2640397 (10Cmjohnson) performed a memtest, test came back with zero errors [13:00:30] zeljkof: yep [13:00:51] Nikerabbit: great, I can SWAT today! :D [13:01:09] Nikerabbit: is there anything to test after the patches are deployed? [13:01:30] can you check at mw1099 if anything went wrong? [13:01:33] 06Operations, 06Performance-Team, 10Thumbor: thumbor ffmpeg pipe deadlock - https://phabricator.wikimedia.org/T145626#2640401 (10Gilles) [13:01:48] 06Operations, 10ops-eqiad, 10media-storage: diagnose failed(?) sda on ms-be1022 - https://phabricator.wikimedia.org/T140597#2640402 (10Cmjohnson) an HP tech came yesterday but once here realized that HP sent him a new backplane for the front disks and not the ssds. He did add a new ssd into slot 13. Once th... [13:02:02] zeljkof: there is zero reference to the variable in Translate, so only place to see something would be in the form of undefined variable in the logs I believe [13:02:21] 06Operations, 10DBA: Investigate db1082 crash - https://phabricator.wikimedia.org/T145533#2640407 (10jcrespo) [13:02:25] Nikerabbit: ok, then checking the logs would let us know [13:03:06] if you wish I can do quick sanity check on mw1099 [13:03:25] Nikerabbit: please do, I am new to deployments :) [13:04:01] zeljkof: okay, have you placed the code there? [13:04:11] sorry, not yet [13:04:35] Nikerabbit: gerrit says cannot merge for both patches, I should rebase first? [13:05:06] yeah aren't they usually rebased before swat? 
[13:05:13] let me know if there are conflicts [13:05:37] Nikerabbit: usually does not mean a lot to me, this is my third or fourth deploy :) [13:05:51] (03PS2) 10Zfilipin: Remove $wgTranslateEC [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306178 (owner: 10Nikerabbit) [13:06:00] (03PS2) 10Zfilipin: Update outdated comment for Wikibase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306161 (owner: 10Nikerabbit) [13:06:13] rebasing... [13:07:04] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306178 (owner: 10Nikerabbit) [13:07:23] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306161 (owner: 10Nikerabbit) [13:07:32] no conflicts, +2d both patches [13:07:36] (03Merged) 10jenkins-bot: Remove $wgTranslateEC [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306178 (owner: 10Nikerabbit) [13:08:29] one will need a rebase again :D [13:08:32] 06Operations, 10Traffic, 10media-storage, 13Patch-For-Review: Certain images failing to load in ulsfo - https://phabricator.wikimedia.org/T144257#2640411 (10ema) The varnish error triggering this seems to be: ```-- FetchError straight insufficient bytes``` Full varnishlog here: https://phabricator.wiki... [13:09:09] hashar: mediawiki-confag is fast-forward only? [13:09:34] Nikerabbit: ya! 
[13:10:36] (03PS3) 10Zfilipin: Update outdated comment for Wikibase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306161 (owner: 10Nikerabbit) [13:10:48] (03CR) 10Zfilipin: "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306161 (owner: 10Nikerabbit) [13:11:01] 06Operations, 10ops-eqiad, 10hardware-requests: reclaim WMF4724 to spares - https://phabricator.wikimedia.org/T142412#2640413 (10Cmjohnson) [13:11:09] hashar: ouch, noticed that :) done [13:11:10] 06Operations, 10ops-eqiad, 10hardware-requests: reclaim WMF4724 to spares - https://phabricator.wikimedia.org/T142412#2534075 (10Cmjohnson) 05Open>03Resolved [13:11:28] zeljkof: cause when one merge, the other is no more against tip of branch [13:11:36] so what I do is that I usually cherry pick them one by one locally [13:11:39] git-review -x 1234 [13:11:50] git-review -x 1235 [13:11:51] git-review -x 1236 [13:11:56] git-review [13:12:03] hashar: yes, I have thought of that _after_ the fact :) [13:12:08] that reorder them in a dependency chain on tip of branch. 
Then CR+2 all of them [13:12:30] (03CR) 10Zfilipin: Update outdated comment for Wikibase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306161 (owner: 10Nikerabbit) [13:12:42] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306161 (owner: 10Nikerabbit) [13:13:22] (03Merged) 10jenkins-bot: Update outdated comment for Wikibase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306161 (owner: 10Nikerabbit) [13:13:43] 06Operations, 07Puppet, 13Patch-For-Review: Puppetize ircyall & set up instance appropriately - https://phabricator.wikimedia.org/T1357#2640426 (10hashar) 05Open>03declined No more used per @yuvipanda and related material is being removed from puppet ( https://gerrit.wikimedia.org/r/310692 ) [13:14:14] ok, both patches merged [13:14:33] all logs opened in tabs/terminal [13:15:32] (03PS2) 10Hashar: admin: update hashar gdbinit script [puppet] - 10https://gerrit.wikimedia.org/r/310794 [13:18:16] Nikerabbit: ok, both commits are at mw1099 [13:18:21] can you take a quick look? [13:18:42] zeljkof: un momento, just killed my browser when it froze again [13:19:14] (03PS1) 10Elukey: Add mediawiki05 to deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/310818 (https://phabricator.wikimedia.org/T144006) [13:19:30] no problem, I am looking at the logs, just in case, not that I know what to look for :) [13:19:35] 06Operations, 10Traffic, 13Patch-For-Review: varnish backends start returning 503s after ~6 days uptime - https://phabricator.wikimedia.org/T145661#2640435 (10ema) cp4007 has been affected by this issue yesterday 2016-09-14 between ~21:05 and ~22:30 and again today 2016-09-15 between ~10:55 and ~11:46. {F44... [13:20:16] Nikerabbit: ok, ready to blast the patches across the known universe?
[13:20:35] zeljkof: yeah, feel free to expand to the unknown as well [13:20:46] (03PS3) 10Gehel: graphite - add tests to configparser_format [puppet] - 10https://gerrit.wikimedia.org/r/309013 [13:21:01] Nikerabbit: will do, black matter, black energy, black beer... [13:21:24] (03PS1) 10Filippo Giunchedi: [WIP] prometheus: generate varnish targets from conftool [puppet] - 10https://gerrit.wikimedia.org/r/310819 [13:21:38] zeljkof: (thumbsup) [13:22:23] RECOVERY - puppet last run on mw2231 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [13:22:25] (03CR) 10Gehel: [C: 032] graphite - add tests to configparser_format [puppet] - 10https://gerrit.wikimedia.org/r/309013 (owner: 10Gehel) [13:23:16] (03PS3) 10Gehel: wdqs - send notifications of WDQS lag also to wdqs-admins group [puppet] - 10https://gerrit.wikimedia.org/r/310748 (https://phabricator.wikimedia.org/T144948) [13:24:55] (03CR) 10Gehel: [C: 032] wdqs - send notifications of WDQS lag also to wdqs-admins group [puppet] - 10https://gerrit.wikimedia.org/r/310748 (https://phabricator.wikimedia.org/T144948) (owner: 10Gehel) [13:25:44] !log zfilipin@tin Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:306178|Remove $wgTranslateEC]] (duration: 00m 48s) [13:25:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:26:03] https://gerrit.wikimedia.org/r/#/c/306178/ deployed [13:26:14] no errors [13:26:51] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: some icinga checks on WDQS do not send notifications - https://phabricator.wikimedia.org/T144948#2615203 (10Gehel) 05Open>03Resolved the appropriate contact group has been added to the WDQS lag check. 
[13:28:10] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:306161|Update outdated comment for Wikibase]] (duration: 00m 48s) [13:28:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:28:16] (03CR) 10Hashar: [C: 031] Add mediawiki05 to deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/310818 (https://phabricator.wikimedia.org/T144006) (owner: 10Elukey) [13:28:22] https://gerrit.wikimedia.org/r/#/c/306161/ deployed [13:28:24] no errors [13:28:37] no scap errors, I mean [13:28:49] 06Operations, 10ops-eqiad, 10DBA: db1082 hardware check - https://phabricator.wikimedia.org/T145607#2640460 (10Marostegui) Thanks Chris - I will close this ticket and we will keep updating the upstream. [13:29:11] Nikerabbit: ok, can you check the real thing(tm)? [13:29:23] if all is fine, eu swat is done [13:29:29] hashar: ^ [13:29:53] (03PS2) 10Elukey: Add mediawiki05 to deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/310818 (https://phabricator.wikimedia.org/T144006) [13:30:00] zeljkof: congratulations :} [13:30:20] zeljkof: the real thing feels warm and fuzzy [13:30:29] hashar: uh, my first swat done alone, hopefully nothing broken beyond repair [13:30:41] Nikerabbit: great news! [13:30:49] checking deployments page one more time... [13:30:56] zeljkof: next thing will be riding the mw train [13:31:04] hashar: sure... [13:31:06] and I might ask to get it done during Europe business hours [13:31:15] instead of 9pm cest [13:31:22] no new patches for eu swat [13:31:28] !log EU SWAT done! 
[13:31:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:31:43] thank you morebots [13:33:04] (03CR) 10Mobrovac: [C: 031] Add mediawiki05 to deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/310818 (https://phabricator.wikimedia.org/T144006) (owner: 10Elukey) [13:33:08] (03CR) 10Elukey: [C: 032] Add mediawiki05 to deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/310818 (https://phabricator.wikimedia.org/T144006) (owner: 10Elukey) [13:35:34] 07Puppet, 10Beta-Cluster-Infrastructure: Puppet runs fails randomly on deployment-prep / beta cluster hosts - https://phabricator.wikimedia.org/T145631#2640466 (10hashar) Did a grep of `error` on all /var/log/puppet.log `root@deployment-salt02:~# salt -v '*' cmd.run 'grep -i error /var/log/puppet.log'` **dep... [13:36:10] PROBLEM - puppet last run on db2019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:40:31] 06Operations, 10DBA: Investigate db1082 crash - https://phabricator.wikimedia.org/T145533#2640476 (10Marostegui) After the memtest (no errors found) the server is back and catching up with the master. Once it caught up, we will pool it back and slowly give it some weight in the LB. [13:46:50] (03PS2) 10Andrew Bogott: Puppet Panel: Actually populate the prefix panel with prefixes. 
[puppet] - 10https://gerrit.wikimedia.org/r/310718 (https://phabricator.wikimedia.org/T91990) [13:50:29] PROBLEM - Disk space on thumbor1001 is CRITICAL: DISK CRITICAL - free space: / 1467 MB (3% inode=97%) [13:52:31] (03PS3) 10Muehlenhoff: Also handle systemd in keyholder script [puppet] - 10https://gerrit.wikimedia.org/r/310793 (https://phabricator.wikimedia.org/T144578) [13:52:57] just when I thought I had mastered git subcommands and options [13:53:20] I found git-extras moaaar commands to learn about https://github.com/tj/git-extras (9mins video showcase: https://vimeo.com/45506445 ) [13:54:43] godog: ^^^ 38G /var/log/thumbor/thumbor.log [13:55:12] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.107, port=7232): Max retries exceeded with url: /analytics.wikimedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [13:55:29] PROBLEM - AQS root url on aqs1004 is CRITICAL: Connection refused [13:55:41] full of tornado's tracebacks [13:56:18] this is me --^ [13:56:24] not in production, we are testing code [13:56:29] elukey: AQS? [13:56:36] yep [13:56:46] I'm more worried about thumbor1001 [13:56:53] do you know anything about it?
[13:58:22] volans: thanks, I'm in a meeting with gilles and will take a look shortly [13:58:41] godog: ok, / is already full FYI [13:59:52] ack, FWIW thumbor results are not yet served to users [14:00:21] right [14:02:49] RECOVERY - Disk space on thumbor1001 is OK: DISK OK [14:03:08] !log empty big log file on thumbor1001 /var/log/thumbor/thumbor.log [14:03:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:04:05] (03Abandoned) 10Gilles: Enable Tornado async curl loader for Thumbor [puppet] - 10https://gerrit.wikimedia.org/r/310813 (owner: 10Gilles) [14:05:51] RECOVERY - puppet last run on db2019 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [14:09:22] PROBLEM - puppet last run on wtp2006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:11:41] PROBLEM - mediawiki-installation DSH group on mw1250 is CRITICAL: Host mw1250 is not in mediawiki-installation dsh group [14:12:12] (03CR) 10Muehlenhoff: [C: 032] Also handle systemd in keyholder script [puppet] - 10https://gerrit.wikimedia.org/r/310793 (https://phabricator.wikimedia.org/T144578) (owner: 10Muehlenhoff) [14:14:09] PROBLEM - puppet last run on mw2119 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:18:22] RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy [14:19:26] fixing mw1250 [14:19:30] RECOVERY - AQS root url on aqs1004 is OK: HTTP OK: HTTP/1.1 200 - 727 bytes in 0.010 second response time [14:22:34] (03PS3) 10Andrew Bogott: Puppet Panel: Actually populate the prefix panel with prefixes. [puppet] - 10https://gerrit.wikimedia.org/r/310718 (https://phabricator.wikimedia.org/T91990) [14:23:17] 06Operations, 06Performance-Team, 10Thumbor: Extremely noisy ffmpeg errors - https://phabricator.wikimedia.org/T145612#2640577 (10MoritzMuehlenhoff) Ok, I'll test that with libvpx from backports tomorrow. 
[14:23:20] (03PS4) 10Andrew Bogott: Puppet Panel: Actually populate the prefix panel with prefixes. [puppet] - 10https://gerrit.wikimedia.org/r/310718 (https://phabricator.wikimedia.org/T91990) [14:24:02] (03PS1) 10Giuseppe Lavagetto: puppetmaster: throw away reports [puppet] - 10https://gerrit.wikimedia.org/r/310822 [14:24:27] <_joe_> volans: ^^ [14:24:39] (03CR) 10Andrew Bogott: [C: 032] Puppet Panel: Actually populate the prefix panel with prefixes. [puppet] - 10https://gerrit.wikimedia.org/r/310718 (https://phabricator.wikimedia.org/T91990) (owner: 10Andrew Bogott) [14:26:05] (03PS2) 10Giuseppe Lavagetto: puppetmaster: throw away reports [puppet] - 10https://gerrit.wikimedia.org/r/310822 [14:26:11] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] puppetmaster: throw away reports [puppet] - 10https://gerrit.wikimedia.org/r/310822 (owner: 10Giuseppe Lavagetto) [14:30:52] <_joe_> !log removing old reports from the puppet directory [14:30:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:32:17] (03CR) 10Giuseppe Lavagetto: [C: 032] templates: purge templates/mysql, unused [puppet] - 10https://gerrit.wikimedia.org/r/310785 (owner: 10Giuseppe Lavagetto) [14:32:24] (03PS2) 10Giuseppe Lavagetto: templates: purge templates/mysql, unused [puppet] - 10https://gerrit.wikimedia.org/r/310785 [14:32:28] (03CR) 10Giuseppe Lavagetto: [V: 032] templates: purge templates/mysql, unused [puppet] - 10https://gerrit.wikimedia.org/r/310785 (owner: 10Giuseppe Lavagetto) [14:34:50] RECOVERY - puppet last run on wtp2006 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:35:02] (03PS2) 10Muehlenhoff: Add mira02 to dsh groups for labs [puppet] - 10https://gerrit.wikimedia.org/r/310752 (https://phabricator.wikimedia.org/T144578) [14:35:32] (03PS2) 10Giuseppe Lavagetto: templates: move squid3 to role module [puppet] - 10https://gerrit.wikimedia.org/r/310786 [14:35:57] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] templates: 
move squid3 to role module [puppet] - 10https://gerrit.wikimedia.org/r/310786 (owner: 10Giuseppe Lavagetto) [14:36:34] (03PS2) 10Giuseppe Lavagetto: udp2log: move templates to the proper position [puppet] - 10https://gerrit.wikimedia.org/r/310787 [14:37:20] (03PS3) 10Muehlenhoff: Add mira02 to dsh groups for labs [puppet] - 10https://gerrit.wikimedia.org/r/310752 (https://phabricator.wikimedia.org/T144578) [14:39:00] (03CR) 10Andrew Bogott: [C: 032] labtest hiera: use labtestwikitech, not wikitech [puppet] - 10https://gerrit.wikimedia.org/r/309706 (owner: 10Alex Monk) [14:39:04] (03PS2) 10Andrew Bogott: labtest hiera: use labtestwikitech, not wikitech [puppet] - 10https://gerrit.wikimedia.org/r/309706 (owner: 10Alex Monk) [14:39:23] (03CR) 10Hashar: [C: 031] Add mira02 to dsh groups for labs [puppet] - 10https://gerrit.wikimedia.org/r/310752 (https://phabricator.wikimedia.org/T144578) (owner: 10Muehlenhoff) [14:39:33] (03CR) 10Muehlenhoff: [C: 032] Add mira02 to dsh groups for labs [puppet] - 10https://gerrit.wikimedia.org/r/310752 (https://phabricator.wikimedia.org/T144578) (owner: 10Muehlenhoff) [14:39:38] RECOVERY - puppet last run on mw2119 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:40:27] (03CR) 10Giuseppe Lavagetto: [C: 032] udp2log: move templates to the proper position [puppet] - 10https://gerrit.wikimedia.org/r/310787 (owner: 10Giuseppe Lavagetto) [14:40:37] (03PS3) 10Giuseppe Lavagetto: udp2log: move templates to the proper position [puppet] - 10https://gerrit.wikimedia.org/r/310787 [14:41:01] (03CR) 10Giuseppe Lavagetto: [V: 032] udp2log: move templates to the proper position [puppet] - 10https://gerrit.wikimedia.org/r/310787 (owner: 10Giuseppe Lavagetto) [14:41:41] (03PS3) 10Andrew Bogott: labtest hiera: use labtestwikitech, not wikitech [puppet] - 10https://gerrit.wikimedia.org/r/309706 (owner: 10Alex Monk) [14:43:24] <_joe_> ottomata: the kafka and confluent kafka classes raise all sorts of warnings and errors 
[14:43:33] (03PS1) 10Cmjohnson: Removing mgmt dns entries for server argon and wmf4074 for decommissioning. [dns] - 10https://gerrit.wikimedia.org/r/310825 [14:43:34] <_joe_> ottomata: please search those in the syslog of the puppetmaster [14:43:38] 06Operations, 06Performance-Team, 10Thumbor: Archive file thumbs not working - https://phabricator.wikimedia.org/T145769#2640664 (10Gilles) a:05fgiunchedi>03Gilles [14:43:39] <_joe_> and fix them [14:44:03] (03PS2) 10Giuseppe Lavagetto: url_downloader: move templates to role module [puppet] - 10https://gerrit.wikimedia.org/r/310788 [14:44:26] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] url_downloader: move templates to role module [puppet] - 10https://gerrit.wikimedia.org/r/310788 (owner: 10Giuseppe Lavagetto) [14:46:08] (03PS3) 10Giuseppe Lavagetto: templates: add large warning sign not to add anything anymore [puppet] - 10https://gerrit.wikimedia.org/r/310789 [14:46:15] 06Operations, 06Performance-Team, 10Thumbor: Separate 404s into their own log - https://phabricator.wikimedia.org/T145632#2640693 (10Gilles) a:05Gilles>03fgiunchedi TODO: lower log level of swift 404 itself [14:46:47] _joe_: ok [14:47:49] all sorts _joe_? [14:47:55] (03PS4) 10Giuseppe Lavagetto: templates: add large warning sign not to add anything anymore [puppet] - 10https://gerrit.wikimedia.org/r/310789 [14:48:19] <_joe_> ottomata: different sorts :P [14:48:42] i see 2 i think, ok! [14:48:52] puppet escapes \ in "" strings? 
[14:48:58] (03CR) 10Jcrespo: [V: 032] Don't source a custom mysql configuration from /srv/sqldata/my.conf [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/310530 (https://phabricator.wikimedia.org/T145378) (owner: 10Muehlenhoff) [14:49:16] (03CR) 10Jcrespo: [C: 032 V: 032] Drop the malloc wrapper from mysqld_safe [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/310529 (https://phabricator.wikimedia.org/T145378) (owner: 10Muehlenhoff) [14:49:22] (03PS1) 10Ottomata: Fix puppet warnings in confluent module [puppet] - 10https://gerrit.wikimedia.org/r/310826 [14:49:39] 06Operations, 06Performance-Team, 10Thumbor: Report more metrics with statsd - https://phabricator.wikimedia.org/T145784#2640701 (10Gilles) [14:49:59] <_joe_> there are both incorrect templates using 'variable' instead of '@variable' [14:50:03] <_joe_> and some more esoteric [14:50:16] <_joe_> all non-fatals, but still [14:50:50] (03CR) 10Cmjohnson: [C: 032] Removing mgmt dns entries for server argon and wmf4074 for decommissioning. 
[dns] - 10https://gerrit.wikimedia.org/r/310825 (owner: 10Cmjohnson) [14:51:01] (03PS5) 10Giuseppe Lavagetto: templates: add large warning sign not to add anything anymore [puppet] - 10https://gerrit.wikimedia.org/r/310789 [14:52:58] (03PS1) 10Hashar: beta: import scap_masters list from wikitech [puppet] - 10https://gerrit.wikimedia.org/r/310827 [14:53:04] (03CR) 10Giuseppe Lavagetto: [C: 032] templates: add large warning sign not to add anything anymore [puppet] - 10https://gerrit.wikimedia.org/r/310789 (owner: 10Giuseppe Lavagetto) [14:53:29] (03PS2) 10Hashar: beta: import scap_masters list from wikitech [puppet] - 10https://gerrit.wikimedia.org/r/310827 [14:53:46] (03PS3) 10Jcrespo: mariadb: control db1082's mysqld_safe file with puppet [puppet] - 10https://gerrit.wikimedia.org/r/310564 (https://phabricator.wikimedia.org/T145378) [14:54:09] _joe_, we use this to generate backend configs in varnish: name = 'be_' + backend.gsub(/[-.]/, '_') [14:54:43] (03PS4) 10Andrew Bogott: labtest hiera: use labtestwikitech, not wikitech [puppet] - 10https://gerrit.wikimedia.org/r/309706 (owner: 10Alex Monk) [14:54:52] but to refer to them, directors.vcl.tpl.erb uses this: [14:54:54] .backend = be_{{ $parts := split $node "." 
}}{{ join $parts "_" }}; [14:55:14] which doesn't account for hyphens [14:56:17] (03PS4) 10Jcrespo: mariadb: control db1082's mysqld_safe file with puppet [puppet] - 10https://gerrit.wikimedia.org/r/310564 (https://phabricator.wikimedia.org/T145378) [14:56:18] <_joe_> Krenair: that was good enough in production, of course [14:56:31] yeah, where we happen to have none with hyphens [14:56:34] <_joe_> it's go text/template, pretty awful [14:57:01] !log deployed new-aqs-cluster branch (--rev new-aqs-cluster) to aqs100[456] (new AQS cluster not serving live traffic) [14:57:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:57:09] split is a wrapper for go's strings.Split [14:57:51] 06Operations, 06Performance-Team, 10Thumbor: Thumbor can't load source files bigger than 100MB - https://phabricator.wikimedia.org/T145768#2640727 (10Gilles) Trying to fix that mysterious epic failure is probably a fool's errand, as it lies in the bowels of Tornado. I think that a better idea could be to wr... [14:57:51] http://stackoverflow.com/a/5120763/1306662 makes it sound like if we had a wrapper for strings.FieldsFunc we could make it work, though we might also need an extra function to use that? 
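[editor's note] The naming mismatch discussed above (backend names generated with gsub(/[-.]/, '_') but referenced via split on "." and join with "_") can be illustrated with a rough Python sketch. This is only a model of the two expressions quoted in the log, not the actual puppet or confd code; the hostnames are made up, and the chained split/join variant is a hypothetical workaround that has not been checked against confd 0.9.0.

```python
import re


def puppet_backend_name(backend: str) -> str:
    # Models the Ruby expression quoted in the log:
    #   'be_' + backend.gsub(/[-.]/, '_')
    return "be_" + re.sub(r"[-.]", "_", backend)


def confd_backend_name(node: str) -> str:
    # Models the template quoted in the log: split on "." and join with "_".
    # Hyphens are left untouched, which is the reported problem.
    return "be_" + "_".join(node.split("."))


def confd_backend_name_chained(node: str) -> str:
    # Hypothetical workaround (an assumption, not from the log): run a second
    # split/join pass for "-" so that no "replace" function is needed.
    dotted = "_".join(node.split("."))
    return "be_" + "_".join(dotted.split("-"))


if __name__ == "__main__":
    # Made-up hostnames: one without hyphens, one with.
    for host in ("cp1.example.org", "cache-text1.example.org"):
        print(host)
        print("  generated :", puppet_backend_name(host))
        print("  referenced:", confd_backend_name(host))
        print("  chained   :", confd_backend_name_chained(host))
```

For a hostname without hyphens all three functions agree, which matches the remark that this was good enough in production. For a hyphenated hostname the plain split/join reference no longer matches the generated name, while the chained variant does.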
[14:57:53] (03CR) 10Hashar: [C: 031] "Dropped from wikitech https://wikitech.wikimedia.org/w/index.php?title=Hiera:Deployment-prep&diff=839131&oldid=839062" [puppet] - 10https://gerrit.wikimedia.org/r/310827 (owner: 10Hashar) [14:58:33] <_joe_> Krenair: you can't really add functions to that template that is being used by confd [14:58:35] 06Operations, 10ops-eqiad: decom antimony (datacenter) - https://phabricator.wikimedia.org/T138978#2640736 (10Cmjohnson) 05Open>03Resolved a:03Cmjohnson disk wiped, mgmt dns removed, removed from switch, racktables updated, added to google doc [14:58:54] <_joe_> I mean embed other functions [14:59:28] right, thought that'd be one of the problems [14:59:43] <_joe_> !log starting a noop run on all nodes to puppetmaster2001 to test puppetdb [14:59:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:00:00] 06Operations, 06Performance-Team, 10Thumbor: Make the 100MB+ test files downloaded from their source instead of being in the git repo - https://phabricator.wikimedia.org/T145785#2640745 (10Gilles) [15:00:10] hm, I have another idea [15:00:16] (03CR) 10Marostegui: [C: 031] mariadb: control db1082's mysqld_safe file with puppet [puppet] - 10https://gerrit.wikimedia.org/r/310564 (https://phabricator.wikimedia.org/T145378) (owner: 10Jcrespo) [15:00:54] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: thumbor memory limits for main process and subprocesses - https://phabricator.wikimedia.org/T145623#2640765 (10fgiunchedi) [15:01:25] (03CR) 10Andrew Bogott: "I don't know what these are/were... are they currently unused?" 
[puppet] - 10https://gerrit.wikimedia.org/r/309705 (owner: 10Alex Monk) [15:02:01] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: thumbor memory limits for main process and subprocesses - https://phabricator.wikimedia.org/T145623#2636188 (10fgiunchedi) [15:03:38] 06Operations, 13Patch-For-Review: decom argon - https://phabricator.wikimedia.org/T134223#2640774 (10Cmjohnson) [15:03:40] 06Operations, 10ops-eqiad: decom argon (datacenter) - https://phabricator.wikimedia.org/T134826#2640771 (10Cmjohnson) 05Open>03Resolved a:03Cmjohnson server dns entries removed, wiped, removed from rack, switch updated, google doc updated and racktables updated. [15:04:12] template: _etc_varnish_directors.frontend.vcl.tmpl:7: function "replace" not defined [15:04:17] (03PS3) 10Muehlenhoff: beta: import scap_masters list from wikitech [puppet] - 10https://gerrit.wikimedia.org/r/310827 (owner: 10Hashar) [15:04:22] but... https://github.com/kelseyhightower/confd/blob/master/docs/templates.md#replace [15:05:45] <_joe_> Krenair: we're not running the newest version atm [15:06:04] we have 0.9.0-2~ubuntu0 [15:06:32] <_joe_> so yes, that would've been a silver bullet [15:06:34] <_joe_> https://github.com/kelseyhightower/confd/blob/v0.9.0/docs/templates.md [15:06:52] <_joe_> Krenair: I should just build a newer version and ship it [15:07:09] do you think diff could get a worse output? 
https://gerrit.wikimedia.org/r/#/c/310564/4/manifests/site.pp [15:07:09] (03CR) 10Filippo Giunchedi: "This is an untested WIP, also "live" data from conftool isn't strictly needed since the pooled status doesn't matter (IOW we want metrics " [puppet] - 10https://gerrit.wikimedia.org/r/310819 (owner: 10Filippo Giunchedi) [15:08:17] (03PS5) 10Jcrespo: mariadb: control db1082's mysqld_safe file with puppet [puppet] - 10https://gerrit.wikimedia.org/r/310564 (https://phabricator.wikimedia.org/T145378) [15:09:01] looks like support for replace was added in 27056b9389519e9f1ebf7244f2825a8e008082d6 [15:09:06] Date: Tue Jun 23 13:26:21 2015 +0200 [15:09:08] godog: any idea why some labs instance on beta have the puppet class role::prometheus::node_exporter ? [15:09:20] godog: eg on mira.deployment-prep.eqiad.wmflabs [15:09:25] so would've most likely been part of 0.11.0 [15:09:57] yep, it's listed in the notes for that [15:11:11] (03CR) 10Jcrespo: [C: 032] mariadb: control db1082's mysqld_safe file with puppet [puppet] - 10https://gerrit.wikimedia.org/r/310564 (https://phabricator.wikimedia.org/T145378) (owner: 10Jcrespo) [15:12:24] hashar: yeah that's part of "monitoring beta with prometheus" in T144502, ideally blanket-applied to all beta instances tho [15:12:24] T144502: deploy prometheus node_exporter and server to deployment-prep - https://phabricator.wikimedia.org/T144502 [15:12:27] godog: I went ahead and added the prometheus class on mira02 :) [15:12:36] heh they added it to solve pretty much the same problem as us: https://github.com/kelseyhightower/confd/issues/305 [15:12:50] godog: I think you can get a class applied on all instance of a project via hiera. Not sure which hiera key though :( [15:13:19] RECOVERY - mediawiki-installation DSH group on mw1250 is OK: OK [15:14:00] <_joe_> Krenair: yes, but I got creative instead [15:14:15] <_joe_> :P [15:14:30] with using split/join for '.'? 
:) [15:14:49] <_joe_> yes [15:15:58] between submodules and increased number of puppet masters my last deploy took a couple of minutes [15:16:17] <_joe_> jynus: if you mean puppet-merge [15:16:24] yes [15:16:26] <_joe_> it will be better once akosiaris is done merging his changes [15:16:34] oh, I am not complaining at all [15:16:43] there were several patches at once [15:16:53] good timezone all :) [15:17:45] good evening brion [15:18:11] * greg-g waves [15:18:56] something tells me thay my plan to get less involved with mediawiki architecture is not going to be very successful [15:19:16] and I am going to blame me for that [15:19:57] :-) [15:19:57] (03CR) 10Alex Monk: "Looks like they have been since I51c387fb45b0be65da8065496f5c7bb8a9c2b195" [puppet] - 10https://gerrit.wikimedia.org/r/309705 (owner: 10Alex Monk) [15:20:02] haha [15:20:15] jynus: the curse of having expertise and caring too much :D [15:20:16] everyone, please refrain for the next 10 mins from merge puppet patches please, I am updating puppet-merge [15:20:27] brion, more the second than the first [15:21:00] i am totally buying that 'sql antipatterns' book, it looks great [15:21:04] i wish i'd had it in 2001 ;) [15:21:04] (03PS8) 10Alexandros Kosiaris: puppetmaster: Make puppet-merge a template [puppet] - 10https://gerrit.wikimedia.org/r/310302 [15:21:08] * brion time-travels [15:21:10] ha ha [15:21:12] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] puppetmaster: Make puppet-merge a template [puppet] - 10https://gerrit.wikimedia.org/r/310302 (owner: 10Alexandros Kosiaris) [15:21:23] (03PS8) 10Alexandros Kosiaris: puppetmaster: Delete the post-merge hooks [puppet] - 10https://gerrit.wikimedia.org/r/310303 [15:21:28] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] puppetmaster: Delete the post-merge hooks [puppet] - 10https://gerrit.wikimedia.org/r/310303 (owner: 10Alexandros Kosiaris) [15:21:40] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] puppetmaster: Change the backend forced ssh command [puppet] - 
10https://gerrit.wikimedia.org/r/310304 (owner: 10Alexandros Kosiaris) [15:21:46] (03PS9) 10Alexandros Kosiaris: puppetmaster: Change the backend forced ssh command [puppet] - 10https://gerrit.wikimedia.org/r/310304 [15:21:48] (03CR) 10Alexandros Kosiaris: [V: 032] puppetmaster: Change the backend forced ssh command [puppet] - 10https://gerrit.wikimedia.org/r/310304 (owner: 10Alexandros Kosiaris) [15:21:58] (03PS4) 10Alexandros Kosiaris: puppetmaster: Add --quiet option to puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/310515 [15:22:02] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] puppetmaster: Add --quiet option to puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/310515 (owner: 10Alexandros Kosiaris) [15:22:31] jynus: lemme know if you want to chat schema stuff in channel or out, i don't want to spam everybody if y'all are busy with other stuff :D [15:22:43] well, definitely not here [15:22:47] k [15:22:56] join if any #wikimedia-databases [15:23:03] it is still public and logged [15:23:07] but probably more on topic [15:23:15] ok joined :) [15:23:36] brion, give 20 minutes, though [15:23:47] as I was in the middle of an important deployment [15:23:54] will do! [15:23:55] I want to check everything is ok [15:23:58] :D [15:24:23] well, not important, but potentially impacting [15:24:26] ok i'll poke my software updates on various machines in the meantime.
ping me when ready [15:24:53] and in fact, I broke things [15:26:02] (03PS1) 10Elukey: Add the new aqs nodes to conftool-data [puppet] - 10https://gerrit.wikimedia.org/r/310831 (https://phabricator.wikimedia.org/T124314) [15:26:30] RECOVERY - puppet last run on labsdb1007 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [15:26:55] (03CR) 10Alexandros Kosiaris: [C: 032] osm: Also allow postgres connection from postgres slave [puppet] - 10https://gerrit.wikimedia.org/r/310759 (owner: 10Alexandros Kosiaris) [15:26:56] I am stupid, as usual [15:27:00] (03PS2) 10Alexandros Kosiaris: osm: Also allow postgres connection from postgres slave [puppet] - 10https://gerrit.wikimedia.org/r/310759 [15:27:02] (03CR) 10Alexandros Kosiaris: [V: 032] osm: Also allow postgres connection from postgres slave [puppet] - 10https://gerrit.wikimedia.org/r/310759 (owner: 10Alexandros Kosiaris) [15:27:39] RECOVERY - puppet last run on ms-be1021 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:27:55] (03PS2) 10Elukey: Add the new aqs nodes to conftool-data [puppet] - 10https://gerrit.wikimedia.org/r/310831 (https://phabricator.wikimedia.org/T124314) [15:28:49] (03PS1) 10Jcrespo: mysqld_safe: Fix typo when referencing a modules file [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/310834 [15:29:11] (03CR) 10Jcrespo: [C: 032] mysqld_safe: Fix typo when referencing a modules file [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/310834 (owner: 10Jcrespo) [15:29:39] (03PS1) 10Filippo Giunchedi: monitoring: validate check_prometheus args [puppet] - 10https://gerrit.wikimedia.org/r/310835 [15:29:40] PROBLEM - puppet last run on mw1185 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/etc/apache2/sites-available/06-secure-wikimedia.conf] [15:31:00] (03PS1) 10Jcrespo: mariadb: fix typo when referencing module file [puppet] - 10https://gerrit.wikimedia.org/r/310836 [15:31:26] (03CR) 10Jcrespo: [C: 032 V: 032] mariadb: fix typo when referencing module file [puppet] - 10https://gerrit.wikimedia.org/r/310836 (owner: 10Jcrespo) [15:33:21] (03PS1) 10Alexandros Kosiaris: puppet-merge: Remove forced exit 0 [puppet] - 10https://gerrit.wikimedia.org/r/310837 [15:34:26] akosiaris, could it be that your recent changes could break submodules deploys? [15:34:37] <_joe_> jynus: why? [15:34:49] well, it says deployed on pupet merge [15:34:56] but the node doesn't see the change [15:35:07] and the previous change was more verbose [15:35:17] <_joe_> the latter is expected [15:35:18] regarding submodule hook application [15:35:30] jynus: no but please do wait before merging [15:35:52] oh, I just merged [15:36:53] jynus: don't expect it to work then [15:36:59] still got a bug [15:37:01] too late :-) [15:37:18] I am checking the state of the submodule on git [15:37:38] yeah, it has still the old one [15:37:41] I will wait [15:37:43] <_joe_> jynus: it's out of sync [15:37:55] expected... 
I did ask everyone not to merge [15:37:56] for your ok to do something [15:38:00] sorry, akosiaris [15:38:03] I did not see it [15:38:08] I apologice [15:38:22] no worries, unless it's possible it will break something, let it be for now [15:38:28] I am almost done fixing the bug [15:38:45] I just will wait for you to finish [15:39:28] (03CR) 10Ottomata: "Looking fine https://puppet-compiler.wmflabs.org/4094/" [puppet] - 10https://gerrit.wikimedia.org/r/310826 (owner: 10Ottomata) [15:39:30] (03CR) 10Ottomata: [C: 032] Fix puppet warnings in confluent module [puppet] - 10https://gerrit.wikimedia.org/r/310826 (owner: 10Ottomata) [15:39:34] (03PS2) 10Ottomata: Fix puppet warnings in confluent module [puppet] - 10https://gerrit.wikimedia.org/r/310826 [15:39:36] (03CR) 10Ottomata: [V: 032] Fix puppet warnings in confluent module [puppet] - 10https://gerrit.wikimedia.org/r/310826 (owner: 10Ottomata) [15:39:58] I am not the only one that didn't see it :-) [15:40:45] <_joe_> he's not on irc either [15:40:49] <_joe_> wtf [15:40:52] <_joe_> seriously [15:41:19] (03PS2) 10Alexandros Kosiaris: puppet-merge: Remove forced exit 0 and fix typo [puppet] - 10https://gerrit.wikimedia.org/r/310837 [15:41:31] (03PS3) 10Alexandros Kosiaris: puppet-merge: Remove forced exit 0 and fix typo [puppet] - 10https://gerrit.wikimedia.org/r/310837 [15:41:33] (03PS3) 10Elukey: Add the new aqs nodes to conftool-data [puppet] - 10https://gerrit.wikimedia.org/r/310831 (https://phabricator.wikimedia.org/T144497) [15:41:36] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] puppet-merge: Remove forced exit 0 and fix typo [puppet] - 10https://gerrit.wikimedia.org/r/310837 (owner: 10Alexandros Kosiaris) [15:42:26] akosiaris: I'm going to deploy cxserver, Okay to go (Sorry, don't have context on what's issue here). [15:42:33] (?) [15:42:42] kart_: yeah [15:42:47] akosiaris: thanks! 
[15:43:28] 06Operations, 10LDAP-Access-Requests, 06TCB-Team, 06WMDE-Analytics-Engineering, and 2 others: Update wmde LDAP group - https://phabricator.wikimedia.org/T145384#2640883 (10Abraham) @ArielGlenn I confirm and give the official OK from my side. Thanks. [15:43:30] (03PS1) 10Giuseppe Lavagetto: exim: move templates to the role module [puppet] - 10https://gerrit.wikimedia.org/r/310838 [15:43:37] <_joe_> kart_: it's just puppet-merging that should be avoided [15:43:41] Hmm, there seems to be that issue again where deleted articles are in RC with the content [15:44:47] I saw it in Special:RecentChanges for at least 2 page refreshes [15:45:17] https://phabricator.wikimedia.org/T137280 hmm [15:45:42] That's because those RC entries are deleted at the end of the request and it sometimes take a long time for it to complete [15:45:49] _joe_: okay. [15:45:49] so if you happen to go to RC before it is complete, you see those entries [15:45:55] (03PS1) 10Alexandros Kosiaris: package_builder: Update README [puppet] - 10https://gerrit.wikimedia.org/r/310839 [15:45:55] !log Update cxserver to a1949e9 [15:45:57] Ah hey, Glaisher. [15:46:00] :D [15:46:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:46:05] (03PS2) 10Alexandros Kosiaris: package_builder: Update README [puppet] - 10https://gerrit.wikimedia.org/r/310839 [15:46:06] _joe_: Puppet SWAT is Okay? [15:46:09] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] package_builder: Update README [puppet] - 10https://gerrit.wikimedia.org/r/310839 (owner: 10Alexandros Kosiaris) [15:46:10] Yeah, it happened on Simple when I deleted that page. [15:46:29] Wasn't exactly immediate.. 
:S [15:47:33] 06Operations, 10Beta-Cluster-Infrastructure, 05Prometheus-metrics-monitoring: deploy prometheus node_exporter and server to deployment-prep - https://phabricator.wikimedia.org/T144502#2640898 (10fgiunchedi) It'd be nice to have `role::prometheus::node_exporter` applied blanket to all of deployment-prep, I se... [15:48:41] Was there a database change, Glaisher? [15:48:43] (03PS1) 10Muehlenhoff: Bump version for jessie rebuild [software/deployment/trebuchet-trigger] - 10https://gerrit.wikimedia.org/r/310841 [15:48:51] Since June, I mean... [15:49:07] 06Operations, 10Beta-Cluster-Infrastructure, 05Prometheus-metrics-monitoring: deploy prometheus node_exporter and server to deployment-prep - https://phabricator.wikimedia.org/T144502#2601885 (10AlexMonk-WMF) pretty sure it is, yeah [15:50:23] <_joe_> kart_: I don't think it is, no [15:50:36] <_joe_> akosiaris: is puppet-merge ok now? [15:50:41] no, not yet [15:51:16] Bsadowski1: As I already said, it sometimes takes a long time to complete [15:51:35] So it was due to lag? [15:51:43] meh :P [15:52:13] _joe_: ok. Lets reschedule my change then. [15:53:08] Bsadowski1: So after you finish doing a deletion and see the success output, lots of things are still happening on the server-side and RC cleanup is only one of those tasks so it's not immediate [15:55:15] I see. :) [15:55:40] RECOVERY - puppet last run on mw1185 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:55:53] !log uploaded trebuchet-trigger 0.5.6-1~jessie1 to carbon (no change rebuild for jessie) [15:55:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:56:00] 06Operations, 10Ops-Access-Requests: Request for access to stat1003 for Sam Walton - https://phabricator.wikimedia.org/T145788#2640942 (10Samwalton9) [15:56:21] PROBLEM - puppet last run on ms-be1021 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[parted-/dev/sdh] [15:56:21] PROBLEM - puppet last run on titanium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:56:53] 06Operations, 10Ops-Access-Requests: Request for access to stat1003 for Sam Walton - https://phabricator.wikimedia.org/T145788#2640960 (10Samwalton9) [15:57:48] 06Operations, 10Ops-Access-Requests: Request for access to stat1003 for Sam Walton - https://phabricator.wikimedia.org/T145788#2640942 (10AlexMonk-WMF) > According to @Milimetric I should need the "researchers" and "statistics-users" groups. You'd need one or the other - not both. Looking at T115119 it seems... [15:59:12] PROBLEM - puppet last run on labsdb1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:59:36] (03PS2) 10Giuseppe Lavagetto: exim: move templates to the role module [puppet] - 10https://gerrit.wikimedia.org/r/310838 [15:59:38] (03PS1) 10Giuseppe Lavagetto: templates: move apache templates to the role module [puppet] - 10https://gerrit.wikimedia.org/r/310847 [16:00:04] godog, moritzm, and _joe_: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160915T1600). Please do the needful. [16:00:05] kart_ and hashar: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [16:00:19] <_joe_> puppet-swat is not going to happen today [16:00:55] :( [16:01:01] puppet merge is broken right ? [16:01:09] <_joe_> it's being transitioned [16:01:16] <_joe_> and alex has asked not to interfere [16:01:50] yeah that is understandable :/ I have added some new zuul .deb packages which we did not have time to upload this afternoon [16:02:05] can they get uploaded to apt.wm.o? 
(no puppet interference) [16:04:30] (03PS1) 10Alexandros Kosiaris: puppet-merge: Fix su invocations [puppet] - 10https://gerrit.wikimedia.org/r/310849 [16:05:07] 06Operations, 13Patch-For-Review: Decomission mw2061-mw2074 - https://phabricator.wikimedia.org/T144745#2641038 (10Papaul) Disk wipe complete Decommission tracking sheet update Racktables update Servers unracked complete [16:05:22] 06Operations, 06Release-Engineering-Team, 07HHVM: Migrate deployment servers (tin/mira) to jessie - https://phabricator.wikimedia.org/T144578#2641039 (10MoritzMuehlenhoff) I also built trebuchet-trigger for jessie and uploaded it to apt.wikimedia.org [16:06:03] (03CR) 10Alexandros Kosiaris: [C: 032] puppet-merge: Fix su invocations [puppet] - 10https://gerrit.wikimedia.org/r/310849 (owner: 10Alexandros Kosiaris) [16:06:07] (03PS2) 10Alexandros Kosiaris: puppet-merge: Fix su invocations [puppet] - 10https://gerrit.wikimedia.org/r/310849 [16:06:09] (03CR) 10Alexandros Kosiaris: [V: 032] puppet-merge: Fix su invocations [puppet] - 10https://gerrit.wikimedia.org/r/310849 (owner: 10Alexandros Kosiaris) [16:07:31] 06Operations, 06Discovery, 06Maps, 10Traffic, 03Interactive-Sprint: Maps - move traffic to eqiad instead of codfw - https://phabricator.wikimedia.org/T145758#2641055 (10Yurik) I suspect that both databases / tilesets are fairly similar. Then again, we had some job scheduling issue recently, so maybe we s... [16:09:28] PROBLEM - puppet last run on gallium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:11:19] PROBLEM - puppet last run on carbon is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:12:50] (03PS1) 10Alexandros Kosiaris: puppet-merge: Fix sha1 variable typo [puppet] - 10https://gerrit.wikimedia.org/r/310852 [16:12:51] <_joe_> these are just the precise hosts in my global run for puppetdb [16:12:54] <_joe_> ignore [16:12:58] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] puppet-merge: Fix sha1 variable typo [puppet] - 10https://gerrit.wikimedia.org/r/310852 (owner: 10Alexandros Kosiaris) [16:14:39] RECOVERY - puppet last run on titanium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:15:37] ACKNOWLEDGEMENT - puppet last run on ms-be1021 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[parted-/dev/sdh] Filippo Giunchedi part of T139767 [16:16:25] (03PS1) 10Alexandros Kosiaris: puppet-merge: One more typo fix [puppet] - 10https://gerrit.wikimedia.org/r/310854 [16:16:33] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] puppet-merge: One more typo fix [puppet] - 10https://gerrit.wikimedia.org/r/310854 (owner: 10Alexandros Kosiaris) [16:17:04] <_joe_> akosiaris: isn't it nice when you can't really test the things you write? :P [16:17:19] RECOVERY - puppet last run on labsdb1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:17:42] _joe_: in this specific case I just want to rip my eyes out [16:17:54] full of stupid little bugs... [16:18:37] <_joe_> akosiaris: happens pretty commonly :) [16:18:49] finally [16:18:50] (03PS1) 10Papaul: Decom: Remove DNS entries for mw20[6-7[[0-9] Bug:T144745 [dns] - 10https://gerrit.wikimedia.org/r/310856 (https://phabricator.wikimedia.org/T144745) [16:19:02] su - gitpuppet -c 'ssh -t -t puppetmaster2002.codfw.wmnet true 4d2b79fc22bfb14034c752dad523edf6948daa58 [16:19:23] OK: puppet-merge on puppetmaster2002.codfw.wmnet succeded [16:19:38] finally ... 
one more test and I 'll declare the situation resolved [16:20:26] (03CR) 10Muehlenhoff: [C: 032 V: 032] Bump version for jessie rebuild [software/deployment/trebuchet-trigger] - 10https://gerrit.wikimedia.org/r/310841 (owner: 10Muehlenhoff) [16:21:10] _joe_, also there's conftool-data in puppet [16:21:29] should we send it to a different path like conftool-data-deployment-prep and put our data there? [16:21:32] <_joe_> Krenair: well that is for prod and I would not create one set for beta in ops/puppet [16:21:48] <_joe_> I'd just create a dir in beta and update that manually, tbh [16:21:54] (03PS1) 10Alexandros Kosiaris: package_builder: Update README.md [puppet] - 10https://gerrit.wikimedia.org/r/310857 [16:22:05] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] package_builder: Update README.md [puppet] - 10https://gerrit.wikimedia.org/r/310857 (owner: 10Alexandros Kosiaris) [16:22:50] _joe_, thing is I think we have to point puppet to the correct path, so [16:23:04] and ofc git gc had to happen on my last test [16:23:38] jynus: _joe_ ottomata hashar and everyone else: puppet-merge works fine again, thanks for the patience. Situation resolved [16:23:58] Ah we can override conftool::master::sync_dir in hiera [16:24:50] akosiaris: congratulations :} [16:25:37] it is kind of cool to see you two sprinting the overhaul together :} [16:26:18] PROBLEM - puppet last run on prometheus2002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:27:20] RECOVERY - puppet last run on gallium is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [16:29:09] (03PS1) 10Alexandros Kosiaris: puppet-merge: run conftool-merge only on frontends [puppet] - 10https://gerrit.wikimedia.org/r/310859 [16:29:19] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] puppet-merge: run conftool-merge only on frontends [puppet] - 10https://gerrit.wikimedia.org/r/310859 (owner: 10Alexandros Kosiaris) [16:30:11] akosiaris: can we land the few patches of puppet swat so ? [16:30:41] hashar: I suppose so [16:30:52] (03PS1) 10Elukey: Add a directive to mod_proxy_html's yarn configuration [puppet] - 10https://gerrit.wikimedia.org/r/310862 (https://phabricator.wikimedia.org/T116192) [16:31:42] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] contint: drop browser test from Precise [puppet] - 10https://gerrit.wikimedia.org/r/308524 (owner: 10Hashar) [16:31:45] (03PS2) 10Alexandros Kosiaris: contint: drop browser test from Precise [puppet] - 10https://gerrit.wikimedia.org/r/308524 (owner: 10Hashar) [16:31:50] (03CR) 10Alexandros Kosiaris: [V: 032] contint: drop browser test from Precise [puppet] - 10https://gerrit.wikimedia.org/r/308524 (owner: 10Hashar) [16:32:43] all noop for prod [16:33:15] hashar regarding https://gerrit.wikimedia.org/r/#/c/300738/3/modules/contint/manifests/packages/androidsdk.pp,unified [16:33:28] we don't manage jenkins-deploy via puppet ? hence the hack ? 
[16:33:29] (03Abandoned) 10Elukey: Add a directive to mod_proxy_html's yarn configuration [puppet] - 10https://gerrit.wikimedia.org/r/310862 (https://phabricator.wikimedia.org/T116192) (owner: 10Elukey) [16:33:41] akosiaris: yeah jenkins-deploy is in ldap [16:33:55] (03PS1) 10Elukey: Add a directive to mod_proxy_html's yarn configuration [puppet] - 10https://gerrit.wikimedia.org/r/310863 (https://phabricator.wikimedia.org/T116192) [16:34:00] the reason for that patch is for the Android Slave [16:34:13] the android emulator requires access to /dev/kvm for some hardware emulation/speed up [16:34:16] (03CR) 10Alexandros Kosiaris: [C: 032] contint: role for Android testing [puppet] - 10https://gerrit.wikimedia.org/r/300738 (https://phabricator.wikimedia.org/T139137) (owner: 10Hashar) [16:34:21] (03PS4) 10Alexandros Kosiaris: contint: role for Android testing [puppet] - 10https://gerrit.wikimedia.org/r/300738 (https://phabricator.wikimedia.org/T139137) (owner: 10Hashar) [16:34:22] _joe_, well I've overridden the sync_dir to a place not in puppet [16:34:24] (03CR) 10Alexandros Kosiaris: [V: 032] contint: role for Android testing [puppet] - 10https://gerrit.wikimedia.org/r/300738 (https://phabricator.wikimedia.org/T139137) (owner: 10Hashar) [16:34:40] RECOVERY - puppet last run on carbon is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:34:52] I guess the only risks are when we change puppetmasters or when someone updates the services in puppet without updating deployment-prep's data [16:35:21] <_joe_> Krenair: yes [16:37:22] _joe_: so the race condition with backends calling puppet-merge later and getting a different sha1 than the master is practically gone now with the use of a git sha1 hash supplied [16:37:25] akosiaris: and if you feel adventurous I could use a couple zuul .deb upload to apt.wm.o listed on https://phabricator.wikimedia.org/T103529#2632489 [16:37:50] pretty minor change, just update a few shebangs in bin scripts 
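Note on the puppet-merge race mentioned at 16:37:22: the idea of handing backends an explicit git sha1, instead of letting each one resolve "latest" whenever it happens to run, can be illustrated with a small sketch. This is not the actual puppet-merge code; `Upstream`, `Puppetmaster`, `merge_latest` and `merge_pinned` are hypothetical names invented for the illustration, and the commit ids are made up.

```python
from dataclasses import dataclass, field


@dataclass
class Upstream:
    """Hypothetical stand-in for the shared git remote."""
    commits: list = field(default_factory=list)

    def push(self, sha1):
        self.commits.append(sha1)

    def head(self):
        return self.commits[-1]


@dataclass
class Puppetmaster:
    """Hypothetical stand-in for one puppetmaster's checkout."""
    name: str
    checked_out: str = ""

    def merge_latest(self, upstream):
        # Racy: "latest" is resolved at the moment this host runs.
        self.checked_out = upstream.head()

    def merge_pinned(self, upstream, sha1):
        # Pinned: merge exactly the commit that was supplied.
        if sha1 not in upstream.commits:
            raise ValueError(f"unknown commit {sha1}")
        self.checked_out = sha1


upstream = Upstream(["aaa111"])  # made-up commit ids
frontend = Puppetmaster("frontend")
backend = Puppetmaster("backend")

# Racy flow: a new commit lands between the two merges.
frontend.merge_latest(upstream)
upstream.push("bbb222")
backend.merge_latest(upstream)
racy_in_sync = frontend.checked_out == backend.checked_out

# Pinned flow: the backend is told which sha1 the frontend merged.
backend.merge_pinned(upstream, frontend.checked_out)
pinned_in_sync = frontend.checked_out == backend.checked_out
```

In the racy flow the two hosts end up on different commits; in the pinned flow they agree regardless of what was pushed in between, which is the property described in the log.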
[16:38:09] (03PS4) 10Alexandros Kosiaris: Remove cxserver restbase_url [puppet] - 10https://gerrit.wikimedia.org/r/306674 (https://phabricator.wikimedia.org/T129284) (owner: 10KartikMistry) [16:38:13] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Remove cxserver restbase_url [puppet] - 10https://gerrit.wikimedia.org/r/306674 (https://phabricator.wikimedia.org/T129284) (owner: 10KartikMistry) [16:38:16] so I think that just leaves the confd thing [16:38:35] <_joe_> Krenair: for that, let me upload a new package [16:40:15] mmm no Jenkins? [16:40:25] or is it only me [16:41:00] ah ok just read about puppet merge [16:42:52] akosiaris: thanks! [16:43:20] elukey: CI is waiting for instances [16:43:43] nothing urgent, will do tomorrow [16:44:22] kart_: is deployment-zotero01 on beta under your umbrella? [16:46:16] 06Operations, 10Traffic, 13Patch-For-Review: varnish backends start returning 503s after ~6 days uptime - https://phabricator.wikimedia.org/T145661#2641321 (10ema) I found something quite interesting while staring at ganglia. Look at cp4005's `fetch_304` before the ramp-up in `fetch_failed`, which is when 50... [16:47:09] hashar: please pass -sa to dpkg-buildpackage when building packages [16:47:28] it will add the orig source file in the .changes file [16:47:30] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 381 bytes in 0.036 second response time [16:47:37] akosiaris: .changes are missing the .orig.tar.gz ? :( [16:47:50] aarha [16:47:52] yes.. 
it means one more silly step for me and an error [16:47:55] elukey noticed that last time [16:48:27] and apparently git buildpackage or whatever else does not include the orig hash unless debian version is 1 [16:49:04] will have to hunt down how to always get -sa passed [16:49:12] !log uploaded to apt.wikimedia.org jessie-wikimedia: zuul_2.5.0-8-gcbc7f62-wmf3jessie1 [16:49:12] !log uploaded to apt.wikimedia.org precise-wikimedia: zuul_2.5.0-8-gcbc7f62-wmf3precise1 [16:49:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:49:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:49:40] RECOVERY - puppet last run on prometheus2002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:51:24] !log change-prop deploy gerrit 310873 [16:51:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:53:02] akosiaris: I have filled a task for -sa ( https://phabricator.wikimedia.org/T145797 ) will find a solution to always -sa . Sorry about that :( [16:53:12] hashar: cool! thanks! [16:53:37] akosiaris: then we already had that tarball uploaded previously, so I am not sure why reprepro can't find it [16:53:57] it's about it not being in .changes files [16:54:06] that's why it complains, not that it can not find it [16:54:12] OHH [16:54:22] so it is merely a warning ? [16:54:46] er, no. it actually an error. It won't continue importing the packages [16:54:57] it can be bypassed by includedsc [16:55:07] but there is really no reason to go through that if you can avoid it [16:55:15] regardless, I will get it fixed [16:55:27] and make sure future .changes always have it [16:55:33] ok. thanks [16:55:35] thank you for the uploads and puppet merges \O/ [16:58:01] _joe_: so q: conftool-merge is required to run only on one frontend, right ? 
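Note on the -sa discussion above (16:47 to 16:55): the import failed because the .changes file did not list the .orig tarball, not because the tarball was missing from the archive. A minimal sketch of a pre-upload check follows, assuming the .changes file uses the usual Debian control layout with a "Files:" field followed by indented entries. `lists_orig_tarball` is a hypothetical helper, not part of dpkg or reprepro, and the sample file contents are made up.

```python
import re


def lists_orig_tarball(changes_text):
    """Return True if the Files: section names an .orig.tar.* file."""
    in_files = False
    for line in changes_text.splitlines():
        if line.startswith("Files:"):
            in_files = True
            continue
        if in_files:
            if not line.startswith(" "):
                break  # next field starts, so the Files: section is over
            parts = line.split()
            # The file name is the last whitespace-separated column.
            if parts and re.search(r"\.orig\.tar\.[a-z0-9]+$", parts[-1]):
                return True
    return False


# Made-up sample data; checksums and sizes are placeholders.
WITHOUT_ORIG = """\
Format: 1.8
Source: example
Files:
 0123 1000 net optional example_1.0-2.dsc
 4567 2000 net optional example_1.0-2.debian.tar.xz
 89ab 3000 net optional example_1.0-2_all.deb
"""
WITH_ORIG = WITHOUT_ORIG + " cdef 4000 net optional example_1.0.orig.tar.gz\n"

print(lists_orig_tarball(WITHOUT_ORIG))
print(lists_orig_tarball(WITH_ORIG))
```

A check like this could run before handing the .changes file over for import, so a build made without -sa is caught before the upload step rather than during it.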
[17:00:04] yurik, gwicke, cscott, arlolra, subbu, halfak, and Amir1: Respected human, time to deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160915T1700). Please do the needful. [17:01:59] Nothing for ORES [17:02:34] PROBLEM - puppet last run on netmon1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:03:02] I am leaving [17:03:11] will switch group1 2 hours from now [17:03:19] PROBLEM - puppet last run on prometheus1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:03:20] then most probably group2 as well [17:06:00] RECOVERY - puppet last run on prometheus1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:08:57] (03Abandoned) 10Dzahn: contint: don't use ensure 'latest' with php packages [puppet] - 10https://gerrit.wikimedia.org/r/310704 (https://phabricator.wikimedia.org/T115348) (owner: 10Dzahn) [17:10:20] RECOVERY - puppet last run on netmon1001 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [17:11:58] allah is doing [17:11:58] sun is not doing allah is doing [17:12:03] moon is not doing allah is doing [17:12:11] stars are not doing allah is doing [17:12:17] planets are not doing allah is doing [17:12:24] galaxies are not doing allah is doing [17:12:29] oceans are not doing allah is doinf [17:12:52] oceans are not doing allah is doing [17:12:52] mountains are not doing allah is doing [17:12:53] trees are not doing allah is doing [17:12:59] mom is not doing allah is doing [17:13:05] <_joe_> akosiaris: yes [17:13:06] dad is not doing allah is doing [17:13:12] boss is not doing allah is doing [17:13:14] _joe_: ban ? [17:13:20] <_joe_> akosiaris: ignore? 
[17:13:26] job is not doing allah is doing [17:13:26] dollar is not doing allah is doing [17:13:32] <_joe_> let this dickwad do his thing [17:13:46] <_joe_> and just ignore him [17:13:48] ok, k-lined probably [17:13:51] <_joe_> yes [17:14:00] Amateur compared to icinga [17:14:04] lol [17:14:17] hahaha [17:14:44] <_joe_> mark: indeed [17:15:50] (03PS2) 10Papaul: Decom: Remove mgmt DNS entries for mw2061-2074 [dns] - 10https://gerrit.wikimedia.org/r/310856 (https://phabricator.wikimedia.org/T144745) [17:15:54] (03PS3) 10Dzahn: Decom: Remove mgmt DNS entries for mw2061-2074 [dns] - 10https://gerrit.wikimedia.org/r/310856 (https://phabricator.wikimedia.org/T144745) (owner: 10Papaul) [17:16:35] (03CR) 10Dzahn: [C: 032] "mw2016 thru mw2074 have been powered down and wiped already" [dns] - 10https://gerrit.wikimedia.org/r/310856 (https://phabricator.wikimedia.org/T144745) (owner: 10Papaul) [17:17:54] 06Operations, 13Patch-For-Review: Decomission mw2061-mw2074 - https://phabricator.wikimedia.org/T144745#2608981 (10Dzahn) mgmt DNS entries removed https://gerrit.wikimedia.org/r/310856 [17:18:16] 06Operations, 13Patch-For-Review: Decomission mw2061-mw2074 - https://phabricator.wikimedia.org/T144745#2641447 (10Dzahn) @Papaul resolved ? 
[17:18:31] Operations: Decomission mw2061-mw2074 - https://phabricator.wikimedia.org/T144745#2641448 (Dzahn)
[17:18:32] !log change-prop deploy 310877
[17:18:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[17:18:45] Sigyn: ping
[17:19:37] Operations, Patch-For-Review: Audit uses of package=>latest - https://phabricator.wikimedia.org/T115348#2641453 (Dzahn) contint module wants latest PHP packages, so that stays as it is
[17:21:37] (PS2) Dzahn: delete ircyall module and role [puppet] - https://gerrit.wikimedia.org/r/310692 (https://phabricator.wikimedia.org/T1357)
[17:23:05] (PS1) Andrew Bogott: labspuppetbackend: Add prefix delete [puppet] - https://gerrit.wikimedia.org/r/310879 (https://phabricator.wikimedia.org/T133412)
[17:24:06] (CR) Dzahn: [C: 1] "yes, i think the quick fix is just fine and as you said the list of language is rather static. i dont recall a single request in years to " [puppet] - https://gerrit.wikimedia.org/r/310746 (https://phabricator.wikimedia.org/T144933) (owner: Muehlenhoff)
[17:25:01] (CR) Dzahn: [C: 2] delete ircyall module and role [puppet] - https://gerrit.wikimedia.org/r/310692 (https://phabricator.wikimedia.org/T1357) (owner: Dzahn)
[17:25:16] (CR) jenkins-bot: [V: -1] labspuppetbackend: Add prefix delete [puppet] - https://gerrit.wikimedia.org/r/310879 (https://phabricator.wikimedia.org/T133412) (owner: Andrew Bogott)
[17:27:25] (PS2) Andrew Bogott: labspuppetbackend: Add prefix delete [puppet] - https://gerrit.wikimedia.org/r/310879 (https://phabricator.wikimedia.org/T133412)
[17:27:27] (Abandoned) Dzahn: mysql_wmf: Use require_package instead of package latest [puppet] - https://gerrit.wikimedia.org/r/310712 (https://phabricator.wikimedia.org/T115348) (owner: Paladox)
[17:27:30] RECOVERY - puppet last run on ms-be3003 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[17:27:51] (Abandoned) Dzahn: contint: Use require_package instead of package latest [puppet] - https://gerrit.wikimedia.org/r/310708 (https://phabricator.wikimedia.org/T115348) (owner: Paladox)
[17:28:23] (CR) Dzahn: [C: 1] mediawiki_singlenode: Use require_package instead of package latest [puppet] - https://gerrit.wikimedia.org/r/310703 (https://phabricator.wikimedia.org/T115348) (owner: Paladox)
[17:28:47] (CR) Dzahn: [C: 1] gridengine: use present instead of latest in package [puppet] - https://gerrit.wikimedia.org/r/310710 (https://phabricator.wikimedia.org/T115348) (owner: Paladox)
[17:30:05] (CR) Yuvipanda: [C: -1] labspuppetbackend: Add prefix delete (1 comment) [puppet] - https://gerrit.wikimedia.org/r/310879 (https://phabricator.wikimedia.org/T133412) (owner: Andrew Bogott)
[17:31:10] PROBLEM - puppet last run on palladium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:32:00] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [1000.0]
[17:32:11] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [1000.0]
[17:33:00] <_joe_> uh can someone look into this? ^^
[17:34:00] doesn't look to be coming from MW...
[17:34:32] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[17:40:14] (Abandoned) Dzahn: labstore: move resource defines into subdir, layout warnings [puppet] - https://gerrit.wikimedia.org/r/308131 (owner: Dzahn)
[17:41:58] (PS1) Yuvipanda: labs: Add more debugging to the httpyaml backend [puppet] - https://gerrit.wikimedia.org/r/310886
[17:42:08] did I get kicked out of here too?
[17:42:13] apparently
[17:42:21] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [1000.0]
[17:42:26] (PS2) Yuvipanda: labs: Add more debugging to the httpyaml backend [puppet] - https://gerrit.wikimedia.org/r/310886
[17:42:31] (CR) Yuvipanda: [C: 2 V: 2] labs: Add more debugging to the httpyaml backend [puppet] - https://gerrit.wikimedia.org/r/310886 (owner: Yuvipanda)
[17:43:00] (Abandoned) Dzahn: salt-misc: add new target_type role (WIP) [software] - https://gerrit.wikimedia.org/r/276890 (owner: Dzahn)
[17:44:23] !log depool and restart varnish-be on cp1049
[17:44:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[17:44:53] (PS3) Andrew Bogott: labspuppetbackend: Add prefix delete [puppet] - https://gerrit.wikimedia.org/r/310879 (https://phabricator.wikimedia.org/T133412)
[17:45:14] (CR) Dzahn: [C: 2] "only comments" [puppet] - https://gerrit.wikimedia.org/r/310717 (https://phabricator.wikimedia.org/T127797) (owner: Dzahn)
[17:45:30] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:46:24] (PS2) Dzahn: apache: fix 42 x 'class not documented', add doc links [puppet] - https://gerrit.wikimedia.org/r/310717 (https://phabricator.wikimedia.org/T127797)
[17:47:27] (PS4) Andrew Bogott: labspuppetbackend: Add prefix delete [puppet] - https://gerrit.wikimedia.org/r/310879 (https://phabricator.wikimedia.org/T133412)
[17:49:10] RECOVERY - puppet last run on palladium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[17:49:20] PROBLEM - puppet last run on elastic2017 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:49:27] (CR) Yuvipanda: [C: -1] labspuppetbackend: Add prefix delete (2 comments) [puppet] - https://gerrit.wikimedia.org/r/310879 (https://phabricator.wikimedia.org/T133412) (owner: Andrew Bogott)
[17:50:49] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[17:51:32] (PS5) Andrew Bogott: labspuppetbackend: Add prefix delete [puppet] - https://gerrit.wikimedia.org/r/310879 (https://phabricator.wikimedia.org/T133412)
[17:51:35] ^ i just ran it on neon
[17:52:11] (PS3) Dzahn: apache: fix 42 x 'class not documented', add doc links [puppet] - https://gerrit.wikimedia.org/r/310717 (https://phabricator.wikimedia.org/T127797)
[17:53:12] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[17:53:41] PROBLEM - puppet last run on ms-be3003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[parted-/dev/sdl]
[17:56:04] (PS1) Yuvipanda: labs: Actually add debugging to httpyaml backend [puppet] - https://gerrit.wikimedia.org/r/310890
[17:56:10] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[18:00:04] anomie, ostriches, thcipriani, hashar, twentyafterfour, and aude: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160915T1800). Please do the needful.
[18:00:04] yurik, jgirault, and MatmaRex: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[18:00:18] here
[18:00:54] hi
[18:00:54] PROBLEM - puppet last run on fluorine is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:01:41] I can SWAT
[18:03:08] jgirault: hrm, looks like one of those patches will need a full scap
[18:03:24] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.095 second response time
[18:03:32] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [1000.0]
[18:04:24] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[18:04:25] (PS2) Yuvipanda: labs: Actually add debugging to httpyaml backend [puppet] - https://gerrit.wikimedia.org/r/310890
[18:04:28] (CR) Yuvipanda: [C: 2 V: 2] labs: Actually add debugging to httpyaml backend [puppet] - https://gerrit.wikimedia.org/r/310890 (owner: Yuvipanda)
[18:05:15] (PS1) Dzahn: varnish/htcppurger: don't use ensure => latest [puppet] - https://gerrit.wikimedia.org/r/310895 (https://phabricator.wikimedia.org/T115348)
[18:06:33] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[18:08:55] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[18:09:31] (PS1) Dzahn: base: don't use 'latest' for standard package installs [puppet] - https://gerrit.wikimedia.org/r/310897 (https://phabricator.wikimedia.org/T115348)
[18:09:57] Operations, Patch-For-Review: Audit uses of package=>latest - https://phabricator.wikimedia.org/T115348#2641627 (Dzahn)
[18:11:27] (CR) Dzahn: "@hashar There is "debdeploy" meanwhile and afaik that already does the job that unattended upgrades would do." [puppet] - https://gerrit.wikimedia.org/r/310702 (https://phabricator.wikimedia.org/T115348) (owner: Paladox)
[18:12:07] jgirault: https://gerrit.wikimedia.org/r/#/c/310848/ is live on mw1099, check please
[18:12:42] (Abandoned) Dzahn: mariadb: Use require_package instead of package latest [puppet] - https://gerrit.wikimedia.org/r/310702 (https://phabricator.wikimedia.org/T115348) (owner: Paladox)
[18:12:56] PROBLEM - puppet last run on mw2148 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:14:08] Operations, Patch-For-Review: Audit uses of package=>latest - https://phabricator.wikimedia.org/T115348#2641644 (Dzahn)
[18:14:47] Operations, Patch-For-Review: Audit uses of package=>latest - https://phabricator.wikimedia.org/T115348#1722037 (Dzahn)
[18:15:15] Operations, Patch-For-Review: Audit uses of package=>latest - https://phabricator.wikimedia.org/T115348#1722037 (Dzahn)
[18:15:34] RECOVERY - puppet last run on elastic2017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[18:16:16] (CR) Dzahn: [C: 1] ldap: Use require_package instead of package latest [puppet] - https://gerrit.wikimedia.org/r/310706 (https://phabricator.wikimedia.org/T115348) (owner: Paladox)
[18:16:27] PROBLEM - puppet last run on mw2231 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:16:28] RECOVERY - puppet last run on fluorine is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[18:16:57] mutante: hi about package => latest, some can be migrated to unattended-upgrades, but it is not trivial :(
[18:17:09] hashar: debdeploy
[18:17:29] no way I am going to add yet another software to the stack :D
[18:17:40] at least for CI, we get upgrade from upstream distros
[18:17:46] so we dont really know when the upgrade has to be done
[18:18:08] hashar: that's Moritz' work and has already been done. it's the reason why that audit ticket exists afaict
[18:18:09] it is much easier to have it handled automatically, and ensure => latest is the simplest way to achieve that
[18:18:25] i already abandoned the ones for CI
[18:18:31] yeah
[18:18:38] it's not new
[18:18:42] but got the point to migrate to unattended upgrades nonetheless :]
[18:18:53] (PS1) Yuvipanda: labs: Remove debugging from httpyaml backend [puppet] - https://gerrit.wikimedia.org/r/310898
[18:18:55] (PS1) Yuvipanda: labtest: Get rid of mwyaml hiera backend [puppet] - https://gerrit.wikimedia.org/r/310899 (https://phabricator.wikimedia.org/T145808)
[18:19:08] (PS2) Yuvipanda: labs: Remove debugging from httpyaml backend [puppet] - https://gerrit.wikimedia.org/r/310898
[18:19:12] (CR) Yuvipanda: [C: 2 V: 2] labs: Remove debugging from httpyaml backend [puppet] - https://gerrit.wikimedia.org/r/310898 (owner: Yuvipanda)
[18:19:32] hashar: that would be using 2 different things to achieve the same thing though
[18:19:43] jgirault: are there any problems with the css change on mw1099?
[18:20:41] thcipriani yes, I confirm it works
[18:20:49] (CR) jenkins-bot: [V: -1] labtest: Get rid of mwyaml hiera backend [puppet] - https://gerrit.wikimedia.org/r/310899 (https://phabricator.wikimedia.org/T145808) (owner: Yuvipanda)
[18:21:00] jgirault: cool, thanks, going everywhere
[18:21:39] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[18:22:00] hashar: https://wikitech.wikimedia.org/wiki/Software_deployment#Proposed_implementation afaict
[18:22:51] (PS1) Yuvipanda: Revert "labtest hiera: use labtestwikitech, not wikitech" [puppet] - https://gerrit.wikimedia.org/r/310900 (https://phabricator.wikimedia.org/T145808)
[18:22:56] (PS2) Yuvipanda: Revert "labtest hiera: use labtestwikitech, not wikitech" [puppet] - https://gerrit.wikimedia.org/r/310900 (https://phabricator.wikimedia.org/T145808)
[18:23:12] (CR) Yuvipanda: "Abandoned in favor of https://gerrit.wikimedia.org/r/#/c/310900/" [puppet] - https://gerrit.wikimedia.org/r/310899 (https://phabricator.wikimedia.org/T145808) (owner: Yuvipanda)
[18:23:16] (Abandoned) Yuvipanda: labtest: Get rid of mwyaml hiera backend [puppet] - https://gerrit.wikimedia.org/r/310899 (https://phabricator.wikimedia.org/T145808) (owner: Yuvipanda)
[18:23:49] !log depool and restart varnish-be on cp1074
[18:23:53] !log thcipriani@tin Synchronized php-1.28.0-wmf.19/extensions/Kartographer: SWAT: [[gerrit:310848|Fix map popup CSS (T145716)]] (duration: 00m 56s)
[18:23:54] T145716: Map popup CSS broke in 1.28.0-wmf.19 - https://phabricator.wikimedia.org/T145716
[18:23:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:23:59] ^ jgirault live everywhere
[18:24:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:24:08] RECOVERY - puppet last run on mw2231 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[18:24:45] thcipriani verified, all good.
[18:24:57] jgirault: awesome, thanks for checking :)
[18:25:26] (CR) Yuvipanda: [C: 2] Revert "labtest hiera: use labtestwikitech, not wikitech" [puppet] - https://gerrit.wikimedia.org/r/310900 (https://phabricator.wikimedia.org/T145808) (owner: Yuvipanda)
[18:25:41] Operations, Patch-For-Review: Audit uses of package=>latest - https://phabricator.wikimedia.org/T115348#2641695 (Dzahn)
[18:25:56] Operations, Patch-For-Review: Audit uses of package=>latest - https://phabricator.wikimedia.org/T115348#1722037 (Dzahn)
[18:30:53] MatmaRex: https://gerrit.wikimedia.org/r/#/c/310872/ is live on mw1099, check please
[18:32:20] thcipriani: lgtm, the module exists
[18:32:37] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[18:33:05] !log titanium - stop salt, stop puppet, revoke puppet cert, delete salt key
[18:33:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:33:20] MatmaRex: kk, syncing this is tricky. I'm just going to wait for https://gerrit.wikimedia.org/r/#/c/310851/ to merge for jgirault and then do a full scap.
[18:33:54] sure
[18:33:56] RECOVERY - puppet last run on mw2148 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[18:36:58] !log thcipriani@tin Started scap: SWAT: [[gerrit:310851|Add missing close button title message (T145774)]] and [[gerrit:310872|Revert "Remove jquery.arrowSteps module" (T144974)]]
[18:37:00] T145774: Missing fullscreen close button title message - https://phabricator.wikimedia.org/T145774
[18:37:00] T144974: Move jquery.arrowSteps to UploadWizard and remove unnecessary code - https://phabricator.wikimedia.org/T144974
[18:37:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:37:49] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[18:43:18] PROBLEM - puppet last run on cp2012 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[generate_varnishkafka_webrequest_gmond_pyconf]
[18:44:44] PROBLEM - puppet last run on ms-be2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:55:59] thcipriani: ah sorry tried to do a scap
[18:56:07] forgot to check whether the swat was completed :]
[18:56:16] hashar: heh, didn't notice
[18:56:25] php-1.28.0-wmf.19/extensions/CentralAuth/maintenance/fixStuckGlobalRename.php
[18:56:29] so should not overlap :)
[18:56:32] hashar: sorry, running long. Have a full scap going.
[18:56:55] no problem
[18:57:02] I got the script from master
[18:57:14] backported to wmf branch and I can run it from the staging area of tin just fine
[18:57:38] PROBLEM - puppet last run on mw1240 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/sudoers]
[18:59:42] !log depool and restart varnish-be on cp1099
[18:59:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:00:05] hashar: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160915T1900).
[19:00:59] bah
[19:01:12] deploypromote does CR+2 automatically
[19:02:12] PROBLEM - puppet last run on analytics1038 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/mysql-ps1.sh]
[19:02:45] hashar: just changed IIRC
[19:02:55] yeah that is quite handy
[19:03:01] scap still running ?
[19:03:09] yeah, rebuilding cdbs right now
[19:03:28] about ¼ of the way done with that
[19:03:41] I am wondering why we made it to rebuild on each of the servers
[19:03:45] monospace fractions. 1/2 now
[19:04:10] because cdb files defeat rsync is my understanding.
[19:04:18] like a minor change changes the whole file
[19:04:25] so we'd end up rsyncing a lot of data
[19:04:48] but my rsyncing json files, we rsync a tiny amount of data, and then just rebuild the cdb on the other side
[19:05:06] !log thcipriani@tin Finished scap: SWAT: [[gerrit:310851|Add missing close button title message (T145774)]] and [[gerrit:310872|Revert "Remove jquery.arrowSteps module" (T144974)]] (duration: 28m 08s)
[19:05:08] T145774: Missing fullscreen close button title message - https://phabricator.wikimedia.org/T145774
[19:05:08] T144974: Move jquery.arrowSteps to UploadWizard and remove unnecessary code - https://phabricator.wikimedia.org/T144974
[19:05:11] magic
[19:05:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:05:17] ^ MatmaRex jgirault sync'd!
[19:05:31] syncing a couple changes
[19:05:59] Operations, Analytics-Cluster, Patch-For-Review: decom titanium - https://phabricator.wikimedia.org/T145666#2641821 (Dzahn) 11:33 < mutante> !log titanium - stop salt, stop puppet, revoke puppet cert, delete salt key
[19:06:09] !log hashar@tin Synchronized php-1.28.0-wmf.19/extensions/CentralAuth/maintenance/fixStuckGlobalRename.php: To unblock renames stuck on mediawiki.org T145596 (duration: 00m 47s)
[19:06:10] T145596: Renames getting stuck on mediawiki.org (Sept 13, 2016) - https://phabricator.wikimedia.org/T145596
[19:06:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:07:03] !log hashar@tin Synchronized php-1.28.0-wmf.18/extensions/CentralAuth/maintenance/fixStuckGlobalRename.php: To unblock renames stuck on mediawiki.org T145596 (duration: 00m 47s)
[19:07:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:07:11] RECOVERY - puppet last run on cp2012 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[19:08:19] !log depool and restart varnish-be on cp1072
[19:08:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:08:50] thcipriani: can i do the group1 upgrade?
[19:08:59] hashar: yup, all clear
[19:09:08] * hashar press Enter
[19:09:11] :)
[19:09:20] !log hashar@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.28.0-wmf.19
[19:09:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:09:32] will let it settle
[19:09:36] then do group2
[19:11:00] RECOVERY - puppet last run on ms-be2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:12:39] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:16:53] * hashar yawns
[19:17:40] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
[19:17:59] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:20:11] (CR) Papaul: [C: 2] partman: delete some more unused recipes [puppet] - https://gerrit.wikimedia.org/r/306501 (owner: Dzahn)
[19:20:13] !log depool and restart varnish-be on cp1063
[19:20:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:20:29] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[19:22:00] PROBLEM - Varnishkafka log producer on cp1074 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka
[19:23:45] RECOVERY - puppet last run on mw1240 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:24:30] RECOVERY - Varnishkafka log producer on cp1074 is OK: PROCS OK: 1 process with command name varnishkafka
[19:25:36] Operations, GlobalRename, MediaWiki-extensions-CentralAuth, Patch-For-Review: GlobalRename gets stuck sometimes - https://phabricator.wikimedia.org/T137973#2641896 (hashar) To manually fix a blocked rename, one can run: ``` mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php ```...
[19:26:09] so
[19:26:15] going to push to group2
[19:26:47] so far all my checks look sane
[19:27:11] yeah I have checked the hewiki page from yesterday
[19:27:23] that was a really nice catch
[19:27:26] ytes, thanks i saw on the ticker
[19:27:54] s/ticker/ticket ytes/yes
[19:28:57] oh
[19:29:01] CirrusSearch has exploded
[19:29:10] PROBLEM - wikidata.org dispatch lag is higher than 300s on wikidata is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1733 bytes in 0.745 second response time
[19:30:03] things are slow on Wikidata
[19:30:03] RECOVERY - puppet last run on analytics1038 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[19:33:45] !log depool and restart varnish-be on cp1050
[19:33:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:34:26] and wikidata dispatch lag is stall https://grafana.wikimedia.org/dashboard/db/wikidata-dispatch :(
[19:34:44] Did we deploy?
[19:35:35] It's maybe not too bad yet... a little high but maybe a bot has gone crazy
[19:36:04] yeah
[19:36:27] audephone: correlates with the wmf.19 deployments
[19:36:33] I would just check again in maybe 10-15 min to see if it gets worse
[19:36:39] Oh
[19:36:41] so I guess I will revert wikidata to wmf.18
[19:37:00] If it doesn't recover on its own soon
[19:37:21] Operations, GlobalRename, MediaWiki-extensions-CentralAuth, Patch-For-Review: GlobalRename gets stuck sometimes - https://phabricator.wikimedia.org/T137973#2641931 (Ladsgroup) Thanks for the tip @hashar. I guess people can ping me (or lots of other people with access to terbium) to do it in case...
[19:37:36] I can take a look later when I get home, if still needed
[19:38:03] (PS1) Hashar: Revert "all wikis to 1.28.0-wmf.19" [mediawiki-config] - https://gerrit.wikimedia.org/r/310925 (https://phabricator.wikimedia.org/T143328)
[19:38:10] audephone: take your time :)
[19:38:25] (CR) Hashar: [C: 2] "Has never been deployed." [mediawiki-config] - https://gerrit.wikimedia.org/r/310925 (https://phabricator.wikimedia.org/T143328) (owner: Hashar)
[19:38:34] Ok
[19:38:48] It looks nothing to panic yet
[19:38:51] (Merged) jenkins-bot: Revert "all wikis to 1.28.0-wmf.19" [mediawiki-config] - https://gerrit.wikimedia.org/r/310925 (https://phabricator.wikimedia.org/T143328) (owner: Hashar)
[19:39:20] Thanks
[19:39:51] audephone: it has a bunch of exceptions in PASS though
[19:39:55] will try to dig that
[19:41:45] ah lalala
[19:41:50] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 381 bytes in 0.027 second response time
[19:41:52] wikidatawiki terbium Failed to run getConfiguration.php.
[19:45:06] !log depool and restart varnish-be on cp1073
[19:45:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:50:24] !log depool and restart varnish-be on cp1062
[19:50:25] moving wikidatawiki back
[19:50:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:51:03] (PS1) Hashar: wikidatawiki back to 1.28.0-wmf.18 [mediawiki-config] - https://gerrit.wikimedia.org/r/310934 (https://phabricator.wikimedia.org/T145819)
[19:51:12] (CR) Hashar: [C: 2] wikidatawiki back to 1.28.0-wmf.18 [mediawiki-config] - https://gerrit.wikimedia.org/r/310934 (https://phabricator.wikimedia.org/T145819) (owner: Hashar)
[19:51:24] thcipriani: I will not be able to do group2
[19:51:36] (Merged) jenkins-bot: wikidatawiki back to 1.28.0-wmf.18 [mediawiki-config] - https://gerrit.wikimedia.org/r/310934 (https://phabricator.wikimedia.org/T145819) (owner: Hashar)
[19:52:18] !log hashar@tin rebuilt wikiversions.php and synchronized wikiversions files: (no message)
[19:52:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:54:00] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.083 second response time
[19:54:18] ^^ that will probaly affect grrrit-wm bot
[19:54:33] nope, that shouldn't affect anything running on k8s
[19:55:49] (PS1) Yuvipanda: labs: Add httpyaml hiera backend to labs [puppet] - https://gerrit.wikimedia.org/r/310938 (https://phabricator.wikimedia.org/T91990)
[19:55:52] Oh
[19:55:54] andrewbogott: ^^
[19:56:15] (CR) Jdlrobson: Initiate Hovercards A/B test on ruwiki and itwiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/310483 (https://phabricator.wikimedia.org/T136746) (owner: Jhobs)
[19:56:24] !log depool and restart varnish-be on cp1064
[19:56:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:58:14] hashar: np. looks like there are problems in a few places?
[19:58:20] yeah
[19:58:22] (CR) Andrew Bogott: [C: 1] labs: Add httpyaml hiera backend to labs [puppet] - https://gerrit.wikimedia.org/r/310938 (https://phabricator.wikimedia.org/T91990) (owner: Yuvipanda)
[19:58:31] the wikidata process to dispatch changes to the other wikis got some issue
[19:58:34] (PS2) Yuvipanda: labs: Add httpyaml hiera backend to labs [puppet] - https://gerrit.wikimedia.org/r/310938 (https://phabricator.wikimedia.org/T91990)
[19:58:42] and I am HIGHLY suspecting some job / cron on terbium
[19:58:43] (CR) Yuvipanda: [C: 2 V: 2] "Let's see what breaks!" [puppet] - https://gerrit.wikimedia.org/r/310938 (https://phabricator.wikimedia.org/T91990) (owner: Yuvipanda)
[19:59:19] thcipriani: I have only reverted wikidata for now
[19:59:33] maybe we can move the rest to .19 anyway
[19:59:48] !log depool and restart varnish-be on cp1048
[19:59:50] looks like a lot of the 'Failed to run getConfiguration.php' in the logs for wmf.19
[19:59:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:01:25] lots of jobqueue stuff failing for wmf.19, seemingly.
[20:01:29] !log restart puppetmaster on labcontrol1001 to pick up hiera changes
[20:01:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:01:41] RECOVERY - wikidata.org dispatch lag is higher than 300s on wikidata is OK: HTTP OK: HTTP/1.1 200 OK - 1723 bytes in 0.240 second response time
[20:02:48] thcipriani: any specific jobs?
[20:04:05] hashar: lots of RunJobs.php stuff in fatalmonitor on logstash but nothing too crazy. Maybe that "normal" :((
[20:04:54] https://logstash.wikimedia.org/goto/f3dc98d9148473f8ade37a6bd45c8b12
[20:05:08] (PS1) Yuvipanda: labs: Use puppetmaster service URL for hiera backend [puppet] - https://gerrit.wikimedia.org/r/310945 (https://phabricator.wikimedia.org/T91990)
[20:05:16] although it seems like it was mostly wikidatawiki and it was all wmf.19...so
[20:05:34] andrewbogott: ^^
[20:05:38] (or at least everything from terbium was wikidatawiki)
[20:05:42] yeah looks not so good
[20:05:59] PROBLEM - HP RAID on ms-be1026 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds.
[20:06:42] (PS2) Yuvipanda: labs: Use puppetmaster service URL for hiera backend [puppet] - https://gerrit.wikimedia.org/r/310945 (https://phabricator.wikimedia.org/T91990)
[20:06:54] (CR) Yuvipanda: [C: 2 V: 2] "This is needed to unbreak self hosted puppetmasters" [puppet] - https://gerrit.wikimedia.org/r/310945 (https://phabricator.wikimedia.org/T91990) (owner: Yuvipanda)
[20:07:22] hashar: yeah, I can keep an eye on things if you've got to run. Rollback if they get worse. Roll forward if possible.
[20:07:44] yuvipanda: it's really as simple as that?
[20:07:50] andrewbogott: for hiera yeah
[20:08:00] andrewbogott: need a little more work for enc
[20:08:05] thcipriani: I think the wikidata dispatch lag is gone now ( https://grafana.wikimedia.org/dashboard/db/wikidata-dispatch )
[20:08:10] (over 2 hours )
[20:08:21] RECOVERY - HP RAID on ms-be1026 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor
[20:08:47] !log increasing number of shards per node for enwiki_content index to 2 on elasticsearch codfw
[20:08:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:09:07] hashar: ^ that should help the situation.
[20:09:22] andrewbogott: it would've been equally trivial if the self hosted puppetmaster code wasn't a total duplicate of the puppetmaster code tho
[20:09:44] PROBLEM - HP RAID on ms-be1024 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds.
[20:10:19] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:10:39] hashar: yeah, looking at that graph w/deployments seems to match up pretty well :\
[20:10:43] gehel: looks like java to me but I trust you :]
[20:10:51] thcipriani: yeah I am pretty sure
[20:11:03] but then I could not find any exception / error or whatever
[20:11:09] beside the getConfiguration.php error on terbium
[20:11:26] hashar: we were talking about that earlier with dcausse. This cluster restart is a bit worse than usual. It is the first one since we activated row aware shard allocation...
[20:11:41] thcipriani: at least https://www.wikidata.org/wiki/Special:DispatchStats shows a lag of 0 minutes
[20:11:47] still, during a restart, we do loose shards and we expect a few errors...
[20:12:05] gehel: bunch of message were complaining about lack of quorum to achieve consistency
[20:12:09] so I guess it is transient
[20:12:09] RECOVERY - HP RAID on ms-be1024 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor
[20:12:49] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[20:13:53] !log varnish-be esams cache_upload: rolling depool and restart
[20:14:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:14:04] seems like the wikidata regression is fairly serious. Enough to be noticed anyway. Seems reasonable to hold the train for it. Do we have a task?
[20:14:41] yeah
[20:14:46] https://phabricator.wikimedia.org/T145819
[20:14:48] and
[20:14:52] !log remove self from github wikimedia org, was getting spammed for each new repo creation
[20:14:55] I am really tempted to move all the rest to .19
[20:14:59] but keep wikidata at .18
[20:14:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:15:12] then check with them tomorrow and do a friday update of wikidata or hold it until monday
[20:16:25] (PS1) Hashar: All wiki but wikidatawiki to php-1.28.0-wmf.19 [mediawiki-config] - https://gerrit.wikimedia.org/r/310948 (https://phabricator.wikimedia.org/T145819)
[20:16:29] PROBLEM - puppet last run on mw1182 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/sbin/x509-bundle]
[20:17:30] thcipriani: "All wiki but wikidatawiki to php-1.28.0-wmf.19" sounds like aplan ?
[20:17:36] hrm. We've never done that before. It would be bad to get too far out of sync. I suppose I don't see any harm in keeping this train on schedule, but we can't run any future trains with wikidata two revisions back.
[20:17:57] I can take a look later
[20:18:01] my idea is to check in with wikidata tomorrow on friday
[20:18:01] hashar: so, in short, sure your call :)
[20:18:10] and depending on the chat outcome
[20:18:14] decide to freeze train next week
[20:18:15] Maybe we can get wikidata on wmf19 tomorrow
[20:18:25] eg keep wikidata at .18 and rest at .19 until that is figured out
[20:18:28] yeah, that seems all fine to me
[20:18:30] Or Monday,
[20:18:46] if we're not seeing regressions anywhere else, which so far doesn't seem like it
[20:18:52] with me being 101% sure that the Berlin people will figure out the fix before 8am local time :]
[20:19:03] Seems something changed in core that requires some change in wikibase
[20:19:18] That hasn't been done yet or not deployed yet
[20:19:30] audephone: there was a few exceptions on terbium related to getConfiguration . I guess we can look at it tomorrow
[20:19:36] Ok
[20:20:07] (CR) Hashar: [C: 2] "https://en.wikipedia.org/wiki/WP:BB" [mediawiki-config] - https://gerrit.wikimedia.org/r/310948 (https://phabricator.wikimedia.org/T145819) (owner: Hashar)
[20:20:22] need a drink first
[20:20:31] (CR) Thcipriani: "Looks like this moves wikidatawiki to wmf.19 but keeps testwikidatawiki at wmf.18" [mediawiki-config] - https://gerrit.wikimedia.org/r/310948 (https://phabricator.wikimedia.org/T145819) (owner: Hashar)
[20:20:36] (Merged) jenkins-bot: All wiki but wikidatawiki to php-1.28.0-wmf.19 [mediawiki-config] - https://gerrit.wikimedia.org/r/310948 (https://phabricator.wikimedia.org/T145819) (owner: Hashar)
[20:21:23] hashar: did you see my comment? Looks like that patch moves wikidatawiki to wmf.19
[20:21:35] but keeps testwikidatawiki at wmf.18
[20:23:09] ahh
[20:23:12] my vim foo is bad
[20:23:39] that is testwikidata bah
[20:23:46] keeping that at .18 as well
[20:23:52] kk
[20:24:33] (PS6) Andrew Bogott: labspuppetbackend: Add prefix delete [puppet] - https://gerrit.wikimedia.org/r/310879 (https://phabricator.wikimedia.org/T133412)
[20:24:42] (PS1) Yuvipanda: puppet: Add option to use newer ENC [puppet] - https://gerrit.wikimedia.org/r/310952 (https://phabricator.wikimedia.org/T91990)
[20:24:52] (PS2) Yuvipanda: puppet: Add option to use newer ENC [puppet] - https://gerrit.wikimedia.org/r/310952 (https://phabricator.wikimedia.org/T91990)
[20:25:01] (PS1) Hashar: Fix wikidata to .18 (previous was testwikidata) [mediawiki-config] - https://gerrit.wikimedia.org/r/310953 (https://phabricator.wikimedia.org/T145819)
[20:25:22] thcipriani: https://gerrit.wikimedia.org/r/310953 sorry
[20:25:27] I guess I am tired :)
[20:25:31] PROBLEM - HP RAID on ms-be1026 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds.
[20:25:35] will babysit that one for half an hour and get to bed [20:25:40] and if in doubt revert [20:25:59] (03CR) 10Thcipriani: [C: 031] Fix wikidata to .18 (previous was testwikidata) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/310953 (https://phabricator.wikimedia.org/T145819) (owner: 10Hashar) [20:26:05] sounds good to me :) [20:26:11] (03CR) 10Hashar: "Sorry bad search and replace :(" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/310948 (https://phabricator.wikimedia.org/T145819) (owner: 10Hashar) [20:26:22] (03CR) 10Hashar: [C: 032] Fix wikidata to .18 (previous was testwikidata) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/310953 (https://phabricator.wikimedia.org/T145819) (owner: 10Hashar) [20:26:26] (03PS3) 10Yuvipanda: puppet: Add option to use newer ENC [puppet] - 10https://gerrit.wikimedia.org/r/310952 (https://phabricator.wikimedia.org/T91990) [20:26:33] so I guess tomorrow I will move testwikidatawiki to .19 [20:26:40] if it helps debugging [20:26:50] (03Merged) 10jenkins-bot: Fix wikidata to .18 (previous was testwikidata) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/310953 (https://phabricator.wikimedia.org/T145819) (owner: 10Hashar) [20:27:04] (03CR) 10Andrew Bogott: [C: 032] labspuppetbackend: Add prefix delete [puppet] - 10https://gerrit.wikimedia.org/r/310879 (https://phabricator.wikimedia.org/T133412) (owner: 10Andrew Bogott) [20:27:59] !log hashar@tin rebuilt wikiversions.php and synchronized wikiversions files: All wiki to .19. Keep testwikidata and wikidata at .18 (commits: 38603f0 770d336) [20:28:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:31:09] PROBLEM - HP RAID on ms-be1023 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds.
[20:35:27] those DBTransaction errors are surging :( [20:35:55] !log increasing number of shards per node for dewiki_content index to 2 on elasticsearch codfw [20:36:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:37:21] hmm was a spike of dberrors [20:38:01] RECOVERY - HP RAID on ms-be1026 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [20:38:10] (03PS1) 10Andrew Bogott: labspuppetbackend: fixed mysql typo [puppet] - 10https://gerrit.wikimedia.org/r/310958 (https://phabricator.wikimedia.org/T133412) [20:38:18] yeah, looks mostly fine. Still seeing a bunch of 'Failed to run getConfiguration.php' but evidently not enough to be trending in logstash so maybe it's just concentration bias [20:38:29] RECOVERY - HP RAID on ms-be1023 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [20:39:15] probably [20:39:24] for wikidata it is best to wait for the wikidata folks' investigation [20:39:29] (03CR) 10Andrew Bogott: [C: 032] labspuppetbackend: fixed mysql typo [puppet] - 10https://gerrit.wikimedia.org/r/310958 (https://phabricator.wikimedia.org/T133412) (owner: 10Andrew Bogott) [20:40:41] "This script must be run from the command line" eek [20:41:11] RECOVERY - puppet last run on mw1182 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [20:42:09] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:43:34] 06Operations, 10GlobalRename, 10MediaWiki-extensions-CentralAuth, 13Patch-For-Review: GlobalRename gets stuck sometimes - https://phabricator.wikimedia.org/T137973#2642109 (10MarcoAurelio) May I suggest to create a script which doesn't need to be run on each wiki where the rename gets stuck. I think it mig...
[20:44:19] PROBLEM - HP RAID on ms-be1024 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [20:44:39] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [20:45:04] (03PS1) 10Andrew Bogott: puppet panel: Add a button to remove prefixes. [puppet] - 10https://gerrit.wikimedia.org/r/310962 (https://phabricator.wikimedia.org/T91990) [20:45:21] PROBLEM - HP RAID on ms-be1026 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [20:45:42] (03Draft1) 10Paladox: Switch analytics_cluster to the new mariadb::mylvmbackup module [puppet] - 10https://gerrit.wikimedia.org/r/310963 [20:45:49] PROBLEM - HP RAID on ms-be1023 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [20:46:50] (03PS2) 10Paladox: Switch analytics_cluster to the new mariadb::mylvmbackup module [puppet] - 10https://gerrit.wikimedia.org/r/310963 [20:47:04] (03CR) 10Andrew Bogott: [C: 032] puppet panel: Add a button to remove prefixes. [puppet] - 10https://gerrit.wikimedia.org/r/310962 (https://phabricator.wikimedia.org/T91990) (owner: 10Andrew Bogott) [20:48:05] (03Draft1) 10Paladox: Remove mysql_wmf::mylvmbackup module [puppet] - 10https://gerrit.wikimedia.org/r/310964 [20:48:19] RECOVERY - HP RAID on ms-be1023 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [20:49:07] (03PS2) 10Paladox: Remove mysql_wmf::mylvmbackup module [puppet] - 10https://gerrit.wikimedia.org/r/310964 [20:50:22] (03PS3) 10Paladox: Switch analytics_cluster to the new mariadb::mylvmbackup module [puppet] - 10https://gerrit.wikimedia.org/r/310963 [20:53:59] RECOVERY - HP RAID on ms-be1024 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [20:54:26] thcipriani: I think it is all good so far [20:54:31] so keeping all wikis to .19 [20:55:04] !log All wikis are on 
1.28.0-wmf.19 wikidatawiki / testwikidatawiki stick to .18 for now. [20:55:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:55:32] hashar: :) [20:55:39] PROBLEM - HP RAID on ms-be1023 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [20:57:05] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/310835 (owner: 10Filippo Giunchedi) [20:58:54] Error: Could not retrieve catalog from remote server: Error 400 on SERVER: /var/lib/puppet/lib/hiera/httpcache.rb:21: duplicate optional argument name [20:59:04] yuvipanda: the puppetmaster on integration-puppetmaster is borked / exploded :( [21:00:03] ah, nice. [21:00:19] integration-puppetmaster.integration.eqiad.wmflabs is the host :D [21:00:19] I'm pretty sure that's because it's a precise puppetmaster running a much older version of ruby [21:01:32] def read(path, _=nil, _=nil) eek [21:01:55] anyway I can't babysit that :D gotta sleep [21:02:05] I'm more in a mind to rip out rake [21:02:09] that is there to satisfy it [21:02:27] the function needs to have 3 params, even if 2 are unused, but then rake will complain and suggest I make them _s [21:02:32] and so I did and now it doesn't work on precise [21:02:36] (03CR) 10Dzahn: "i'm not the right reviewer for this" [puppet] - 10https://gerrit.wikimedia.org/r/310964 (owner: 10Paladox) [21:02:50] rake ? [21:03:01] (03CR) 10Dzahn: "i'm not the right reviewer for this" [puppet] - 10https://gerrit.wikimedia.org/r/310963 (owner: 10Paladox) [21:03:10] oh rubocop [21:03:29] yeah [21:03:50] PROBLEM - puppet last run on analytics1038 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:04:21] yuvipanda: maybe reuse the function signature of the parent class?
[21:04:43] it was, and rubocop complains about that, since the second two parameters are optional with defaults and I don't use them [21:05:40] yuvipanda: same as yesterday I guess [21:05:41] .rubocop_todo.yml [21:05:48] meh [21:05:50] Lint/UnusedMethodArgument: [21:06:00] I'll leave it broken for a while and fix it after I am done doing what I'm doing :) [21:06:05] or dish out that linting rule entirely [21:06:06] thanks for pointing it out, hashar [21:06:13] (03CR) 10Volans: "Given that we're going with a percentage that in theory could fit all cases, what about setting those values as defaults for warning and c" [puppet] - 10https://gerrit.wikimedia.org/r/309203 (https://phabricator.wikimedia.org/T144293) (owner: 10Alex Monk) [21:06:23] cause really sometimes rubocop is being pedantic :] [21:06:28] thx ! [21:06:58] sleepy sleepy [21:08:09] RECOVERY - HP RAID on ms-be1023 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [21:12:39] (03PS1) 10Andrew Bogott: Puppet panel: Remove some accidental copypasta from the puppet tab [puppet] - 10https://gerrit.wikimedia.org/r/311022 [21:14:00] PROBLEM - HP RAID on ms-be1024 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [21:14:13] (03CR) 10Volans: [C: 04-1] "See inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/309708 (owner: 10Alex Monk) [21:14:27] (03CR) 10Andrew Bogott: [C: 032] Puppet panel: Remove some accidental copypasta from the puppet tab [puppet] - 10https://gerrit.wikimedia.org/r/311022 (owner: 10Andrew Bogott) [21:15:29] PROBLEM - HP RAID on ms-be1023 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds.
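Note on the `read` signature discussed above: a minimal Ruby sketch, under stated assumptions. `ParentBackend` and `HttpCacheBackend` are hypothetical stand-ins, not the real hiera classes or the contents of httpcache.rb, and the parameter names are invented for illustration. Going by the error message in the log, the older Ruby on the precise puppetmaster rejects two optional parameters that are both named `_`. Giving each unused parameter its own underscore-prefixed name keeps the three-parameter signature and should still satisfy rubocop's Lint/UnusedMethodArgument, which ignores underscore-prefixed arguments. The log does not say whether the merged fix took this route.

```ruby
# Hypothetical stand-in for the parent class whose signature must be matched.
class ParentBackend
  def read(path, expected_type = nil, default = nil)
    [path, expected_type, default]
  end
end

# Hypothetical subclass that only needs the first argument.
class HttpCacheBackend < ParentBackend
  # The form quoted in the log, `def read(path, _=nil, _=nil)`, reuses the
  # name `_` for two optional parameters, which the older Ruby reported as
  # "duplicate optional argument name".
  # Distinct underscore-prefixed names avoid the duplicate while still
  # marking both parameters as intentionally unused.
  def read(path, _expected_type = nil, _default = nil)
    "fetched #{path}"
  end
end

backend = HttpCacheBackend.new
puts backend.read('/v1/example')
puts backend.read('/v1/example', Hash, 'fallback')
```

Callers that pass one argument or all three keep working, so the override stays interchangeable with the parent's method.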
[21:21:20] RECOVERY - HP RAID on ms-be1024 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [21:24:51] RECOVERY - HP RAID on ms-be1026 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [21:25:13] RECOVERY - HP RAID on ms-be1023 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [21:28:10] RECOVERY - puppet last run on analytics1038 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [21:30:45] (03PS1) 10Yuvipanda: labs: Fix httpyaml backend for precise puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/311025 [21:32:22] PROBLEM - HP RAID on ms-be1026 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [21:34:50] RECOVERY - HP RAID on ms-be1026 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [21:37:52] (03CR) 10Paladox: "@Chasemp sorry to ask but could you review and merge please?" [puppet] - 10https://gerrit.wikimedia.org/r/308313 (https://phabricator.wikimedia.org/T93645) (owner: 10Hashar) [21:38:11] (03CR) 10Paladox: "@Chasemp sorry to ask but could you review and merge please?" [puppet] - 10https://gerrit.wikimedia.org/r/310703 (https://phabricator.wikimedia.org/T115348) (owner: 10Paladox) [21:38:26] (03CR) 10Paladox: "@Chasemp sorry to ask but could you review and merge please?" [puppet] - 10https://gerrit.wikimedia.org/r/310706 (https://phabricator.wikimedia.org/T115348) (owner: 10Paladox) [21:39:02] (03CR) 10Paladox: "@Chasemp sorry to ask but could you review and merge please?" 
[puppet] - 10https://gerrit.wikimedia.org/r/308340 (https://phabricator.wikimedia.org/T93645) (owner: 10Paladox) [21:39:20] (03CR) 10Paladox: "Or any other labs team members please?" [puppet] - 10https://gerrit.wikimedia.org/r/308340 (https://phabricator.wikimedia.org/T93645) (owner: 10Paladox) [21:39:24] (03CR) 10Paladox: "Or any other labs team members please?" [puppet] - 10https://gerrit.wikimedia.org/r/310706 (https://phabricator.wikimedia.org/T115348) (owner: 10Paladox) [21:39:27] (03CR) 10Paladox: "Or any other labs team members please?" [puppet] - 10https://gerrit.wikimedia.org/r/310703 (https://phabricator.wikimedia.org/T115348) (owner: 10Paladox) [21:39:31] (03CR) 10Paladox: "Or any other labs team members please?" [puppet] - 10https://gerrit.wikimedia.org/r/308313 (https://phabricator.wikimedia.org/T93645) (owner: 10Hashar) [21:50:27] (03CR) 10Yuvipanda: [C: 032] labs: Fix httpyaml backend for precise puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/311025 (owner: 10Yuvipanda) [21:51:24] (03PS2) 10Yuvipanda: labs: Fix httpyaml backend for precise puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/311025 [21:54:49] PROBLEM - Host mw1294 is DOWN: PING CRITICAL - Packet loss = 100% [22:01:23] (03CR) 10Alex Monk: "I actually am setting them as the default. They're also shown in all the examples and some other places they don't need to be though." 
[puppet] - 10https://gerrit.wikimedia.org/r/309203 (https://phabricator.wikimedia.org/T144293) (owner: 10Alex Monk) [22:11:56] (03PS1) 10Andrew Bogott: Puppet Panel: Speed up loading of role documentation [puppet] - 10https://gerrit.wikimedia.org/r/311041 (https://phabricator.wikimedia.org/T91990) [22:28:53] (03PS1) 10Alex Monk: ruby-httpclient callers: Use the operating system's certificate store [puppet] - 10https://gerrit.wikimedia.org/r/311048 (https://phabricator.wikimedia.org/T145808) [22:31:29] (03CR) 10Volans: check_ssl: Use a maximum percentage of certificate validity time for determining alert state (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/309203 (https://phabricator.wikimedia.org/T144293) (owner: 10Alex Monk) [22:38:08] 06Operations, 10ops-esams: bast3001 garbled console - https://phabricator.wikimedia.org/T145756#2639892 (10Dzahn) I did a "racadm racreset" and afterwards i saw this again: ^��{���9By{�Nʄ?h(Z�ԫ�ʔ�^k���^j�)k_Ą�ۖN��{��8x��(^���nԯ�zBJB`�z����wB�ZC�^��{���9By{�Nʄ?h(Z�ԫ�ʔ�^k���^j�)k_Ą�ۖN��{��8x��(^� [22:38:36] heh [22:39:07] 06Operations, 10ops-esams: bast3001 garbled console - https://phabricator.wikimedia.org/T145756#2642310 (10Dzahn) 05Open>03Resolved a:03Dzahn reconnected one more time and the garbled stuff is gone: Debian GNU/Linux 8 bast3001 ttyS1 bast3001 login: [22:41:34] how do you check if ES indexes are frozen? [22:41:39] (03PS3) 10Alex Monk: check_ssl: Use a maximum percentage of certificate validity time for determining alert state [puppet] - 10https://gerrit.wikimedia.org/r/309203 (https://phabricator.wikimedia.org/T144293) [22:41:46] indices * [22:42:57] 06Operations, 10Monitoring, 07Wikimedia-Incident: Alert when ES indexes are freezed for more than 30 minutes - https://phabricator.wikimedia.org/T110171#1570456 (10Dzahn) How would you manually check whether they are frozen and for how long? 
[22:52:37] 06Operations, 10Monitoring, 10RESTBase-Cassandra: restbase/cassandra - multiple monitoring criticals - https://phabricator.wikimedia.org/T105216#2642336 (10Dzahn) 05Open>03Resolved a:03Dzahn meanwhile the original problems appear to be gone. now we just have the usual broken graphite check RESTBase H... [22:53:00] (03PS1) 10Andrew Bogott: Puppet Panel: Speed up role formatting [puppet] - 10https://gerrit.wikimedia.org/r/311057 (https://phabricator.wikimedia.org/T91990) [22:54:13] (03PS2) 10Andrew Bogott: Puppet Panel: Speed up role formatting [puppet] - 10https://gerrit.wikimedia.org/r/311057 (https://phabricator.wikimedia.org/T91990) [22:54:25] (03Abandoned) 10Andrew Bogott: Puppet Panel: Speed up loading of role documentation [puppet] - 10https://gerrit.wikimedia.org/r/311041 (https://phabricator.wikimedia.org/T91990) (owner: 10Andrew Bogott) [22:59:55] 06Operations: Add tmux to maps (or other) servers - https://phabricator.wikimedia.org/T106191#1461505 (10Dzahn) I see this nowadays "modules/base/manifests/standard_packages.pp: 'tmux'," So tmux is in standard_packages in the base module. that should be installed everywhere. [23:00:05] RoanKattouw, ostriches, MaxSem, awight, and Dereckson: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160915T2300). Please do the needful. [23:00:29] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:02:41] 06Operations: Add tmux to maps (or other) servers - https://phabricator.wikimedia.org/T106191#2642388 (10Dzahn) 05Open>03Resolved a:03Dzahn Yep, confirmed it's installed on all maps-test machines. [23:05:20] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [23:05:24] 06Operations: long-running root console sessions - https://phabricator.wikimedia.org/T105869#1453417 (10Dzahn) So.. we still want the monitoring? 
Just let icinga run the same command you used with a small wrapper around it? [23:12:19] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:12:39] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:14:49] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [23:14:59] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [23:22:46] (03CR) 10Andrew Bogott: [C: 032] Puppet Panel: Speed up role formatting [puppet] - 10https://gerrit.wikimedia.org/r/311057 (https://phabricator.wikimedia.org/T91990) (owner: 10Andrew Bogott) [23:32:30] 06Operations, 06Reading-Infrastructure-Team, 06Services, 06Services-next, 07Security-General: Protect sensitive user-related information with a UserData / auth / session service - https://phabricator.wikimedia.org/T140813#2642443 (10GWicke) > While in a narrow sense this seems straightforward, a password... [23:38:20] !log legoktm@terbium:~$ foreachwiki extensions/WikimediaMaintenance/createExtensionTables.php babel # T145366 [23:38:21] T145366: Create and populate babel database table on Wikimedia wikis - https://phabricator.wikimedia.org/T145366 [23:38:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:38:30] PROBLEM - puppet last run on cp4006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues