[00:00:29] (03PS1) 10Jforrester: [GovernanceWiki] Disable wgRawHTML, no longer needed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462834 (https://phabricator.wikimedia.org/T201285) [00:02:58] <3 [00:03:03] plus all the 1's [00:03:45] (03CR) 10Brian Wolff: [C: 031] "<3" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462834 (https://phabricator.wikimedia.org/T201285) (owner: 10Jforrester) [00:04:09] Huh, internalwiki and collabwiki both have raw html. I wonder what that is about [00:05:10] are they even used these days? [00:06:13] * bawolff has no idea [00:08:43] RC is dead on internal, not quite so on collab [00:08:46] (03PS3) 10Dzahn: mediawiki::web::prod_sites: convert wikinews.org [puppet] - 10https://gerrit.wikimedia.org/r/462480 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [00:09:11] bawolff: My hope is to delete the concept of raw HTML mode from MediaWiki, one day. [00:09:25] (03CR) 10Dzahn: "PS3: fixed sort_url/short_url typo using gerrit editor" [puppet] - 10https://gerrit.wikimedia.org/r/462480 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [00:09:31] DIE WITH FIRE [00:09:42] (03CR) 10jerkins-bot: [V: 04-1] mediawiki::web::prod_sites: convert wikinews.org [puppet] - 10https://gerrit.wikimedia.org/r/462480 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [00:09:43] I mean, i guess its ok for third parties having some sort of internal wiki [00:09:52] Even then... [00:10:08] preferrably behind HTTP auth so that outsiders cannot interact with even the auth page [00:10:12] Only safe use case is personal, one-account wikis. 
[00:12:16] (03CR) 10Dzahn: mediawiki::web::prod_sites: convert mediawiki.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/462425 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [00:13:16] (03PS2) 10Dzahn: mediawiki::web::prod_sites: convert wikisource.org [puppet] - 10https://gerrit.wikimedia.org/r/462486 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [00:14:17] (03CR) 10jerkins-bot: [V: 04-1] mediawiki::web::prod_sites: convert wikisource.org [puppet] - 10https://gerrit.wikimedia.org/r/462486 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [00:14:32] (03PS2) 10Dzahn: mediawiki::web::prod_sites: convert wikibooks.org [puppet] - 10https://gerrit.wikimedia.org/r/462487 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [00:15:31] (03CR) 10jerkins-bot: [V: 04-1] mediawiki::web::prod_sites: convert wikibooks.org [puppet] - 10https://gerrit.wikimedia.org/r/462487 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [00:17:05] (03PS3) 10Dzahn: mediawiki::web::prod_sites: convert wikiquote.org [puppet] - 10https://gerrit.wikimedia.org/r/462478 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [00:17:52] (03CR) 10jerkins-bot: [V: 04-1] mediawiki::web::prod_sites: convert wikiquote.org [puppet] - 10https://gerrit.wikimedia.org/r/462478 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [00:22:32] (03PS1) 10Dzahn: icinga::ircbot: move from module to profile [puppet] - 10https://gerrit.wikimedia.org/r/462835 [00:23:18] (03CR) 10jerkins-bot: [V: 04-1] icinga::ircbot: move from module to profile [puppet] - 10https://gerrit.wikimedia.org/r/462835 (owner: 10Dzahn) [00:24:10] (03CR) 10Dzahn: [C: 04-1] "ok, it complains rightfully that there is no call to Hiera for these parameters.. 
need to move it there" [puppet] - 10https://gerrit.wikimedia.org/r/462835 (owner: 10Dzahn) [00:28:45] (03CR) 10Dzahn: "i did not mean to cause the verified -1, in fact i tried using the gerrit inline editor hoping it would be smart about rebasing" [puppet] - 10https://gerrit.wikimedia.org/r/462480 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [00:29:47] (03CR) 10Dzahn: [C: 04-1] "current status: https://phabricator.wikimedia.org/T204056#4584966" [dns] - 10https://gerrit.wikimedia.org/r/459835 (https://phabricator.wikimedia.org/T204056) (owner: 10Reedy) [00:30:55] (03CR) 10Dzahn: "sorry if i made rebasing worse.. but there was that consistent typo in multiple ones" [puppet] - 10https://gerrit.wikimedia.org/r/462487 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [00:35:04] mutante: the in-line editor uses which ever parent the change is on [00:35:37] You have to rebase the change first if you want the inline editor to use the latest base [00:36:05] 10Operations, 10Core Platform Team, 10HHVM, 10TechCom-RFC (TechCom-Approved), 10User-ArielGlenn: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (10Krinkle) [00:36:10] Hoping gerrit would be smart...That was the first problem [00:36:25] Lol [00:36:54] If it was that smart then they resolved merge conflicts :) [00:58:42] (03CR) 10Hashar: [C: 031] Install subversion on application servers [puppet] - 10https://gerrit.wikimedia.org/r/462673 (https://phabricator.wikimedia.org/T204801) (owner: 10Muehlenhoff) [01:01:14] ! [remote rejected] HEAD -> refs/for/master (internal server error: Error inserting change/patchset) [01:01:27] Never gotten that error before [01:01:50] bawolff: hi, fill it as a task against #gerrit :] [01:02:10] In umm, the mean time, how do I submit code? [01:02:16] Or do i just not? 
[01:03:05] Oh, second time's the charm [01:03:07] just saying, folks in charge of Gerrit are most probably not around at this time :) [01:03:17] it is pure luck I have seen your message (it is 3am here hehe) [01:03:47] Seems like the error went away, but I'll file a bug [01:05:32] T205503 [01:05:33] T205503: gerrit sometimes giving "internal server error: Error inserting change/patchset" - https://phabricator.wikimedia.org/T205503 [01:05:51] bawolff: thx :) [01:06:28] I will dig in the logs tomorrow [01:06:51] Thanks :) [01:30:04] 10Operations, 10Domains, 10Traffic, 10WMF-Legal, 10Patch-For-Review: Move wikimedia.ee under WM-EE - https://phabricator.wikimedia.org/T204056 (10Dzahn) a:05tramm>03CRoslof [01:32:48] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install mwmaint1002.eqiad.wmnet - https://phabricator.wikimedia.org/T201343 (10Dzahn) The nutcracker issue is T204450 and could have also been solved by rebooting or applying the role without using spare first. [02:17:43] (03CR) 10Dzahn: [C: 032] "https://puppet-compiler.wmflabs.org/compiler1002/12619/einsteinium.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/462833 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [02:32:32] !log l10nupdate@deploy1001 scap sync-l10n completed (1.32.0-wmf.22) (duration: 12m 42s) [02:32:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:54:05] (03PS1) 10Andrew Bogott: Horizon: move xtools to eqiad1-r [puppet] - 10https://gerrit.wikimedia.org/r/462841 (https://phabricator.wikimedia.org/T205232) [02:55:09] (03CR) 10Andrew Bogott: [C: 032] Horizon: move xtools to eqiad1-r [puppet] - 10https://gerrit.wikimedia.org/r/462841 (https://phabricator.wikimedia.org/T205232) (owner: 10Andrew Bogott) [02:55:52] !log l10nupdate@deploy1001 scap sync-l10n completed (1.32.0-wmf.23) (duration: 05m 10s) [02:55:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:06:46] !log l10nupdate@deploy1001 
ResourceLoader cache refresh completed at Wed Sep 26 03:06:46 UTC 2018 (duration 10m 54s) [03:06:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:16:21] (03PS2) 10Tim Starling: Add export logging channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460349 (https://phabricator.wikimedia.org/T203424) (owner: 10BPirkle) [03:16:42] (03CR) 10Tim Starling: [C: 032] Add export logging channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460349 (https://phabricator.wikimedia.org/T203424) (owner: 10BPirkle) [03:18:15] (03Merged) 10jenkins-bot: Add export logging channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460349 (https://phabricator.wikimedia.org/T203424) (owner: 10BPirkle) [03:20:20] (03CR) 10jenkins-bot: Add export logging channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460349 (https://phabricator.wikimedia.org/T203424) (owner: 10BPirkle) [03:21:25] !log tstarling@deploy1001 Synchronized wmf-config/InitialiseSettings.php: add export logging channel for g 461647 (duration: 00m 57s) [03:21:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:29:19] 10Operations, 10Mail, 10Wikimedia-Mailing-lists, 10Security: Sender email spoofing - https://phabricator.wikimedia.org/T160529 (10Pine) For [[ https://lists.wikimedia.org/pipermail/education/2018-September/002082.html | this email]]: Received: from localhost ([::1]:54186 helo=fermium.wikimedia.org) by fer... [03:38:39] TimStarling: OK, to roll out a patch meanwhile? [03:39:52] * Krinkle staging on mwdebug2001 [03:41:08] you want me to look at the patch or are you just asking whether I'm doing anything conflicting? [03:41:20] Misread the log [03:41:23] thought you started a scap [03:41:30] just sync-file [03:41:30] And appeared to be aborted [03:41:33] given it wasn't running [03:41:40] Yeah, ok. 
[03:42:05] !log krinkle@deploy1001 Synchronized php-1.32.0-wmf.23/includes/page/ImageHistoryPseudoPager.php: T204796 - I17455fef0d8 (duration: 00m 58s) [03:42:09] I assume you're going to cherry-pick that patch to wmf.23 [03:42:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:42:14] T204796: Fatal error on File description for certain 'offset' values (TimestampException: Invalid timestamp) - https://phabricator.wikimedia.org/T204796 [03:42:14] Anyhow, I'm done now :) [03:44:00] you mean 461647? I guess I should, I told bpirkle he would have logs to look at tomorrow [03:46:24] ok, cherry picking and deploying it now [04:09:36] !log tstarling@deploy1001 Synchronized php-1.32.0-wmf.23/includes/specials/SpecialExport.php: enable export logging channel (duration: 00m 55s) [04:09:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:17:27] !log tstarling@deploy1001 Synchronized php-1.32.0-wmf.23/includes/specials/SpecialExport.php: enable export logging channel (attempt 2) (duration: 00m 51s) [04:17:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:12:39] (03PS1) 10Marostegui: db-eqiad.php: Repool some hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462850 [05:14:20] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool some hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462850 (owner: 10Marostegui) [05:15:41] (03Merged) 10jenkins-bot: db-eqiad.php: Repool some hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462850 (owner: 10Marostegui) [05:16:09] (03CR) 10jenkins-bot: db-eqiad.php: Repool some hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462850 (owner: 10Marostegui) [05:17:45] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1088, db1098:3316 and db1098:3317 (duration: 01m 06s) [05:17:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:20:11] !log Deploy schema change on 
s8 eqiad, this will generate lag - T203709 [05:20:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:20:19] T203709: Schema change for adding indexes of ct_tag_id - https://phabricator.wikimedia.org/T203709 [05:23:17] (03PS1) 10Marostegui: db-codfw.php: Depool db2085:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462851 [05:24:59] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2085:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462851 (owner: 10Marostegui) [05:26:17] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2085:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462851 (owner: 10Marostegui) [05:28:53] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Depool db2085:3311 (duration: 00m 56s) [05:28:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:29:16] !log Deploy schema change on db2085:3311 [05:29:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:30:21] (03CR) 10jenkins-bot: db-codfw.php: Depool db2085:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462851 (owner: 10Marostegui) [05:37:11] (03CR) 10Giuseppe Lavagetto: "Overall LGTM. A small stylistic suggestion in the comments." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/462791 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [05:51:48] (03PS6) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: convert wikivoyage.org [puppet] - 10https://gerrit.wikimedia.org/r/461976 (https://phabricator.wikimedia.org/T196968) [05:57:59] (03PS7) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: convert wikivoyage.org [puppet] - 10https://gerrit.wikimedia.org/r/461976 (https://phabricator.wikimedia.org/T196968) [05:58:01] (03PS1) 10Giuseppe Lavagetto: mediawiki::web::vhost: fix function call [puppet] - 10https://gerrit.wikimedia.org/r/462853 [06:03:51] (03CR) 10Giuseppe Lavagetto: [C: 031] "diffs LGTM: https://puppet-compiler.wmflabs.org/compiler1002/12621/" [puppet] - 10https://gerrit.wikimedia.org/r/461976 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [06:09:15] (03CR) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: convert wikiversity.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/462424 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [06:12:02] (03PS2) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: convert wikiversity.org [puppet] - 10https://gerrit.wikimedia.org/r/462424 (https://phabricator.wikimedia.org/T196968) [06:18:13] (03PS3) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: convert wikiversity.org [puppet] - 10https://gerrit.wikimedia.org/r/462424 (https://phabricator.wikimedia.org/T196968) [06:19:37] 10Operations, 10MediaWiki-extensions-WikibaseClient, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, and 8 others: Investigate more efficient memcached solution for CacheAwarePropertyInfoStore - https://phabricator.wikimedia.org/T97368 (10Addshore) @Jonas It looks like this didn't get into the 23 br... [06:23:54] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "the upload_rewrite parameter doesn't do the right thing here. We need to change its behaviour." 
[puppet] - 10https://gerrit.wikimedia.org/r/462424 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [06:38:35] (03CR) 10Muehlenhoff: mediawiki::web::prod_sites: convert wikiversity.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/462424 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [06:50:49] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/461976 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [06:51:49] 10Operations, 10ops-eqiad, 10Analytics, 10decommission, 10User-Elukey: Decommission analytics100[1,2] - https://phabricator.wikimedia.org/T205507 (10elukey) [06:55:09] (03PS1) 10Elukey: network::constants: remove analytics100[1,2] [puppet] - 10https://gerrit.wikimedia.org/r/462857 (https://phabricator.wikimedia.org/T205507) [06:56:11] (03CR) 10Elukey: [C: 032] network::constants: remove analytics100[1,2] [puppet] - 10https://gerrit.wikimedia.org/r/462857 (https://phabricator.wikimedia.org/T205507) (owner: 10Elukey) [06:58:15] 10Operations, 10ops-eqiad, 10Analytics, 10decommission, and 2 others: Decommission analytics100[1,2] - https://phabricator.wikimedia.org/T205507 (10elukey) [07:11:50] (03PS1) 10Volans: Custom fields: fix field type [software/netbox] - 10https://gerrit.wikimedia.org/r/462860 (https://phabricator.wikimedia.org/T199083) [07:14:35] (03PS1) 10Marostegui: Revert "db-codfw.php: Depool db2085:3311" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462862 [07:15:21] (03CR) 10Zoranzoki21: "Should this be removed and for other wikis which have wgRawHTML enabled? 
I ask because of:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462834 (https://phabricator.wikimedia.org/T201285) (owner: 10Jforrester) [07:15:32] (03CR) 10Zoranzoki21: [C: 031] [GovernanceWiki] Disable wgRawHTML, no longer needed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462834 (https://phabricator.wikimedia.org/T201285) (owner: 10Jforrester) [07:16:22] (03PS1) 10Tarrow: Set testwikidatawiki CacheEpoch to 2018-09-19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462864 (https://phabricator.wikimedia.org/T205330) [07:16:55] (03PS1) 10Tarrow: Set wikidatawiki CacheEpoch to 2018-09-03 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462865 (https://phabricator.wikimedia.org/T205330) [07:16:59] (03CR) 10Marostegui: [C: 032] Revert "db-codfw.php: Depool db2085:3311" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462862 (owner: 10Marostegui) [07:17:29] (03CR) 10Addshore: [C: 032] Set testwikidatawiki CacheEpoch to 2018-09-19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462864 (https://phabricator.wikimedia.org/T205330) (owner: 10Tarrow) [07:17:55] o/ marostegui [07:18:03] addshore: up [07:18:14] (03PS1) 10Tarrow: Set wikidatawiki CacheEpoch to 2018-09-10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462866 (https://phabricator.wikimedia.org/T205330) [07:18:16] (03Merged) 10jenkins-bot: Revert "db-codfw.php: Depool db2085:3311" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462862 (owner: 10Marostegui) [07:18:29] got paged, where is icinga-wm? 
[07:18:34] (03PS1) 10Tarrow: Set wikidatawiki CacheEpoch to 2018-09-15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462867 (https://phabricator.wikimedia.org/T205330) [07:18:35] we are on db1115 [07:18:38] checking tendril seems down [07:18:39] ack, I'll fix icinga [07:18:55] (03PS3) 10Tarrow: Invalidate wikidatawiki cache with wgCacheEpoch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462728 (https://phabricator.wikimedia.org/T205330) [07:19:03] here too if needed [07:19:09] * addshore removes his +2 on that patch to wait for whatever is happening [07:19:10] I am on db1115 [07:19:19] w [07:19:30] server seems up [07:19:30] maybe network went down? [07:19:36] yeah [07:19:43] server seems up and with 81 days uptime [07:19:58] !log restarted ircecho on einsteinium "icinga-wm quit (Reason: Remote host closed the connection)" [07:20:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:20:06] mysql uptime on db1115 is high [07:20:10] so the server is ok [07:20:29] let's ping it from icinga [07:20:40] volans: we may have icinga network issues [07:20:52] jynus: I'm pinging it, all good [07:20:54] ping works from there [07:21:42] mmmh icinga-wm not reconnecting but ofc doesn't log any error :( [07:21:43] There are some dropped packets on db1115 but just 8 and those could be old [07:21:58] godog: could you check the IRC side just to exclude it was banned? [07:21:58] also unpollable from dbmonitor, but ping works [07:22:06] volans: sure, checking [07:22:09] thx [07:22:20] I am going to downtime db1115 to avoid more pages [07:22:31] event scheduler seems down [07:22:35] that is weird [07:23:27] it is off [07:23:30] why? [07:23:39] jynus: Sep 26 07:15:41 db1115 mysqld[3357]: 2018-09-26 7:15:41 139709208505088 [ERROR] Event_scheduler::execute_top: Can not create event worker thread (errno=11). 
Stopping event scheduler [07:23:53] Checking HW logs just in case [07:23:53] but it is not the only issue [07:24:10] db1115 services are down from einstenium perspective [07:24:23] something else is happening [07:25:08] I am running the nagios checks of the host locally, but they all seems ok [07:25:09] HealthState = 25 (Critical failure) [07:25:12] I am checking what that is [07:25:19] [Sun Sep 2 16:01:16 2018] md: md2: data-check done. [07:25:33] (03CR) 10jenkins-bot: Revert "db-codfw.php: Depool db2085:3311" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462862 (owner: 10Marostegui) [07:25:46] marostegui: where do you get that, IPMI? [07:25:48] volans: sigh, so icinga-wm was in #wikimedia-overflow and this channel is +r, iirc ircecho doesn't IDENTIFY [07:25:56] volans: thus I don't think it can rejoin here [07:26:03] jynus: yeah, checking what that is related to [07:26:15] godog: I think they were working on adding auth to ircecho, don't think is in prod yet [07:26:32] icinga cannot ping ms-be2027.codfw.wmnet [07:26:33] volans: ok, I'm going to -r temporarily [07:26:40] godog: ack I'll restart it [07:27:05] there we go [07:27:10] * godog looks at the pile of tech debt our irc bots are [07:27:46] volans: ms-be2027 is also down [07:27:50] marostegui: should we stop mysql? 
[07:27:56] that error seems old [07:28:02] it is not doing what we want anyway [07:28:08] I'll take a look at ms-be2027 [07:28:10] seems related to a memory error from july [07:28:13] moritzm: yeah I was checking it now, cannot shs [07:28:13] jynus: +1 [07:28:17] maybe nrpe failed [07:28:26] jynus: I would stop mysql and reboot the server [07:28:28] maybe disk failed and it is creating a weird state [07:28:33] but we don't have th visibility [07:28:34] just noticed it when checking why Cumin failed to connect from cumin1001 [07:28:41] eheheh [07:28:45] !log stoping mariadb at db1115 [07:28:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:18] jynus: /proc/mdstat looks clean for now [07:29:34] mysql stopped [07:29:39] let's reboot? [07:29:49] I am on the idrac, I can attach myself to the console and watch the boot [07:29:53] one sec [07:29:57] !log repair /dev/sdh1 on ms-be2041 - T199198 [07:30:03] want to umount volumes [07:30:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:05] T199198: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198 [07:30:09] jynus: good idea [07:30:32] The only HW error I can see is an old error from july, related to a DIMM [07:30:40] RecordData = Correctable memory error rate exceeded for DIMM_B2. 
[07:30:54] yeah, saw that [07:31:21] ok, I am ready to reboot when you are [07:31:34] (03CR) 10Smalyshev: Add lexemes dump as separate dump (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/461862 (https://phabricator.wikimedia.org/T202830) (owner: 10Smalyshev) [07:31:46] (03PS4) 10Smalyshev: Add lexemes dump as separate dump [puppet] - 10https://gerrit.wikimedia.org/r/461862 (https://phabricator.wikimedia.org/T202830) [07:31:49] jynus: go for it [07:31:52] volans: going to leave the channel -r for now [07:31:58] !log restart db1115 [07:32:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:32:18] godog: ack, but what changed since yesterday? [07:32:26] godog: ok, I've left the mgmt [07:32:40] but system seems to have crashed hard, could not get a console even [07:33:07] jynus: system booting up, let's see... [07:33:11] icinga-wm left at 23:59:41 CEST [07:33:19] so 21:59:41 UTC [07:33:44] in theory you have the log of events at icinga history [07:34:10] jynus: it is booting up fine [07:34:24] server upo [07:34:25] marostegui: BTW, the downtime won't prevent from further paging [07:34:55] !log powercycle ms-be2030 - no console [07:35:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:35] volans: good question, no idea [07:35:43] so it seems to me that network stop responding on db1115 [07:36:15] * volans restarted it again [07:36:20] jynus: but ping was up [07:36:33] tcp stopped responding? [07:36:42] or a firewall issue? 
[07:36:48] 10Operations, 10netops: Enable cumin1001 in router ACLs - https://phabricator.wikimedia.org/T205513 (10MoritzMuehlenhoff) [07:36:55] RECOVERY - Host ms-be2027 is UP: PING OK - Packet loss = 0%, RTA = 36.22 ms [07:36:56] !log stopping s2 replication on dbstore2002 (compressing tables) [07:36:59] I will start mariadb now unless you want to check something [07:37:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:37:07] jynus: go for it [07:37:43] tail the mariadb log [07:37:54] 10Operations, 10Analytics, 10Research-management: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843 (10elukey) Getting back to this task: the plan is now to "free" stat1005 and move all users to a new host, stat1007. This will allow us to reboot/compile/etc.. on stat1005 without impac... [07:37:55] tailing [07:38:07] 10Operations, 10Analytics, 10Research-management, 10User-Elukey: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843 (10elukey) [07:38:19] jynus: the critical state on HW logs recovered [07:38:26] HealthState = 5 (OK) [07:38:28] After the reboot [07:38:55] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review: setup/install an-coord1001/wmf7621 - https://phabricator.wikimedia.org/T204970 (10elukey) Any chance that this work can be done before the end of next week? If so I'll plan some maintenance time for Hadoop :) [07:40:05] RECOVERY - HTTP-dbtree on dbmonitor1001 is OK: HTTP OK: HTTP/1.1 200 OK - 77211 bytes in 1.344 second response time [07:40:25] RECOVERY - HTTP-dbtree on dbmonitor2001 is OK: HTTP OK: HTTP/1.1 200 OK - 78692 bytes in 1.518 second response time [07:40:46] things seem back to normal [07:40:58] addshore: I think you can go ahead [07:41:01] maybe an OOM? 
[07:41:01] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Repool db2085:3311 (duration: 00m 56s) [07:41:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:08] jynus: not on dmesg [07:41:12] marostegui: thanks! :) the testwikidata patch shouldnt really do anything anyway :) [07:41:13] no OOM logged there [07:41:17] I know, but want to check machine stats [07:41:24] jynus: event scheduler started? [07:41:30] yes [07:41:34] (03PS2) 10Addshore: Set testwikidatawiki CacheEpoch to 2018-09-19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462864 (https://phabricator.wikimedia.org/T205330) (owner: 10Tarrow) [07:41:38] (03CR) 10Addshore: [C: 032] Set testwikidatawiki CacheEpoch to 2018-09-19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462864 (https://phabricator.wikimedia.org/T205330) (owner: 10Tarrow) [07:41:58] the maraidb logs are full with errors about the event Scheduler on db1115 [07:42:05] banyek: that is "normal" [07:42:10] :( [07:42:18] banyek: marostegui: can you check the tendril status [07:42:24] I was [07:42:25] (03PS1) 10Muehlenhoff: Add /var/log/debdeploy/ to directories for the server package [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/462870 [07:42:25] works fine for me [07:42:31] if everthing that is lagging should be, etc [07:42:35] yep [07:42:41] Those lagging hosts are s8 [07:42:41] db1092 seems down [07:42:45] and they should be lagging [07:42:48] (03Merged) 10jenkins-bot: Set testwikidatawiki CacheEpoch to 2018-09-19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462864 (https://phabricator.wikimedia.org/T205330) (owner: 10Tarrow) [07:43:01] (03CR) 10Muehlenhoff: [V: 032 C: 032] Add /var/log/debdeploy/ to directories for the server package [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/462870 (owner: 10Muehlenhoff) [07:43:03] (03CR) 10Volans: [C: 031] "LGTM" [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/462870 (owner: 10Muehlenhoff) [07:43:04] yes, I am 
checking db1092, which should be lagging, but not down [07:43:09] actually tendril feels a bit slower than usual - I don't have metric, just the UX feel [07:43:29] db1092 is definitely down [07:43:32] I will investigate [07:43:42] banyek: can you downtime db1092? [07:43:50] it is already downtime [07:43:52] I'm on it [07:43:58] Ah yes, I downtimed all s8 in the morning [07:44:01] 10Operations, 10ops-eqiad, 10Operations-Software-Development, 10Patch-For-Review: rack/setup/install clustermgmt1001.eqiad.wmnet (new cumin master) - https://phabricator.wikimedia.org/T201346 (10MoritzMuehlenhoff) I've created T205513 for the router changes and made some tests with Cumin and debdeploy, loo... [07:44:05] ok [07:44:08] marostegui: why all services? [07:44:13] jynus: s8 eqiad [07:44:24] do only lag [07:44:33] or you may miss other problems [07:44:33] I was using icinga-downtime [07:45:01] it is almost faster to use the search function, if you disable javascript [07:45:13] searching for "MariaDB Slave Lag: s8" [07:45:15] db1092 is frozen [07:45:46] is it also down or only network? [07:45:47] description=Smart Storage Battery failure (Battery 1, service information: 0x0A). 
Action: Gather AHS log and contact Support [07:46:15] I will get all the info and get a task for chris [07:46:19] this host is under warranty [07:46:23] !log repair /dev/sdd1 on ms-be1043 - T199198 [07:46:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:46:32] T199198: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198 [07:47:05] !log reboot an-master1002 as attempt to clear out some systemd@fsck alarms [07:47:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:47:24] I think I will remove db1011 and db1118 from tendril, to minimize event errors [07:48:55] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[gerrit:462864|Set testwikidatawiki CacheEpoch to 2018-09-19]] T205330 (duration: 00m 56s) [07:49:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:03] T205330: Adding a value to a statement prompts for a property - https://phabricator.wikimedia.org/T205330 [07:50:14] RECOVERY - Check systemd state on an-master1002 is OK: OK - running: The system is fully operational [07:50:51] !log Hard reset db1092, server crashed - T205514 [07:50:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:59] T205514: db1092 crashed - https://phabricator.wikimedia.org/T205514 [07:53:46] (03CR) 10jenkins-bot: Set testwikidatawiki CacheEpoch to 2018-09-19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462864 (https://phabricator.wikimedia.org/T205330) (owner: 10Tarrow) [07:59:52] (03PS1) 10Marostegui: db-eqiad.php: Depool db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462872 (https://phabricator.wikimedia.org/T205514) [08:00:18] !log donwtiming db1107 as upgrade (kernel & mariadb) [08:00:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:39] (03Abandoned) 10Muehlenhoff: Install subversion on application servers [puppet] - 
10https://gerrit.wikimedia.org/r/462673 (https://phabricator.wikimedia.org/T204801) (owner: 10Muehlenhoff) [08:01:34] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462872 (https://phabricator.wikimedia.org/T205514) (owner: 10Marostegui) [08:02:15] RECOVERY - Filesystem available is greater than filesystem size on ms-be1043 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be1043&var-datasource=eqiad%2520prometheus%252Fops [08:02:56] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462872 (https://phabricator.wikimedia.org/T205514) (owner: 10Marostegui) [08:04:17] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1092, server crashed - T205514 (duration: 00m 56s) [08:04:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:25] (03PS1) 10Jcrespo: mariadb: Disable notifications on db1092, crashed [puppet] - 10https://gerrit.wikimedia.org/r/462873 (https://phabricator.wikimedia.org/T205514) [08:04:26] T205514: db1092 crashed - https://phabricator.wikimedia.org/T205514 [08:04:42] !log stop eventlogging_sync on db1108 and the mysql kafka consumers on eventlog1002 as prep step for db1107 maintenance [08:04:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:06:23] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1092 crashed - https://phabricator.wikimedia.org/T205514 (10Marostegui) p:05Triage>03Normal a:03Cmjohnson @Cmjohnson looks like we need a new BBU. 
This host is under warranty, can you talk to HP and see if we can get a new BBU before 10th Oct (a... [08:07:35] PROBLEM - haproxy failover on dbproxy1009 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [08:07:54] PROBLEM - haproxy failover on dbproxy1004 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [08:08:04] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462872 (https://phabricator.wikimedia.org/T205514) (owner: 10Marostegui) [08:08:14] banyek: ^ that you [08:08:15] ^ banyek I assume scheduled? [08:08:31] yeah he is working on db1107 [08:08:42] it was scheduled [08:08:49] [undecodable bytes] [08:08:49] [undecodable bytes] [08:08:52] [undecodable bytes] [08:08:54] [undecodable bytes] [08:08:57] [undecodable bytes] [08:08:59] sigh, ok setting +r [08:08:59] I was not aware of the haprox checks sorry [08:09:07] remember to upgrade both kernel and mariadb [08:09:15] PROBLEM - MariaDB Slave Lag: s6 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 322.40 seconds [08:12:59] banyek: make sure to reload the proxy once you are done [08:13:21] because it won't failback to db1107 right?
[08:13:32] (03PS3) 10Muehlenhoff: Enable cumin1001 and cumin2001 as mysql maintenance clients [puppet] - 10https://gerrit.wikimedia.org/r/460323 (https://phabricator.wikimedia.org/T177385) [08:13:38] We don't use the proxy for m4, but just to leave it in a healthy state with both servers up [08:14:02] (03CR) 10Volans: "I'm still wondering if all this is really needed as it seems quite a waste to me to pass this parameter in all those resources to just end" [puppet] - 10https://gerrit.wikimedia.org/r/462793 (owner: 10Cwhite) [08:14:10] 10Operations, 10monitoring, 10Patch-For-Review, 10User-fgiunchedi: Better organization for SRE grafana dashboards - https://phabricator.wikimedia.org/T178690 (10fgiunchedi) [08:14:37] !log Restarting CI Jenkins on contint1001 [08:14:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:47] (03CR) 10Jcrespo: [C: 032] mariadb: Disable notifications on db1092, crashed [puppet] - 10https://gerrit.wikimedia.org/r/462873 (https://phabricator.wikimedia.org/T205514) (owner: 10Jcrespo) [08:14:54] PROBLEM - MariaDB Slave Lag: s6 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 316.46 seconds [08:15:04] (03CR) 10Volans: [C: 04-1] "See first my comment on Iacf519dbed92dadbc9603ef4a90928b4ee21cb28, maybe all this is not needed in the end." (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/462791 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [08:15:21] !log db1107 upgrade finished [08:15:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:19] (03CR) 10Muehlenhoff: "cumin1001 is now also set up, so I've updated the grants patch to also include cumin1001." 
[puppet] - 10https://gerrit.wikimedia.org/r/460323 (https://phabricator.wikimedia.org/T177385) (owner: 10Muehlenhoff) [08:17:04] !log db1107 upgrade finished (T205288) [08:17:06] ACKNOWLEDGEMENT - HP RAID on db1092 is CRITICAL: CRITICAL: Slot 1: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2 - Controller: OK - Battery count: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T205516 [08:17:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:12] T205288: Maintenance M4 cluster - https://phabricator.wikimedia.org/T205288 [08:17:12] 10Operations, 10ops-eqiad: Degraded RAID on db1092 - https://phabricator.wikimedia.org/T205516 (10ops-monitoring-bot) [08:17:25] jynus: ^ [08:17:27] interesting [08:17:33] Ah right [08:17:36] Battery count = 0 [08:17:42] Same thing we saw, BBU broken [08:18:13] let me get the health logs, as the errors says they will ask for that [08:18:37] great! [08:18:44] that will speed up the process for chris! 
[08:19:10] 10Operations, 10ops-eqiad: Degraded RAID on db1092 - https://phabricator.wikimedia.org/T205516 (10Marostegui) [08:19:16] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1092 crashed - https://phabricator.wikimedia.org/T205514 (10Marostegui) [08:19:25] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1092 crashed - BBU broken - https://phabricator.wikimedia.org/T205514 (10Marostegui) [08:21:25] PROBLEM - MariaDB Slave Lag: s6 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 315.64 seconds [08:21:42] !log rebooting mw1250-mw1269 for kernel security updates [08:21:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:09] !log start eventlogging_sync on db1108 and the mysql kafka consumers on eventlog1002 after db1107 maintenance [08:22:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:34] RECOVERY - Filesystem available is greater than filesystem size on ms-be2041 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2041&var-datasource=codfw%2520prometheus%252Fops [08:24:21] !log Restarting CI Jenkins on contint1001 [#2] [08:24:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:25] RECOVERY - haproxy failover on dbproxy1004 is OK: OK check_failover servers up 2 down 0 [08:27:25] RECOVERY - haproxy failover on dbproxy1009 is OK: OK check_failover servers up 2 down 0 [08:27:44] marostegui,jynus - why do we have even if we don't use them for m4 ? [08:27:50] have --^ [08:28:14] elukey: We used to use them until a few months ago, when we (I) made a split brain [08:28:41] elukey: Ideally all the services should have an haproxy in front to have automatic failover, even core! 
[08:29:12] elukey: don't underestimage who is using a service and assumes an old arch [08:29:32] we will change it when we decom the proxies [08:29:41] but until now, we will keep the old structure [08:29:49] *until then [08:31:11] sure sure I was only asking why it was up, just to know the gears around db1107/8, I wasn't underestimating anything :) [08:31:15] thanks for the explanation [08:32:07] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1092 crashed - BBU broken - https://phabricator.wikimedia.org/T205514 (10jcrespo) {F26209361} AHS log on google drive, cannot attach it on phabricator: https://drive.google.com/open?id=1Y9RikXhRlZHY-7gNN0MeK5AdOjswPXJH [08:35:58] !log restarting blazegraph and updater on wdqs2003 [08:36:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:49] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1092 crashed - BBU broken - https://phabricator.wikimedia.org/T205514 (10Marostegui) [08:38:07] (03CR) 10Jcrespo: "Ok, deploying and merging. Was firewall and roles applied too?" [puppet] - 10https://gerrit.wikimedia.org/r/460323 (https://phabricator.wikimedia.org/T177385) (owner: 10Muehlenhoff) [08:38:54] (03CR) 10ArielGlenn: Add lexemes dump as separate dump (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/461862 (https://phabricator.wikimedia.org/T202830) (owner: 10Smalyshev) [08:44:50] 10Operations, 10Mail, 10Wikimedia-Mailing-lists, 10Security: Sender email spoofing - https://phabricator.wikimedia.org/T160529 (10Bawolff) Huh. So the from headers aren't being spoofed. Only the human readable name is misleading. So if this is a list that only allows posting by members, and info@litwestl... 
[08:49:58] (03CR) 10Banyek: [C: 031] mariadb: fix typo in template max_allowed_packet [puppet] - 10https://gerrit.wikimedia.org/r/462769 (owner: 10Arturo Borrero Gonzalez) [08:50:28] PROBLEM - MariaDB Slave Lag: s6 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 707.95 seconds [08:50:36] (03CR) 10Jcrespo: [C: 031] mariadb: fix typo in template max_allowed_packet [puppet] - 10https://gerrit.wikimedia.org/r/462769 (owner: 10Arturo Borrero Gonzalez) [08:50:58] (03PS3) 10Arturo Borrero Gonzalez: mariadb: fix typo in template max_allowed_packet [puppet] - 10https://gerrit.wikimedia.org/r/462769 [08:52:02] (03CR) 10Arturo Borrero Gonzalez: [C: 032] mariadb: fix typo in template max_allowed_packet [puppet] - 10https://gerrit.wikimedia.org/r/462769 (owner: 10Arturo Borrero Gonzalez) [08:58:32] <_joe_> !log disabling puppet on all hosts with a MediaWiki setup before merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/453093 [08:58:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:22] (03PS1) 10Marostegui: db-eqiad.php: Depool db2092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462882 [09:01:24] (03PS10) 10Giuseppe Lavagetto: mediawiki: move php to a profile, use the php class [puppet] - 10https://gerrit.wikimedia.org/r/453093 (https://phabricator.wikimedia.org/T201140) [09:03:25] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db2092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462882 (owner: 10Marostegui) [09:03:41] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: move php to a profile, use the php class [puppet] - 10https://gerrit.wikimedia.org/r/453093 (https://phabricator.wikimedia.org/T201140) (owner: 10Giuseppe Lavagetto) [09:04:25] <_joe_> arturo: can I merge your patch too? 
[09:04:30] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db2092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462882 (owner: 10Marostegui) [09:06:06] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Depool db2092 (duration: 00m 56s) [09:06:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:15] (03CR) 10Muehlenhoff: "Yeah, the firewall changes got merged via 5cfb53eab9 yesterday." [puppet] - 10https://gerrit.wikimedia.org/r/460323 (https://phabricator.wikimedia.org/T177385) (owner: 10Muehlenhoff) [09:06:48] !log Deploy schema change on db2092 - T203709 [09:06:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:56] T203709: Schema change for adding indexes of ct_tag_id - https://phabricator.wikimedia.org/T203709 [09:07:27] (03CR) 10Arturo Borrero Gonzalez: "It's weird to see new code (and hiera keys) being introduced with the 'labs' namespace. We are now using the 'cloud' or 'wmcs' namespaces." [puppet] - 10https://gerrit.wikimedia.org/r/458810 (https://phabricator.wikimedia.org/T183983) (owner: 10Banyek) [09:10:09] (03CR) 10jenkins-bot: db-eqiad.php: Depool db2092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462882 (owner: 10Marostegui) [09:13:07] (03PS4) 10Jcrespo: Enable cumin1001 and cumin2001 as mysql maintenance clients [puppet] - 10https://gerrit.wikimedia.org/r/460323 (https://phabricator.wikimedia.org/T177385) (owner: 10Muehlenhoff) [09:13:51] 10Operations, 10IRCecho: ircecho / icinga-wm crashlooping - https://phabricator.wikimedia.org/T205522 (10fgiunchedi) [09:13:55] volans: ^ [09:14:00] thx [09:15:50] godog: we could apply the same fix that was done in fundraising [09:16:30] which fix? 
[09:16:34] godog: T202314 [09:16:35] T202314: icingawm is missing from #wikimedia-fundraising channel - https://phabricator.wikimedia.org/T202314 [09:17:15] (03PS1) 10Giuseppe Lavagetto: profile::mediawiki::php: fix name of the pdo-mysql extension [puppet] - 10https://gerrit.wikimedia.org/r/462885 [09:17:16] I can see it rejoined in #wikimedia-fundraising but not in #wikimedia-analytics [09:17:40] shinken too is quitting ^^^ [09:17:53] (03PS1) 10Arturo Borrero Gonzalez: cloudvps: horizon: move wikimania-support to eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/462887 (https://phabricator.wikimedia.org/T204745) [09:18:17] suspicious, perhaps some countermeasure from freenode [09:19:47] godog: ok saw spam in fundraising so it's not +r clearly [09:19:54] (03CR) 10Giuseppe Lavagetto: [C: 032] profile::mediawiki::php: fix name of the pdo-mysql extension [puppet] - 10https://gerrit.wikimedia.org/r/462885 (owner: 10Giuseppe Lavagetto) [09:20:17] (03CR) 10Arturo Borrero Gonzalez: [C: 032] cloudvps: horizon: move wikimania-support to eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/462887 (https://phabricator.wikimedia.org/T204745) (owner: 10Arturo Borrero Gonzalez) [09:21:36] <_joe_> and ofc I goit it wrong [09:23:51] (03PS2) 10Giuseppe Lavagetto: php: do not escape underscores in safe names [puppet] - 10https://gerrit.wikimedia.org/r/462885 [09:25:07] volans: sigh, I'm going to try the +e here and see what happens [09:25:22] (03CR) 10Jcrespo: [C: 032] Enable cumin1001 and cumin2001 as mysql maintenance clients [puppet] - 10https://gerrit.wikimedia.org/r/460323 (https://phabricator.wikimedia.org/T177385) (owner: 10Muehlenhoff) [09:25:23] godog: I'm running an strace on ircecho waiting that it restarts [09:25:27] to see if I'll get something [09:25:35] (03PS5) 10Jcrespo: Enable cumin1001 and cumin2001 as mysql maintenance clients [puppet] - 10https://gerrit.wikimedia.org/r/460323 (https://phabricator.wikimedia.org/T177385) (owner: 10Muehlenhoff) [09:26:36] ack 
[09:29:12] (03CR) 10Giuseppe Lavagetto: [C: 032] "https://puppet-compiler.wmflabs.org/compiler1002/12625/mw1261.eqiad.wmnet/ shows the correct desired change." [puppet] - 10https://gerrit.wikimedia.org/r/462885 (owner: 10Giuseppe Lavagetto) [09:29:21] (03PS3) 10Giuseppe Lavagetto: php: do not escape underscores in safe names [puppet] - 10https://gerrit.wikimedia.org/r/462885 [09:29:45] !log restarting icinga at einsteinium, unresponsive to commands [09:29:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:01] 10:29 :: #wikimedia-operations: ban exception *!*@*.wikimedia.org [by wolfe.freenode.net, 2128903 secs ago] [09:30:07] volans: ^ so yeah clearly not working [09:30:22] godog you should be able to register the bot now [09:30:28] like shinken is logged in. [09:31:21] paladox: thanks, yeah we'll have to do that for sure [09:31:29] paladox: btw shinken too disconnected and reconnected 5 times today [09:31:42] so something's else is also going on probably [09:32:00] yeh i guess. [09:32:29] btw, why shinken connects here? 
:) [09:34:55] godog: if the ban exemption isn't working, it might be trying to join before id'ing [09:34:58] (03CR) 10Banyek: "firstly: yet it contains the service, that should be a leftover comment" [puppet] - 10https://gerrit.wikimedia.org/r/458810 (https://phabricator.wikimedia.org/T183983) (owner: 10Banyek) [09:35:25] p858snake|L: there's no identifying yet [09:35:26] * godog brb [09:40:12] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db2092" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462896 [09:42:19] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db2092" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462896 (owner: 10Marostegui) [09:43:27] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db2092" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462896 (owner: 10Marostegui) [09:45:54] !log rebooting conf1004 for kernel security update [09:45:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:21] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Repool db2092 (duration: 00m 56s) [09:46:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:57] (03CR) 10Marostegui: "> firstly: yet it contains the service, that should be a leftover" [puppet] - 10https://gerrit.wikimedia.org/r/458810 (https://phabricator.wikimedia.org/T183983) (owner: 10Banyek) [09:48:08] volans: any joy with the strace? [09:48:31] godog: ofc it didn't fail yet [09:48:50] you know, it's quantistic, looking at it makes it working :D [09:50:06] lol, it just quit [09:50:14] so we have to _talk_ about it and observe [09:50:16] yep! 
[09:51:59] !log rebooting conf1005 for kernel security update [09:52:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:29] 10Operations, 10IRCecho: ircecho / icinga-wm crashlooping - https://phabricator.wikimedia.org/T205522 (10Volans) This is an strace snip of the bot exiting (put the window at max width for better reading): ``` [pid 4596] 09:49:56 select(25, [24], [], [], {0, 200000})... [09:52:32] godog: ^^^ pasted there [09:54:35] godog: I've an idea, not sure if true [09:54:54] if in any of the channels where icinga-wm is present there is the new spam with those invalid unicode chars [09:54:57] it failes [09:54:58] *fails [09:55:41] I assume that being connected to the chan it anyway 'reads' the channel although it doesn't actually need to as it's not interactive [09:56:28] the time matches spam in the fundraising chan [09:56:40] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db2092" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462896 (owner: 10Marostegui) [09:57:12] intriguing [09:58:05] I might be wrong ofc :) [09:58:31] again, spam in fundraising + quit [09:58:46] 10Operations, 10Operations-Software-Development, 10Patch-For-Review: Upgrade Cumin masters to stretch - https://phabricator.wikimedia.org/T177385 (10jcrespo) I deployed the grants to the new hosts but 1) it needs to change for labs 2) some hosts had errors: ```lines=10 root@neodymium:~/software/dbtools$ /us... [10:01:22] 10Operations, 10Operations-Software-Development, 10Patch-For-Review: Upgrade Cumin masters to stretch - https://phabricator.wikimedia.org/T177385 (10jcrespo) db1107 db1118 (because it is MySQL 8.0 and has a different syntax) labservices1001 labservices1002 labtestservices2001 (because the 3 have a wikimedi... [10:03:31] volans: maybe try having it just join in here and see how long it lasts? 
[10:03:41] and not the other channels [10:08:18] !log rebooting conf1006 for kernel security update [10:08:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:01] p858snake|L: yeah, but the other channels wants that too :D [10:10:00] well its a test case to see if the unicode is causing it [10:10:09] I'm checking the other channels anyway [10:12:47] 10Operations, 10MediaWiki-extensions-WikibaseClient, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, and 8 others: Investigate more efficient memcached solution for CacheAwarePropertyInfoStore - https://phabricator.wikimedia.org/T97368 (10Jonas) [10:12:50] godog: so, of all the channels icinga-wm should be in, it's actually in only in: #wikidata, #wikimedia-reading-web-bots, #wikimedia-perf-bots, #wikimedia-fundraising [10:13:11] and on #wikimedia-ai there is an icinga2-wm [10:13:12] <_joe_> !log reenabling puppet on the MediaWiki hosts [10:13:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:23] volans: lol, first time I hear about icinga2-wm and it has got a nick registered and everything [10:14:26] what the [10:14:37] (03PS1) 10Mathew.onipe: Switch public cluster to Kafka event source [puppet] - 10https://gerrit.wikimedia.org/r/462907 (https://phabricator.wikimedia.org/T189458) [10:16:35] godog: volans thats me [10:16:44] That runs icinga2 in wmcs [10:17:58] yeah I got to the same conclusion googling for icinga2-wm, thanks paladox [10:18:30] so if we upgrade to icinga2 in prod we need to find another name :D [10:20:33] paladox: is shinken-wm also ircecho? 
[10:23:20] I thik so yeah, from the ircname [10:23:35] so the fact that it's quitting too seems to back my idea [10:25:35] indeed [10:26:56] !log rebooting mw1270-mw1290 for kernel security updates [10:27:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:28] !log stop ircecho on einsteinium to register the nickname [10:27:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:07] 10Operations, 10ops-eqiad, 10Analytics: analytics1068 doesn't boot - https://phabricator.wikimedia.org/T203244 (10elukey) Any news from Dell ? [10:30:55] !log started ircecho on einsteinium [10:31:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:16] 10Operations, 10DBA, 10Epic, 10Patch-For-Review: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107 (10Marostegui) An update on this. We have pretty much completed all the tasks we had scheduled for the failover. We are now advancing on other tasks, to complete them faster... [10:41:20] 10Operations, 10IRCecho: ircecho / icinga-wm crashlooping - https://phabricator.wikimedia.org/T205522 (10Volans) My theory so far checking also the timing is that some of the Freenode spam with broken unicode makes the bot crash. Apparently the bot "reads" the channel although it doesn't need to as it's not in... [10:47:11] volans: yes [10:48:38] 10Operations, 10Goal, 10Patch-For-Review: Perform a datacenter switchover (2018-19 Q1) - https://phabricator.wikimedia.org/T199073 (10Volans) [10:49:03] 10Operations: Register and identify icinga-wm - https://phabricator.wikimedia.org/T205526 (10fgiunchedi) [10:50:15] paladox: so it seems to me that ircecho "reads" the channel traffic [10:50:29] and that the recent spam on freenode with broken unicode breaks it [10:53:05] Oh [10:56:36] anybody with experience debugging rabbitmq issues would like to help me with debugging an apparent problem in a queue? 
[10:57:06] 10Operations: Register and identify icinga-wm - https://phabricator.wikimedia.org/T205526 (10fgiunchedi) I've registered the nick with email `noc+icinga-wm@wikimedia.org` since `noc@wikimedia.org` couldn't be used `12:28 -NickServ(NickServ@services.)- noc@wikimedia.org has too many accounts registered.` Though a... [10:57:10] 10Operations: Register and identify icinga-wm - https://phabricator.wikimedia.org/T205526 (10fgiunchedi) [10:57:59] https://phabricator.wikimedia.org/T205524#4618563 <--- this [10:59:40] arturo: super ignorant - I am seeing that the l3-agent gets a timeout no? [11:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180926T1100). [11:00:04] Zoranzoki21: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:16] I would like to see what's going on in the queue, ideally I would like to see actual messages flowing in there, like a debug mode or something [11:01:00] arturo: rabbitmq has it's own webui if enabled [11:01:29] volans: I don't thing we have that enabled [11:01:38] volans: does that contains debug info? [11:02:26] hm, zoranzoki21 is not around for swat... [11:02:39] it shows you the queues, messages, etc.. depends what you need [11:02:43] arturo: might be worth to check if something before the queue stops the TCP connection to happen (for some reason) [11:03:49] arturo: see also rabbitmqctl [11:03:55] elukey: well, tcpdump is showing weird TCP packets with a lot of [P.] 
flags, not sure what they mean in this context [11:07:12] (03PS1) 10Mholloway: Fix: Regenerate map tiles up to zoom level 9 with notify-tilerator-regen [puppet] - 10https://gerrit.wikimedia.org/r/462918 (https://phabricator.wikimedia.org/T202201) [11:09:58] (03PS29) 10Banyek: wikireplicas: Config template generation for wmf-pt-kill [puppet] - 10https://gerrit.wikimedia.org/r/458810 (https://phabricator.wikimedia.org/T183983) [11:11:14] 10Operations, 10Operations-Software-Development, 10Patch-For-Review: Upgrade Cumin masters to stretch - https://phabricator.wikimedia.org/T177385 (10jcrespo) This should be for the most part fixed, although showed some bugs on the implementation that may need to be fixed later, but for now mysqls can be quer... [11:11:16] !log disabled puppet on einsteinium to test ircecho failure, I'll re-enable it in max ~30m [11:11:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:22] Hi, I am here [11:12:24] Sorry for lating [11:12:30] I had problems with SWATing [11:13:01] Who swating? [11:13:08] Zoranzoki21: zeljkof [11:13:21] was looking for you before [11:13:30] volans: I had problems with IRC [11:13:46] Zoranzoki21: I'm around and in swatting [11:14:00] I'll ping you in about 5 minutes when the patch is ready for testing [11:15:42] (03CR) 10Banyek: "I am not even sure if the long running query killing only affects the labs/cloud/wmcs hosts so I'd prefer to merge it now, and do the refa" [puppet] - 10https://gerrit.wikimedia.org/r/458810 (https://phabricator.wikimedia.org/T183983) (owner: 10Banyek) [11:17:12] (03CR) 10Gehel: [C: 031] "LGTM, let's just wait for confirmation from Stas and deploy this tonight." 
[puppet] - 10https://gerrit.wikimedia.org/r/462907 (https://phabricator.wikimedia.org/T189458) (owner: 10Mathew.onipe) [11:20:58] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462191 (https://phabricator.wikimedia.org/T205206) (owner: 10Zoranzoki21) [11:21:58] Hallo. I used to run queries for statistics about Content Translation on terbium. Then I switched to mwmaint1001. Now it says "Do not use this server" in big ASCII art. Kart_ recommended that I use mwmaint1002.codfw.wmnet instead, and it worked yesterday, but now it says: [11:22:23] channel 0: open failed: administratively prohibited: open failed. stdio forwarding failed. ssh_exchange_identification: Connection closed by remote host. [11:22:39] probably mwmaint2001.codfw.wmnet [11:23:07] (03PS1) 10Urbanecm: Lift account creation IP cap for 2018-09-26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462928 (https://phabricator.wikimedia.org/T205529) [11:23:46] (03CR) 10Zfilipin: Enable VisualEditor in Project namespace on srwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462191 (https://phabricator.wikimedia.org/T205206) (owner: 10Zoranzoki21) [11:23:53] (03PS3) 10Zfilipin: Enable VisualEditor in Project namespace on srwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462191 (https://phabricator.wikimedia.org/T205206) (owner: 10Zoranzoki21) [11:24:03] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462191 (https://phabricator.wikimedia.org/T205206) (owner: 10Zoranzoki21) [11:24:07] mark: OK, appears to work. Thanks. 
[11:24:16] jouncebot, now [11:24:16] For the next 0 hour(s) and 35 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180926T1100) [11:24:18] jouncebot, next [11:24:18] In 0 hour(s) and 35 minute(s): Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180926T1200) [11:24:20] jouncebot's doing nothing. [11:24:29] Urbanecm: i saw those...? [11:25:06] addshore, thank you, seems I have slow connection, 30 s is big delay nowadays. [11:25:11] :D [11:25:19] (03Merged) 10jenkins-bot: Enable VisualEditor in Project namespace on srwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462191 (https://phabricator.wikimedia.org/T205206) (owner: 10Zoranzoki21) [11:25:46] zeljkof, can you please deploy 462982 for T205529? It's last minute throttle request, needed in 1 hour approx. [11:25:47] T205529: Lift account creation IP cap for 2018-09-26 - https://phabricator.wikimedia.org/T205529 [11:26:11] Urbanecm: sure, please add to the calendar [11:26:13] will do [11:27:12] Urbanecm: just a note, because of datacenter switch, debug server is mwdebug2001.codfw.wmnet (alias mw2017.codfw.wmnet) [11:27:18] in case I forgot to mention it later [11:27:19] (03CR) 10jenkins-bot: Enable VisualEditor in Project namespace on srwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462191 (https://phabricator.wikimedia.org/T205206) (owner: 10Zoranzoki21) [11:27:40] (it's 462928 not 462982, sorry) [11:28:14] zeljkof, thanks, IIRC you can push throttle rules straightly to production [11:28:31] (untestable anyway, as it is not effective unless request's coming from right IP and in right timezone) [11:28:35] Urbanecm: ah, didn't see it's throttle rule [11:29:54] what's this in the logs now!? 
[11:29:59] `data error in /srv/mediawiki/php-1.32.0-wmf.22/extensions/Graph/includes/ApiGraph.php on line 125` [11:30:27] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:462191|Enable VisualEditor in Project namespace on srwiki (T205206)]] (duration: 00m 57s) [11:30:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:35] T205206: Enable VisualEditor in namespace of project (Википедија) on srwiki - https://phabricator.wikimedia.org/T205206 [11:30:47] Zoranzoki21: deployed [11:31:07] zeljkof: Ok, checking [11:31:27] anybody knows what's wrong with ApiGraph.php?! cc moritzm [11:32:31] hm, seems to have gone away [11:32:40] checking phab [11:33:26] zeljkof, will be absent while deploying, eating. Please deploy without me if possible. Thanks! [11:38:35] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462928 (https://phabricator.wikimedia.org/T205529) (owner: 10Urbanecm) [11:39:33] (03Merged) 10jenkins-bot: Lift account creation IP cap for 2018-09-26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462928 (https://phabricator.wikimedia.org/T205529) (owner: 10Urbanecm) [11:42:08] !log zfilipin@deploy1001 Synchronized wmf-config/throttle.php: SWAT: [[gerrit:462928|Lift account creation IP cap for 2018-09-26 (T205529)]] (duration: 00m 56s) [11:42:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:16] T205529: Lift account creation IP cap for 2018-09-26 - https://phabricator.wikimedia.org/T205529 [11:42:33] (03CR) 10jenkins-bot: Lift account creation IP cap for 2018-09-26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462928 (https://phabricator.wikimedia.org/T205529) (owner: 10Urbanecm) [11:42:47] ApiGraph.php error in logs is T184128 [11:42:48] T184128: "PHP Warning: data error" from gzdecode() in ApiGraph.php and ApiQueryMapData.php - https://phabricator.wikimedia.org/T184128 [11:42:58] Urbanecm: deployed! 
[11:43:12] !log EU SWAT finished [11:43:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:52] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: nathante/groceryheist shell request for researchers, statistics-privatedata-users, analytics-privatedata-users - https://phabricator.wikimedia.org/T204790 (10MoritzMuehlenhoff) [11:50:26] 10Operations, 10monitoring, 10Patch-For-Review, 10Performance-Team (Radar): Revisit Grafana/Icinga notification strategy - https://phabricator.wikimedia.org/T203485 (10MoritzMuehlenhoff) p:05Triage>03Normal [11:51:27] 10Operations, 10Thumbor: in Commons, some PDFs are failing to render thumbnails. - https://phabricator.wikimedia.org/T203402 (10MoritzMuehlenhoff) p:05Triage>03Normal [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180926T1200) [12:12:37] (03CR) 10Gehel: [C: 04-1] Fix: Regenerate map tiles up to zoom level 9 with notify-tilerator-regen (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/462918 (https://phabricator.wikimedia.org/T202201) (owner: 10Mholloway) [12:20:27] (03PS1) 10Filippo Giunchedi: icinga: use IRC account password [puppet] - 10https://gerrit.wikimedia.org/r/463051 (https://phabricator.wikimedia.org/T205526) [12:21:13] 10Operations, 10Wiki-Loves-Love, 10Wikimedia-Mailing-lists: Create a mailling list for Wiki Loves Love - https://phabricator.wikimedia.org/T203792 (10MoritzMuehlenhoff) I've just created the list: https://lists.wikimedia.org/mailman/listinfo/wll I'll contact you via mail for further details [12:21:45] zeljkof: is the train actually going to roll? [12:22:01] * addshore notices there are many blockers listed on the ticket [12:22:11] addshore: no, I'll probably revert :( [12:22:19] revert group 0 too? 
[12:22:22] many blockers the last time I've checked [12:22:34] I don't know, that's what the docs say to do [12:22:39] okay [12:22:57] "if there is an unexplained error that occurs within 1 hour of a train deployment — always roll back the train." [12:23:04] * addshore has to bump the parser cache epoch for wikidatawiki and i was basically trying to figure out when I can fit that in [12:23:09] https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment/Train_deploys#Breakage [12:23:21] "unexplained error" [12:23:26] Theres always next time i guess :( [12:23:37] well, all errors are somewhat explained... [12:23:39] :D [12:24:17] Is the error mentioned earlier depending on what it is i may be able to look into it [12:24:50] 10Operations, 10MediaWiki-extensions-CodeReview, 10Patch-For-Review, 10Wikimedia-production-error: Exec error "Possibly missing executable file: svn diff" from Special:Code - https://phabricator.wikimedia.org/T204801 (10MoritzMuehlenhoff) >>! In T204801#4616673, @Krinkle wrote: > Agreed. If that's easier,... 
[12:28:05] (03CR) 10Gehel: [C: 032] Split instance define out of elasticsearch class [puppet] - 10https://gerrit.wikimedia.org/r/441338 (https://phabricator.wikimedia.org/T198351) (owner: 10EBernhardson) [12:29:00] 10Operations, 10Patch-For-Review: Register and identify icinga-wm - https://phabricator.wikimedia.org/T205526 (10MoritzMuehlenhoff) a:03fgiunchedi [12:29:57] (03CR) 10Filippo Giunchedi: cloudvps: add prometheus-openstack-exporter (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/462455 (https://phabricator.wikimedia.org/T203177) (owner: 10Arturo Borrero Gonzalez) [12:30:53] 10Operations, 10Mail: Mail relays needed for VMs in eqiad1 - https://phabricator.wikimedia.org/T205158 (10MoritzMuehlenhoff) a:03herron [12:32:44] (03PS72) 10Gehel: Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 (https://phabricator.wikimedia.org/T198351) (owner: 10EBernhardson) [12:32:52] * addshore will schedule an hour slot bumping the cache epoch for after the train [12:33:13] (03PS1) 10Elukey: profile::mariadb::misc::el::sanitization: avoid using /var/run [puppet] - 10https://gerrit.wikimedia.org/r/463053 [12:33:14] !log Deploy schema change on s4 eqiad, will generate lag - T203709 [12:33:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:23] T203709: Schema change for adding indexes of ct_tag_id - https://phabricator.wikimedia.org/T203709 [12:34:28] Reedy, have you looked at otrs-wiki LOEA? [12:34:43] Hm? [12:35:28] prod is handling chapter email [12:37:26] To what extent? Just to one mail queue? 
[12:37:32] I think they wanted individual mail addresses etc [12:38:08] o/ greg-g I think the "morning swat" time slot is wrong in the gcal I believe you maintain :) [12:38:09] 10Operations, 10Puppet, 10Analytics-Kanban, 10Beta-Cluster-Infrastructure, and 2 others: Prometheus resources in deployment-prep to create grafana graphs of EventLogging - https://phabricator.wikimedia.org/T204088 (10fgiunchedi) >>! In T204088#4616379, @Ottomata wrote: > BTW, I updated https://wikitech.wik... [12:38:10] oh, no individual mail addresses [12:38:17] many group ones [12:38:23] (03PS1) 10Alexandros Kosiaris: Update to 5.0.30 [software/otrs] - 10https://gerrit.wikimedia.org/r/463054 [12:38:34] but there is for example @wikimedia.fr mail handling etc. [12:39:22] elukey: there's an alert for db1107 systemd related [12:40:43] (03PS2) 10Alexandros Kosiaris: Update to 5.0.30 [software/otrs] - 10https://gerrit.wikimedia.org/r/463054 [12:41:07] Krenair: specifically... "Say, with za.wikimedia.org would you also be able to host mail for chapter members in your G Suite instance as well?" [12:41:27] yeah no that wouldn't fly [12:41:55] marostegui: thanks! Already filed https://gerrit.wikimedia.org/r/463053 [12:42:18] elukey: great! 
thank you [12:42:41] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Update to 5.0.30 [software/otrs] - 10https://gerrit.wikimedia.org/r/463054 (owner: 10Alexandros Kosiaris) [12:42:47] (03PS2) 10Mholloway: Fix: Regenerate map tiles up to zoom level 9 with notify-tilerator-regen [puppet] - 10https://gerrit.wikimedia.org/r/462918 (https://phabricator.wikimedia.org/T202201) [12:42:49] Hence my response ;) [12:42:54] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM in principle since site/cluster arguments mirror what's in profile::cumin::target already, I've added Giuseppe too for some more insi" [puppet] - 10https://gerrit.wikimedia.org/r/462810 (https://phabricator.wikimedia.org/T204088) (owner: 10Alex Monk) [12:45:19] (03PS1) 10Elukey: add ::passwords::recommendationapi::mysql fake passwords [labs/private] - 10https://gerrit.wikimedia.org/r/463057 [12:45:52] (03CR) 10Elukey: [V: 032 C: 032] add ::passwords::recommendationapi::mysql fake passwords [labs/private] - 10https://gerrit.wikimedia.org/r/463057 (owner: 10Elukey) [12:48:46] (03CR) 10Volans: "Looks a bit like a hack to me but I didn't had time to deeply look into it if there is a better way. I've just verified that no host in pr" [puppet] - 10https://gerrit.wikimedia.org/r/462810 (https://phabricator.wikimedia.org/T204088) (owner: 10Alex Monk) [12:50:23] (03CR) 10Gehel: "puppet compiler looks happy: https://puppet-compiler.wmflabs.org/compiler1002/12628/" [puppet] - 10https://gerrit.wikimedia.org/r/440049 (https://phabricator.wikimedia.org/T198351) (owner: 10EBernhardson) [12:52:18] (03PS2) 10Elukey: profile::mariadb::misc::el::sanitization: avoid using /var/run [puppet] - 10https://gerrit.wikimedia.org/r/463053 [12:55:12] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/compiler1002/12629/db1108.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/463053 (owner: 10Elukey) [13:00:04] zeljkof: I seem to be stuck in Groundhog week. Sigh. 
Time for (yet another) MediaWiki train - European version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180926T1300). [13:00:48] o/ [13:00:59] so, um, 4 train blockers, so, um, no train [13:01:05] T191069 [13:01:06] T191069: 1.32.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T191069 [13:02:28] 10Operations, 10IRCecho: Puppet doesn't restart ircecho when the code changes - https://phabricator.wikimedia.org/T205539 (10Volans) [13:03:56] !log installing tomcat8 security updates [13:04:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:28] 10Operations, 10OTRS: Upgrade to OTRS version 5.0.30 - https://phabricator.wikimedia.org/T205540 (10akosiaris) [13:05:04] (03PS1) 10Arturo Borrero Gonzalez: cloudvps: eqiad1: use lower timeout values [puppet] - 10https://gerrit.wikimedia.org/r/463062 (https://phabricator.wikimedia.org/T205524) [13:05:34] temporarily setting -r to have icinga-wm join [13:05:47] volans: ^ [13:06:05] godog: ^^^ :) [13:06:07] 10Operations, 10OTRS: Upgrade to OTRS version 5.0.30 - https://phabricator.wikimedia.org/T205540 (10akosiaris) https://community.otrs.com/security-advisory-2018-05-security-update-for-otrs-framework/ is also relevant and get's fixed by 5.0.30 as well. CVE numbers are: * CVE-2018-16587 https://community.otrs... [13:06:38] 10Operations, 10OTRS: Upgrade to OTRS version 5.0.30 - https://phabricator.wikimedia.org/T205540 (10akosiaris) 05Open>03Resolved a:03akosiaris Upgrade completed successfully [13:06:46] nice, now we wait for spam -.- [13:06:55] (03PS1) 10Marostegui: db-eqiad.php: Depool db2071 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463064 [13:07:12] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:07:14] yeah :D [13:09:25] (03PS1) 10Filippo Giunchedi: base: link documentation for negative disk space available reported [puppet] - 10https://gerrit.wikimedia.org/r/463065 (https://phabricator.wikimedia.org/T199198) [13:09:27] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db2071 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463064 (owner: 10Marostegui) [13:09:54] (03CR) 10jerkins-bot: [V: 04-1] base: link documentation for negative disk space available reported [puppet] - 10https://gerrit.wikimedia.org/r/463065 (https://phabricator.wikimedia.org/T199198) (owner: 10Filippo Giunchedi) [13:10:51] (03PS2) 10Arturo Borrero Gonzalez: cloudvps: eqiad1: use lower timeout values [puppet] - 10https://gerrit.wikimedia.org/r/463062 (https://phabricator.wikimedia.org/T205524) [13:11:08] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db2071 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463064 (owner: 10Marostegui) [13:11:21] (03CR) 10jenkins-bot: db-eqiad.php: Depool db2071 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463064 (owner: 10Marostegui) [13:11:25] taking a look on db1107 [13:11:26] (03PS2) 10Filippo Giunchedi: base: link documentation for negative disk space available reported [puppet] - 10https://gerrit.wikimedia.org/r/463065 (https://phabricator.wikimedia.org/T199198) [13:11:36] (03PS3) 10Arturo Borrero Gonzalez: cloudvps: eqiad1: use lower timeout values [puppet] - 10https://gerrit.wikimedia.org/r/463062 (https://phabricator.wikimedia.org/T205524) [13:11:42] banyek: elukey already replied, scroll back [13:11:57] sorry I was on a google meet [13:12:12] no worries, that is why I was saying that he already replied :) [13:12:21] PROBLEM - HHVM rendering on mw2164 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.073 second response time [13:12:23] (03CR) 10Arturo Borrero Gonzalez: [C: 032] cloudvps: eqiad1: 
use lower timeout values [puppet] - 10https://gerrit.wikimedia.org/r/463062 (https://phabricator.wikimedia.org/T205524) (owner: 10Arturo Borrero Gonzalez) [13:12:25] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Depool db2071 (duration: 00m 56s) [13:12:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:34] !log Deploy schema change on db2071 - T203709 [13:12:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:42] T203709: Schema change for adding indexes of ct_tag_id - https://phabricator.wikimedia.org/T203709 [13:13:15] banyek: sorry, I didn't ack the alert, just resolved it. My fault: /var/run is on tmpfs, and I was storing the last execution date for eventlogging_sanitization there, so after the reboot it was gone [13:13:20] I only realized the issue today [13:13:22] RECOVERY - HHVM rendering on mw2164 is OK: HTTP OK: HTTP/1.1 200 OK - 74626 bytes in 0.245 second response time [13:13:46] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db2071" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463066 [13:13:55] ah [13:14:27] elukey: so no obstacles to the upgrade for db1108 and we have to take care of this [13:14:29] ? 
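A minimal sketch of the state-file problem elukey describes at 13:13:15, assuming a job that records its last execution time in a small file. Anything under /var/run lives on tmpfs and is lost on reboot, so the path has to point at persistent storage instead. The function names and any example path are hypothetical and are not taken from the actual eventlogging_sanitization code.

```python
import os
import tempfile
from datetime import datetime, timezone
from typing import Optional


def write_last_run(state_file: str, when: datetime) -> None:
    """Record the last successful run time, replacing the file atomically."""
    directory = os.path.dirname(state_file) or "."
    os.makedirs(directory, exist_ok=True)
    # Write to a temporary file in the same directory, then rename over the
    # target so a crash never leaves a half-written state file behind.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(when.isoformat())
        os.replace(tmp_path, state_file)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def read_last_run(state_file: str) -> Optional[datetime]:
    """Return the recorded run time, or None if the file does not exist
    (for example after a reboot wiped a tmpfs-backed directory)."""
    try:
        with open(state_file) as handle:
            return datetime.fromisoformat(handle.read().strip())
    except FileNotFoundError:
        return None
```

A caller would pass a path on a persistent filesystem (hypothetically something under /srv) and treat a None result as "no previous run known" rather than as an error.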
[13:15:21] banyek: nono already fixed it, for db1108 should be ok (now the file is under /srv) [13:15:41] \o/ [13:19:24] (03CR) 10Filippo Giunchedi: [C: 032] base: link documentation for negative disk space available reported [puppet] - 10https://gerrit.wikimedia.org/r/463065 (https://phabricator.wikimedia.org/T199198) (owner: 10Filippo Giunchedi) [13:19:31] (03PS3) 10Filippo Giunchedi: base: link documentation for negative disk space available reported [puppet] - 10https://gerrit.wikimedia.org/r/463065 (https://phabricator.wikimedia.org/T199198) [13:21:24] godog: I'll try something: �gIPY��fp�X �R�#���C��|ό�V�`2�� [13:21:49] lol [13:21:54] damn, doesn't work ���wE�~�l�X�wEM9���W^�A [13:22:08] fine, let's wait for the real one [13:23:09] last one, I'm too curious Y޻#d [13:24:24] Reedy: lol, thanks [13:24:31] PROBLEM - puppet last run on elastic1019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:24:31] heh [13:24:33] I was trying to kill icinga-wm [13:24:38] it's a loong story [13:24:45] poorly written software? [13:24:46] Never [13:24:50] 10Operations, 10Puppet, 10Analytics-Kanban, 10Beta-Cluster-Infrastructure, and 2 others: Prometheus resources in deployment-prep to create grafana graphs of EventLogging - https://phabricator.wikimedia.org/T204088 (10Ottomata) I didn't realize that either! Documenting. [13:24:51] PROBLEM - puppet last run on bast1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:25:02] -��+'̌6]�3B�,J��s�q0~�Cd�/��S����9��>��F�H5��w"�V�sm%�U��Ze�(aza� [13:25:06] yay! [13:25:11] godog: ^^^ [13:25:28] as I expected UnicodeDecodeError: 'utf8' codec can't decode byte 0x9a in position 92: invalid start byte [13:25:32] PROBLEM - puppet last run on wdqs2006 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:25:33] I'll send a quick fix [13:25:41] PROBLEM - puppet last run on cp1088 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:25:51] PROBLEM - puppet last run on mw1253 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:26:02] PROBLEM - puppet last run on wdqs2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:26:12] PROBLEM - puppet last run on db2094 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:26:21] mmmh too many failures, checking puppetdb [13:26:22] PROBLEM - puppet last run on labvirt1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:26:23] PROBLEM - puppet last run on maps2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:26:23] PROBLEM - puppet last run on lvs3002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:26:24] PROBLEM - puppet last run on es1019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:26:24] PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:26:31] PROBLEM - puppet last run on ms-be1023 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:26:31] PROBLEM - puppet last run on mw2241 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:26:41] PROBLEM - puppet last run on maps2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:26:41] PROBLEM - puppet last run on cp2009 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:26:41] PROBLEM - puppet last run on mw2161 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:26:41] PROBLEM - puppet last run on ms-be1035 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:26:41] PROBLEM - puppet last run on db2067 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:26:50] godog: Monitoring::Check_prometheus[filesystem_avail_bigger_than_size]: has no parameter named 'nodes_url' [13:26:52] PROBLEM - puppet last run on elastic1051 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:26:55] revert or fix please [13:27:02] PROBLEM - puppet last run on labvirt1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:27:03] PROBLEM - puppet last run on db2060 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:27:03] PROBLEM - puppet last run on db2063 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:27:11] PROBLEM - puppet last run on mwlog1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:27:11] PROBLEM - puppet last run on puppetdb2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:27:12] PROBLEM - puppet last run on analytics1061 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:27:22] PROBLEM - puppet last run on kafka1012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:27:22] PROBLEM - puppet last run on kafka1023 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:27:33] PROBLEM - puppet last run on labvirt1014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:27:33] PROBLEM - puppet last run on ping1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:27:33] PROBLEM - puppet last run on mw1307 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:27:33] PROBLEM - puppet last run on mw1318 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:27:33] PROBLEM - puppet last run on dbproxy1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:27:33] PROBLEM - puppet last run on puppetboard2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:27:42] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:27:42] PROBLEM - puppet last run on iron is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:27:42] PROBLEM - puppet last run on auth1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:27:42] PROBLEM - puppet last run on aqs1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:27:43] PROBLEM - puppet last run on cp2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:27:43] PROBLEM - puppet last run on relforge1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:27:51] PROBLEM - puppet last run on elastic2012 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:27:51] PROBLEM - puppet last run on cp3033 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:27:51] PROBLEM - puppet last run on elastic2021 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:27:52] PROBLEM - puppet last run on maerlant is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:28:01] PROBLEM - puppet last run on cp1083 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:28:02] PROBLEM - puppet last run on mwdebug2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:28:11] PROBLEM - puppet last run on mc1024 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:28:11] PROBLEM - puppet last run on mc1027 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:28:11] PROBLEM - puppet last run on es1017 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:28:11] PROBLEM - puppet last run on analytics1029 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:28:11] PROBLEM - puppet last run on cp2022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:28:12] PROBLEM - puppet last run on cp1080 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:28:12] PROBLEM - puppet last run on cp1086 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:28:12] PROBLEM - puppet last run on db1077 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:28:13] PROBLEM - puppet last run on labmon1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:28:14] PROBLEM - puppet last run on rdb1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:28:14] PROBLEM - puppet last run on wdqs1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:28:15] PROBLEM - puppet last run on wdqs1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:28:15] hey zeljkof, just a quick question [13:28:17] volans: sorry [13:28:21] PROBLEM - puppet last run on ganeti1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:28:21] PROBLEM - puppet last run on db1100 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:28:21] PROBLEM - puppet last run on analytics1071 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:28:21] PROBLEM - puppet last run on ms-be1026 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:28:21] PROBLEM - puppet last run on authdns2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:28:22] PROBLEM - puppet last run on tureis is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:28:22] PROBLEM - puppet last run on rdb2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:28:27] (03PS1) 10Filippo Giunchedi: Revert "base: link documentation for negative disk space available reported" [puppet] - 10https://gerrit.wikimedia.org/r/463069 [13:28:31] PROBLEM - puppet last run on oresrdb1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:28:31] PROBLEM - puppet last run on dbproxy1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:28:41] PROBLEM - puppet last run on mw1280 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:28:41] PROBLEM - puppet last run on db2090 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:28:41] PROBLEM - puppet last run on debmonitor2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:28:41] PROBLEM - puppet last run on ms-be1030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:28:41] PROBLEM - puppet last run on cp3036 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:28:42] PROBLEM - puppet last run on mw1300 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:28:42] PROBLEM - puppet last run on mw1252 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:28:43] PROBLEM - puppet last run on wtp2009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:28:43] PROBLEM - puppet last run on mw2247 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:28:44] PROBLEM - puppet last run on cp3030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:28:44] PROBLEM - puppet last run on labservices1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:28:45] PROBLEM - puppet last run on cloudservices1004 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:28:45] PROBLEM - puppet last run on cp1084 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:28:46] PROBLEM - puppet last run on cp1085 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:28:49] (03CR) 10Filippo Giunchedi: [C: 032] Revert "base: link documentation for negative disk space available reported" [puppet] - 10https://gerrit.wikimedia.org/r/463069 (owner: 10Filippo Giunchedi) [13:29:01] PROBLEM - puppet last run on scb2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:29:02] PROBLEM - puppet last run on restbase2010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:29:02] PROBLEM - puppet last run on wtp2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:29:02] volans: at least icinga-wm works [13:29:02] PROBLEM - puppet last run on mw2217 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:29:02] PROBLEM - puppet last run on mw2179 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:29:02] PROBLEM - puppet last run on mw2267 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:29:02] PROBLEM - puppet last run on ores2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:29:07] godog: yeah it's just a typo [13:29:08] volans: ok to stop it? [13:29:11] PROBLEM - puppet last run on mw2257 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:29:11] PROBLEM - puppet last run on labpuppetmaster1002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:29:11] PROBLEM - puppet last run on mw1348 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:29:12] PROBLEM - puppet last run on mw1314 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:29:12] PROBLEM - puppet last run on mw1323 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:29:12] PROBLEM - puppet last run on mw1340 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:29:12] PROBLEM - puppet last run on mw1246 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:29:13] PROBLEM - puppet last run on mw2221 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:29:13] PROBLEM - puppet last run on cp2015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:29:14] sure [13:29:14] PROBLEM - puppet last run on mw2229 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:29:15] PROBLEM - puppet last run on labvirt1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:29:15] PROBLEM - puppet last run on puppetmaster1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:29:16] PROBLEM - puppet last run on mw1230 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:29:16] PROBLEM - puppet last run on mw1289 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:29:17] PROBLEM - puppet last run on mw2289 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:29:17] PROBLEM - puppet last run on mw2173 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:29:39] oof [13:30:17] need a hand? [13:30:31] nah it is fine, thanks herron [13:30:31] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db2071" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463066 (owner: 10Marostegui) [13:30:41] godog: it was just a typo actually [13:30:44] nodes_url vs notes-url [13:30:48] *notes_url [13:30:50] I think [13:31:02] yeah I didn't run the compiler, lesson learned [13:31:38] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db2071" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463066 (owner: 10Marostegui) [13:32:30] I'll force-run puppet [13:32:44] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Repool db2071 (duration: 00m 55s) [13:32:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:17] yeah https://wikitech.wikimedia.org/wiki/Cumin#Run_Puppet_only_if_last_run_failed to the help [13:33:34] hehhe indeed [13:33:40] ok to +r again the channel? [13:34:00] !log Stop replication on s4 eqiad master for maintenance, lag will be generated [13:34:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:11] godog: ok for now, I'll cheking if I can send a quick patch to have it not dye [13:35:14] die [13:36:39] indeed [13:38:11] !log reboot an-master1001 to clear out an issue with systemd@fsck (Hadoop master, failover to an-master1002 included) [13:38:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:04] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db2071" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463066 (owner: 10Marostegui) [13:43:36] (03CR) 10MSantos: [C: 031] "Nice catch. Thanks for looking into that." 
[puppet] - 10https://gerrit.wikimedia.org/r/462918 (https://phabricator.wikimedia.org/T202201) (owner: 10Mholloway) [13:45:17] !log Stop replication on dbstore1002:s4 for maintenance [13:45:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:58] (03PS1) 10Filippo Giunchedi: base: link documentation for negative disk space available reported #2 [puppet] - 10https://gerrit.wikimedia.org/r/463071 (https://phabricator.wikimedia.org/T199198) [13:55:32] zeljkof: how did the train go? [13:55:34] rollback? [13:56:28] addshore: well, I decided to do nothing :D [13:56:52] nothing seems to be broken, so I didn't roll back, but because of the blockers I didn't roll forward too [13:57:00] awesome [13:57:03] so, stuck at group 0 [13:57:06] * addshore will start preping for his slot now then [13:57:08] jouncebot: next [13:57:08] In 1 hour(s) and 2 minute(s): Fix Wikidata UBN - ParserCache Epoch Bump (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180926T1500) [13:57:22] a few of the tasks seems easier to debug with .23 at group 0 [13:57:23] (03PS1) 10Mobrovac: RunSingleJob: Delay job execution while in read-only mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463072 (https://phabricator.wikimedia.org/T204154) [13:57:24] 10Operations, 10Wikidata, 10Wikidata-Query-Service: Add cumin aliases for each wdqs clusters - https://phabricator.wikimedia.org/T205542 (10Gehel) [13:57:42] oh wait [13:57:46] my deploy slot is in 1 hour..... [13:57:51] * addshore disapears for 1 more hour.... [13:58:07] gj [13:58:28] hehe, i hit +2, and then took the +2 away again ... 
[13:58:29] silly me [13:59:30] 10Operations, 10SRE-Access-Requests: Give Mathew full root on wdqs servers - https://phabricator.wikimedia.org/T205543 (10Gehel) [14:00:31] addshore: please ask ops mail [14:00:43] Ack* [14:02:14] 10Operations, 10Wiki-Loves-Love, 10Wikimedia-Mailing-lists: Create a mailling list for Wiki Loves Love - https://phabricator.wikimedia.org/T203792 (10MoritzMuehlenhoff) p:05Triage>03Normal a:03MoritzMuehlenhoff [14:03:07] 10Operations, 10SRE-Access-Requests: Give Mathew full root on wdqs servers - https://phabricator.wikimedia.org/T205543 (10MoritzMuehlenhoff) p:05Triage>03Normal [14:03:23] 10Operations, 10Wikidata, 10Wikidata-Query-Service: Add cumin aliases for each wdqs clusters - https://phabricator.wikimedia.org/T205542 (10MoritzMuehlenhoff) p:05Triage>03Normal [14:04:06] 10Operations, 10Traffic: Integrate certspotter with certcentral to avoid certspotter notifying us on legitimate certs generated by our certcentral boxes - https://phabricator.wikimedia.org/T204994 (10MoritzMuehlenhoff) p:05Triage>03Normal [14:07:12] zeljkof: Merging two backports for .23 [14:07:32] Reedy: deploying them too? 
[14:07:39] Will deploy them when jerkins decides to let me
[14:07:41] Yeah
[14:07:48] It looks like we have a patch for 1 more too
[14:08:03] ok, thanks for letting me know
[14:08:05] Not sure if https://phabricator.wikimedia.org/T205497 actually blocks the train
[14:08:15] train is in sad state at the moment with 4 blockers
[14:08:31] AFK for 10 or so, will sort/deploy them then
[14:08:36] (if jerkins is done)
[14:08:59] Reedy: seems like ryasmeen thinks it's a blocker https://phabricator.wikimedia.org/T205497#4617609
[14:09:02] I don't really know
[14:09:30] Depends on whether it's config or actual code issue from parsiod
[14:09:40] Devs should conform
[14:09:43] Confirm
[14:09:49] T205497 seems fixed right now
[14:09:50] T205497: [Regression pre-wmf.23] REST API on Beta cluster returns content of different pages than requested (breaks VE) - https://phabricator.wikimedia.org/T205497
[14:09:55] but we don't know what fixed it
[14:10:47] Which means it didn't get fixed. The bugs just on holiday
[14:11:49] indeed
[14:16:41] godog: I guess icinga spam is over if you remove +r for a minute I'll restart the bot for now
[14:16:44] to have it
[14:16:54] (I'm not an op ;) )
[14:17:53] volans: spam is not done its been happening for days
[14:18:07] Different spam
[14:18:08] Zppix: I know, I meant icinga spam because of alerts
[14:18:10] volans: ack
[14:18:20] Oh
[14:18:47] https://www.seriouseats.com/images/2012/06/20120627-spam-primary.jpg
[14:18:52] and it's back
[14:18:59] Krinkle: lol
[14:19:14] Lol
[14:20:10] (CR) Ottomata: [C: 1] "Looks ok: https://puppet-compiler.wmflabs.org/compiler1002/12630/" [puppet] - https://gerrit.wikimedia.org/r/462810 (https://phabricator.wikimedia.org/T204088) (owner: Alex Monk)
[14:20:27] _joe_: ^ could you briefly check ?
[14:20:33] When I saw this kind of photo the first time, I assumed it was a photoshop job for fun. The brand wasn't particularly well known in Holland around 2001. And I didn't know about Monthly Python yet!
[14:20:44] <_joe_> ottomata: sure, gimmme a couple minutes
[14:20:49] sure np
[14:21:35] (PS9) Giuseppe Lavagetto: php: add service management for php-fpm [puppet] - https://gerrit.wikimedia.org/r/454478 (https://phabricator.wikimedia.org/T201140)
[14:21:50] (PS1) Andrew Bogott: cloudvirt1019: add ipv6 address [puppet] - https://gerrit.wikimedia.org/r/463079 (https://phabricator.wikimedia.org/T205524)
[14:22:03] Krinkle: zeljkof: i don't think T205497 should be a blocked for the deployment (i commented on it), and it seems that the other three already have merged patches
[14:22:04] T205497: [Regression pre-wmf.23] REST API on Beta cluster returns content of different pages than requested (breaks VE) - https://phabricator.wikimedia.org/T205497
[14:22:21] (CR) jerkins-bot: [V: -1] cloudvirt1019: add ipv6 address [puppet] - https://gerrit.wikimedia.org/r/463079 (https://phabricator.wikimedia.org/T205524) (owner: Andrew Bogott)
[14:23:01] MatmaRex: all I know is that tasks are still subtasks of train, and none of them is resolved
[14:23:34] if you think a task is not a blocker, you should remove it from train subtasks, and if a tasks is resolved, it should be resolved in phab
[14:23:36] addshore: probably, that thing is so far down my priority list :)
[14:23:38] * zeljkof shrugs
[14:23:57] (PS2) Andrew Bogott: cloudvirt1019: add ipv6 address [puppet] - https://gerrit.wikimedia.org/r/463079 (https://phabricator.wikimedia.org/T205524)
[14:24:43] (CR) jerkins-bot: [V: -1] cloudvirt1019: add ipv6 address [puppet] - https://gerrit.wikimedia.org/r/463079 (https://phabricator.wikimedia.org/T205524) (owner: Andrew Bogott)
[14:26:25] (PS3) Andrew Bogott: cloudvirt1019: add ipv6 address [puppet] - https://gerrit.wikimedia.org/r/463079 (https://phabricator.wikimedia.org/T205524)
[14:27:09] (CR) Andrew Bogott: [C: 2] cloudvirt1019: add ipv6 address [puppet] - https://gerrit.wikimedia.org/r/463079
(https://phabricator.wikimedia.org/T205524) (owner: Andrew Bogott)
[14:27:58] zeljkof: I'm amused most of .23 is timestamped 13:37 :)
[14:29:18] Reedy: oh, I did that on purpose ;)
[14:29:54] I think shinken is a victim of unicode cause it died after someone spammed it
[14:30:31] https://phabricator.wikimedia.org/T205522
[14:30:32] Zppix: yes it is
[14:31:12] Operations, IRCecho: ircecho / icinga-wm crashlooping - https://phabricator.wikimedia.org/T205522 (Volans) I've actually confirmed it, the bug is in the python-irc library, I'm looking if I can do a quick fix
[14:31:16] jerkins is on a go slow today
[14:31:20] Operations, ops-eqiad, DBA, Patch-For-Review: db1092 crashed - BBU broken - https://phabricator.wikimedia.org/T205514 (Cmjohnson) A support ticket has been submitted with HPE Case ID: 5332806955
[14:31:27] Reedy: when isnt it?
[14:31:46] Operations, ops-eqiad, DBA, Patch-For-Review: db1092 crashed - BBU broken - https://phabricator.wikimedia.org/T205514 (Marostegui) Thanks!!
:-)
[14:33:02] PROBLEM - Filesystem available is greater than filesystem size on ms-be2041 is CRITICAL: cluster=swift device=/dev/sdn1 fstype=xfs instance=ms-be2041:9100 job=node mountpoint=/srv/swift-storage/sdn1 site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2041&var-datasource=codfw%2520prometheus%252Fops
[14:36:50] (PS1) Andrew Bogott: Added ipv6 for cloudvirt1019.eqiad.wmnet [dns] - https://gerrit.wikimedia.org/r/463083
[14:37:04] (CR) jerkins-bot: [V: -1] Added ipv6 for cloudvirt1019.eqiad.wmnet [dns] - https://gerrit.wikimedia.org/r/463083 (owner: Andrew Bogott)
[14:38:07] (PS2) Andrew Bogott: Added ipv6 for cloudvirt1019.eqiad.wmnet [dns] - https://gerrit.wikimedia.org/r/463083 (https://phabricator.wikimedia.org/T205524)
[14:38:58] (CR) Andrew Bogott: [C: 2] Added ipv6 for cloudvirt1019.eqiad.wmnet [dns] - https://gerrit.wikimedia.org/r/463083 (https://phabricator.wikimedia.org/T205524) (owner: Andrew Bogott)
[14:39:02] Reedy: did you deploy .23 backports? which tasks are they related to? (asking to see if the train can move forward, there is still some time left in the window)
[14:39:09] No, they've not merged yet
[14:39:32] which commits are you merging?
[14:42:22] raynor, MatmaRex, for T205449, both of these commits need to be merged and backported? or just the one that got merged?
https://gerrit.wikimedia.org/r/c/mediawiki/skins/MinervaNeue/+/461848/ https://gerrit.wikimedia.org/r/c/mediawiki/skins/MinervaNeue/+/462749
[14:42:23] T205449: SkinMinerva.php file logs "Undefined variable: returntoquery` error - https://phabricator.wikimedia.org/T205449
[14:42:57] https://gerrit.wikimedia.org/r/463068 and https://gerrit.wikimedia.org/r/463073
[14:43:01] CI still going
[14:43:31] (CR) Marostegui: [C: 1] backups: Setup new backup check for all sections on zarcillo db [puppet] - https://gerrit.wikimedia.org/r/461679 (https://phabricator.wikimedia.org/T203969) (owner: Jcrespo)
[14:43:51] https://gerrit.wikimedia.org/r/#/c/mediawiki/skins/MinervaNeue/+/462749/ 0 this is the fix
[14:44:08] Reedy: so those two when merged will no longer block the train? (if so, could you please remove them from blockers then)
[14:44:33] https://gerrit.wikimedia.org/r/#/c/mediawiki/skins/MinervaNeue/+/461848/ - this is whats causing the issue, it's already in the train code, it doesn't have to backported
[14:44:34] (CR) Marostegui: [C: 1] "I don't know what jenkins means either.
But the +1 is because i am fine with this change and how the check will work" [puppet] - https://gerrit.wikimedia.org/r/461679 (https://phabricator.wikimedia.org/T203969) (owner: Jcrespo)
[14:44:38] zeljkof, ^
[14:44:41] zeljkof: preferably resolve them,not remove
[14:45:24] greg-g: good point, if it's merged in both master and .23 then it's resolved
[14:45:29] * zeljkof got confused
[14:45:53] * zeljkof is trying to juggle 4 tasks, but I can only juggle 3
[14:46:05] zeljkof, -> this is the only one we need to backport (https://gerrit.wikimedia.org/r/#/c/462749/)
[14:46:16] !log reedy@deploy1001 Synchronized php-1.32.0-wmf.23/extensions/CirrusSearch/: T205473 (duration: 01m 06s)
[14:46:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:46:25] T205473: Fatal error "Invalid operand type" from CirrusSearch LinksUpdate - https://phabricator.wikimedia.org/T205473
[14:46:27] to fix the undefined variable returnqueryto error (https://phabricator.wikimedia.org/T205449)
[14:46:27] Reedy: thanks! ^
[14:46:32] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is CRITICAL: 59.88 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[14:46:34] raynor: can we backport it now? I see it's merged in master
[14:46:39] yes, we can
[14:46:44] do you want me to backport it?
[14:46:52] raynor: please do
[14:46:53] Operations, Wikidata, Wikidata-Query-Service: Add cumin aliases for each wdqs clusters - https://phabricator.wikimedia.org/T205542 (Mathew.onipe) a:Mathew.onipe
[14:47:07] it will take a while to merge, but we still have 15 minutes officially
[14:47:58] so that leaves only T205497 as train blocker cc MatmaRex
[14:47:59] T205497: [Regression pre-wmf.23] REST API on Beta cluster returns content of different pages than requested (breaks VE) - https://phabricator.wikimedia.org/T205497
[14:48:03] !log reedy@deploy1001 Synchronized php-1.32.0-wmf.23/includes/specials/SpecialMostlinkedcategories.php: T205469 (duration: 00m 55s)
[14:48:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:48:11] MatmaRex: did you say you don't think it's a blocker?
[14:48:11] T205469: Fatal error from LinkRenderer on Special:MostLinkedCategories ("Object of class HtmlArmor could not be converted") - https://phabricator.wikimedia.org/T205469
[14:48:24] * zeljkof is confused with trying to keep up with a few tasks
[14:48:55] you're doing great zeljkof, sometimes I have problems with keeping up with a single task.
[14:48:58] (PS1) Mathew.onipe: cumin: added aliases for each wdqs clusters [puppet] - https://gerrit.wikimedia.org/r/463088 (https://phabricator.wikimedia.org/T205542)
[14:49:00] Yeah, it's worth keeping an eye on... But I don't think it's n active blocker
[14:49:43] (CR) Giuseppe Lavagetto: [C: -1] "I think this is not the right solution, and introduces additional technical debt." (2 comments) [puppet] - https://gerrit.wikimedia.org/r/462810 (https://phabricator.wikimedia.org/T204088) (owner: Alex Monk)
[14:50:58] greg-g: looks like T205497 will be the only remaining blocker in the near future, and people think it's not actually blocking the train, should I move .23 to group 1 and see what happens?
(when the remaining tasks are resolved)
[14:51:25] I agree with MatmaRex's assessment there in his last comment
[14:51:33] so yes
[14:51:37] ottomata, ^ that change is all yours from here
[14:53:23] jouncebot: next
[14:53:24] In 0 hour(s) and 6 minute(s): Fix Wikidata UBN - ParserCache Epoch Bump (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180926T1500)
[14:53:32] ok thanks Krenair, thanks for helping me figure out why it wasn't working in beta
[14:53:35] (CR) Vgutierrez: [C: 1] api: Make OSE a lot louder [software/certcentral] - https://gerrit.wikimedia.org/r/458936 (owner: Alex Monk)
[14:53:39] Am I still okay to deploy my things in my scheduled slot? :)
[14:53:45] greg-g: ok, Reedy and raynor are backporting/merging/deploying commits for the other three tasks
[14:53:49] greg-g: zeljkof ^^ (in a meeting currently, just checking in advance of the slot)
[14:53:49] (CR) Alexandros Kosiaris: [C: -1] "Is it ? I don't see anything in any of" [puppet] - https://gerrit.wikimedia.org/r/462600 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn)
[14:53:53] Two are done and deployed :P
[14:54:04] addshore: not sure, we're just finishing up train
[14:54:11] addshore: do you need the entire hour?
[14:54:13] https://gerrit.wikimedia.org/r/#/c/mediawiki/skins/MinervaNeue/+/463087/
[14:54:16] godog: lost again :(
[14:54:18] just CR+2
[14:54:33] Reedy: great, thanks!
[14:54:40] godog: I've a meeting in 5 minutes if you want the super-easy fix is also the super-hacky one :)
[14:54:47] Reedy, I was waiting for CI, just for sanity :)
[14:54:47] volans: lost against the spam
[14:54:48] addshore: we should be done "soon", maybe another 30 minutes tops?
[14:54:54] zeljkof: yes, i guess i'm just going to remove it as a blocker
[14:55:00] are you going to deploy that patch?
[15:00:00] raynor: so Reedy is done, you're free to deploy as soon as CI lets you
[14:55:14] raynor: If it's passing on master etc, you can just automatically CR+2 the cherry pick
[14:55:14] yeah, I just noticed the +2
[14:55:19] volans: yeah I don't know tbh, the whole ircecho situation is a bit embarassing in itself
[14:55:27] zeljkof: looks like folks are investigating it and it definitely doesn't look like a wmf.23 issue to me
[14:55:54] indeeed, but we need a solution ~now (or better from tomorrow it might be able to reconnect, but still)
[14:55:54] Reedy - that's right but I usually prefer to wait that ~5mins to just double check :)
[14:55:55] MatmaRex: great, thanks, that's the only thing that is blocking the train now (others are being resolved as we speak)
[14:56:06] Less time to wait if you just CR+2 it
[14:56:08] yeah. nice work
[14:56:13] It still runs the same tests, so if it's gonna fail, it'll still fail
[14:56:43] good point
[14:56:59] Just saves some time
[14:57:17] CR+2/V+2 and submit in one go is a different story. But we try to avoid doing that most of the time :)
[14:57:32] greg-g: okay!
[14:58:08] (CR) Alex Monk: Try to make get_clusters work inside labs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/462810 (https://phabricator.wikimedia.org/T204088) (owner: Alex Monk)
[14:58:29] volans: so yeah honestly I'd go for the super hacky for now as we rely quite a bit on ircecho :( at least I do
[14:59:13] ack, I can revert it tomorrow if the auth works
[15:00:04] addshore: Dear deployers, time to do the Fix Wikidata UBN - ParserCache Epoch Bump deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180926T1500).
[15:00:04] tarrow: A patch you scheduled for Fix Wikidata UBN - ParserCache Epoch Bump is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[15:01:10] volans: we'd still have to make all channels +r to be able to revert the "invalid utf8 crashes ircecho" fix though?
[15:01:31] not really
[15:01:41] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is CRITICAL: 58.02 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[15:02:00] <_joe_> addshore: questions were asked in the ops list about this, which didn't get an answer
[15:02:53] !log rolling restart of wdqs-test and wdqs-internal for JVM + kernel upgrade
[15:02:55] (PS1) Andrew Bogott: Rabbitmq: allow labvirt/cloudvirt access via ipv6 [puppet] - https://gerrit.wikimedia.org/r/463090 (https://phabricator.wikimedia.org/T205524)
[15:02:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:03:32] (CR) jerkins-bot: [V: -1] Rabbitmq: allow labvirt/cloudvirt access via ipv6 [puppet] - https://gerrit.wikimedia.org/r/463090 (https://phabricator.wikimedia.org/T205524) (owner: Andrew Bogott)
[15:05:06] (PS2) Andrew Bogott: Rabbitmq: allow labvirt/cloudvirt access via ipv6 [puppet] - https://gerrit.wikimedia.org/r/463090 (https://phabricator.wikimedia.org/T205524)
[15:05:20] godog: hack done, cannot promise it works, but it should, so if spam in other channels happens, it should skip the data and not restart
[15:05:47] we should temporarily -r all the other channels and restart it to have it everywhere, but maybe worth waiting if it works
[15:05:49] (CR) jerkins-bot: [V: -1] Rabbitmq: allow labvirt/cloudvirt access via ipv6 [puppet] - https://gerrit.wikimedia.org/r/463090 (https://phabricator.wikimedia.org/T205524) (owner: Andrew Bogott)
[15:06:01] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is OK: (C)60 le (W)70 le 70.6 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[15:06:07] Reedy, zeljkof can I deploy the skinminerva backport?
(it's merged)
[15:06:21] Fine by me
[15:07:01] raynor: addshore has the window now, but he said something about being in meeting
[15:07:03] I'm just checking that no one is porting stuff
[15:07:15] <_joe_> addshore: specifically Timo asked if this couldn't be done via a cli script, and Manuel asked about impact on the databases
[15:07:17] addshore: can you confirm raynor can deploy a backport now?
[15:07:19] He's still in a meeting; I hope he comes out soon
[15:07:23] (PS3) Andrew Bogott: Rabbitmq: allow labvirt/cloudvirt access via ipv6 [puppet] - https://gerrit.wikimedia.org/r/463090 (https://phabricator.wikimedia.org/T205524)
[15:07:27] I can wait
[15:07:43] raynor: looks like addshore is in a meeting, go ahead while we wait for him
[15:07:56] roger that
[15:08:04] =]
[15:08:19] addshore: can we deploy? can you wait?
[15:08:30] also, people asking you not do deploy? :)
[15:09:16] DEPLOY ALL THE THINGS
[15:09:50] (CR) Gehel: [C: 1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/463088 (https://phabricator.wikimedia.org/T205542) (owner: Mathew.onipe)
[15:09:59] <_joe_> no I'm asking for questions to be addressed
[15:10:13] (CR) Andrew Bogott: [C: 2] Rabbitmq: allow labvirt/cloudvirt access via ipv6 [puppet] - https://gerrit.wikimedia.org/r/463090 (https://phabricator.wikimedia.org/T205524) (owner: Andrew Bogott)
[15:11:13] Operations, ops-eqiad, DBA, Patch-For-Review, User-Banyek: db1092 crashed - BBU broken - https://phabricator.wikimedia.org/T205514 (Banyek) I will reclone this database instance
[15:11:31] ALL YOUR DEPLOYS BELONG TO US
[15:11:34] (CR) Muehlenhoff: cumin: added aliases for each wdqs clusters (1 comment) [puppet] - https://gerrit.wikimedia.org/r/463088 (https://phabricator.wikimedia.org/T205542) (owner: Mathew.onipe)
[15:12:05] Operations, DBA, MediaWiki-extensions-Translate, Performance-Team, and 2 others: DBPerformance warning "Query returned 22186 rows: query: SELECT * FROM
`translate_metadata`" on Meta-Wiki - https://phabricator.wikimedia.org/T204026 (Gilles)
[15:12:47] (PS1) Andrew Bogott: rabbitmq: fix ferm list syntax [puppet] - https://gerrit.wikimedia.org/r/463092 (https://phabricator.wikimedia.org/T205524)
[15:13:39] (CR) Andrew Bogott: [C: 2] rabbitmq: fix ferm list syntax [puppet] - https://gerrit.wikimedia.org/r/463092 (https://phabricator.wikimedia.org/T205524) (owner: Andrew Bogott)
[15:13:54] zeljkof: you can go if he's not responsive
[15:14:02] zeljkof: the last info to him was we'd be going long
[15:14:02] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[15:14:49] raynor: are you deploying?
[15:14:54] or waiting?
[15:14:55] :D
[15:15:02] deploying,
[15:15:04] * addshore is still waiting for you to say I'm clear :)
[15:15:11] I've assumed you're deploying, unless addshore vetoes :)
[15:15:12] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational
[15:15:15] thanks addshore
[15:15:19] * addshore is not yet :)
[15:15:19] I'm testing it on test.wikipedia.org
[15:15:19] and thanks raynor
[15:15:37] * addshore will however +2 his patches momentairly and wait for jenkins
[15:15:38] greg-g: raynor is already deploying, addshore is waiting
[15:16:06] Operations, Reading-Infrastructure-Team-Backlog, Discovery-Search (Current work), Maps (Tilerator): Log slow queries on postgresql / maps - https://phabricator.wikimedia.org/T204106 (MSantos)
[15:16:52] (CR) Gehel: [C: -1] cumin: added aliases for each wdqs clusters (1 comment) [puppet] - https://gerrit.wikimedia.org/r/463088 (https://phabricator.wikimedia.org/T205542) (owner: Mathew.onipe)
[15:16:56] * addshore has +2ed his 2 patches https://gerrit.wikimedia.org/r/#/q/Icd04008b2544ddaf18879fc1945dc7519f7513a6 but they will take ~20 mins or so to merge now anyway
[15:18:45] I need 3 more mins
[15:18:55]
(PS2) Ottomata: Introduce profile::cumin::selector dummy class [puppet] - https://gerrit.wikimedia.org/r/462810 (https://phabricator.wikimedia.org/T204088) (owner: Alex Monk)
[15:19:32] addshore: did you see _joe_'s quesetions?
[15:19:38] (CR) jerkins-bot: [V: -1] Introduce profile::cumin::selector dummy class [puppet] - https://gerrit.wikimedia.org/r/462810 (https://phabricator.wikimedia.org/T204088) (owner: Alex Monk)
[15:20:41] greg-g: *looks up*
[15:21:04] _joe_: I didn't see any email but I also havn't looked in the last 4 hours or so, let me look
[15:21:27] <_joe_> heh not my questions, Timo and Manuel's
[15:21:46] <_joe_> addshore: most people looked at the ops list this afternoon
[15:21:53] _joe_: i have the emails, reading now :)
[15:21:59] <_joe_> thanks :)
[15:22:24] looks like we will just use this slot for 2 backports now and deal with the cache purge afterwards
[15:23:18] jouncebot: now
[15:23:18] For the next 0 hour(s) and 36 minute(s): Fix Wikidata UBN - ParserCache Epoch Bump (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180926T1500)
[15:23:20] (PS3) Ottomata: Introduce profile::cumin::selector dummy class [puppet] - https://gerrit.wikimedia.org/r/462810 (https://phabricator.wikimedia.org/T204088) (owner: Alex Monk)
[15:23:48] addshore: looks like we should delay your epoch changes, until after that discussion onlist is resolved
[15:24:21] yup, Ideally we would fix this today, as the editing experience is pretty broken
[15:24:29] * addshore looks into the script mentioned
[15:24:29] addshore: you have 2 wikidata related backports?
[15:24:33] zeljkof: yes
[15:24:44] (PS2) Mathew.onipe: cumin: added aliases for each wdqs clusters [puppet] - https://gerrit.wikimedia.org/r/463088 (https://phabricator.wikimedia.org/T205542)
[15:24:53] addshore: is there a task? I mean, are they blocking the train? or unrelated?
[15:25:05] they are un related
[15:25:16] this is actually fallout from the train last week
[15:25:37] raynor: Everything ok?
[15:26:31] yes, doing scap sync
[15:26:36] zeljkof: see ops@ list
[15:26:48] Krinkle: any direction toward this daily cron? :)
[15:26:50] raynor: great, I'm going to SoS in a few minutes, I'll check the train after
[15:27:00] greg-g: will do
[15:27:31] gmm
[15:27:35] zeljkof, question
[15:27:52] aah, it just runs purgeParserCache.php ? *looks*
[15:28:13] ok, nvm
[15:28:38] raynor: go ahead
[15:28:59] nah, nvm
[15:29:06] newbie issue, I was in wrong dir
[15:29:39] !log pmiazga@deploy1001 Synchronized php-1.32.0-wmf.23/skins/MinervaNeue/includes/skins/SkinMinerva.php: SWAT: [[gerrit:463087|Create $returntoquery variable properly(T205449)]] (duration: 00m 56s)
[15:29:40] addshore, zeljkof: deployed
[15:29:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:29:47] T205449: SkinMinerva.php file logs "Undefined variable: returntoquery` error - https://phabricator.wikimedia.org/T205449
[15:29:56] raynor: thanks
[15:30:01] hmm, now I think, it's most probably not SWAT,
[15:30:11] what do write in the comment in such cases? zeljkof
[15:30:22] raynor, Reedy: could you please resolve the tasks that are blocking the train?
[15:30:42] raynor: thanks!
[15:30:44] zeljkof, I'll do it, but first I want to see that error goes away
[15:30:48] raynor: "emergency train fix"?
[15:31:08] give me ~15mins, I'll check wikis and if error is not there I'll resolve the task
[15:31:21] "emergency train fix"?
--> added to my notebook
[15:31:34] raynor: great, I'm in SoS meeting, so I'll be distracted for the next 30 minutes anyway
[15:34:49] (CR) Cwhite: "> I'm still wondering if all this is really needed as it seems quite" [puppet] - https://gerrit.wikimedia.org/r/462793 (owner: Cwhite)
[15:37:51] PROBLEM - Filesystem available is greater than filesystem size on ms-be2040 is CRITICAL: cluster=swift device=/dev/sdd1 fstype=xfs instance=ms-be2040:9100 job=node mountpoint=/srv/swift-storage/sdd1 site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2040&var-datasource=codfw%2520prometheus%252Fops
[15:39:53] (PS17) Jcrespo: backups: Setup new backup check for all sections on zarcillo db [puppet] - https://gerrit.wikimedia.org/r/461679 (https://phabricator.wikimedia.org/T203969)
[15:39:58] right, going to deploy my 2 backports
[15:40:07] (PS2) Jcrespo: mariadb backup monitoring: Add size checks [puppet] - https://gerrit.wikimedia.org/r/462724 (https://phabricator.wikimedia.org/T203969)
[15:40:32] (CR) jerkins-bot: [V: -1] backups: Setup new backup check for all sections on zarcillo db [puppet] - https://gerrit.wikimedia.org/r/461679 (https://phabricator.wikimedia.org/T203969) (owner: Jcrespo)
[15:41:28] (CR) Giuseppe Lavagetto: php: add service management for php-fpm (1 comment) [puppet] - https://gerrit.wikimedia.org/r/454478 (https://phabricator.wikimedia.org/T201140) (owner: Giuseppe Lavagetto)
[15:43:33] (PS3) Mathew.onipe: cumin: added aliases for each wdqs clusters [puppet] - https://gerrit.wikimedia.org/r/463088 (https://phabricator.wikimedia.org/T205542)
[15:43:38] Operations, ops-codfw, fundraising-tech-ops, Patch-For-Review: move/setup/install frauth2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T204079 (Papaul)
[15:43:50] (CR) Jforrester: "> Patch Set 1:" [mediawiki-config] - https://gerrit.wikimedia.org/r/462834
(https://phabricator.wikimedia.org/T201285) (owner: Jforrester)
[15:45:58] (PS10) Giuseppe Lavagetto: php: add service management for php-fpm [puppet] - https://gerrit.wikimedia.org/r/454478 (https://phabricator.wikimedia.org/T201140)
[15:46:20] Operations, Community-Tech, MediaWiki-Parser, Traffic: Show SVGs in page language if available - https://phabricator.wikimedia.org/T205040 (cscott) Note that we're using a non-standard extension to SVG in order to translate SVGs. Some earlier discussion here: {T5593}, {T60920}. That said, that...
[15:47:06] <_joe_> moritzm: ^^ I added an "immutable_config" hash with the values that cannot be overridden as they're defined by systemd
[15:47:10] <_joe_> as you suggested
[15:50:36] (CR) Ppchelko: RunSingleJob: Delay job execution while in read-only mode (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/463072 (https://phabricator.wikimedia.org/T204154) (owner: Mobrovac)
[15:50:58] doing to sync
[15:51:01] *going to
[15:53:08] !log addshore@deploy1001 Synchronized php-1.32.0-wmf.23/extensions/WikibaseLexeme: [[gerrit:463050|form statement groups: ids without lexeme part]] T196226 (duration: 01m 06s)
[15:53:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:53:15] T196226: HTML 'id' values should be unique in Wikidata lexeme pages - https://phabricator.wikimedia.org/T196226
[15:54:37] !log ulsfo X-connect hot cut scheduled for in 5min
[15:54:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:55:04] !log addshore@deploy1001 Synchronized php-1.32.0-wmf.22/extensions/WikibaseLexeme: [[gerrit:463060|form statement groups: ids without lexeme part]] T196226 (duration: 01m 01s)
[15:55:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:55:11] right, that is the 2 backports done
[15:56:50] Operations, Wikimedia-Logstash, monitoring, Scoring-platform-team (Current), and 2 others: Send celery and
wsgi service logs to logstash - https://phabricator.wikimedia.org/T181630 (Ladsgroup) a:Ladsgroup
[15:57:59] (PS4) Ottomata: Introduce cumin::selector dummy class [puppet] - https://gerrit.wikimedia.org/r/462810 (https://phabricator.wikimedia.org/T204088) (owner: Alex Monk)
[15:58:14] (CR) Alex Monk: [C: 2] " vgutierrez, seeing as you've CR+1'd the bottom three on that big stack, maybe it's a good idea to consider merging those?" [software/certcentral] - https://gerrit.wikimedia.org/r/458933 (owner: Alex Monk)
[15:59:00] (PS5) Ottomata: Introduce cumin::selector dummy class [puppet] - https://gerrit.wikimedia.org/r/462810 (https://phabricator.wikimedia.org/T204088) (owner: Alex Monk)
[16:00:09] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: How many deployers does it take to do Morning SWAT (Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180926T1600).
[16:00:09] No GERRIT patches in the queue for this window AFAICS.
[16:00:16] (Merged) jenkins-bot: Change secure opener mode to 640 [software/certcentral] - https://gerrit.wikimedia.org/r/458933 (owner: Alex Monk)
[16:00:43] Hey, I'm going to commandeer SWAT for https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralAuth/+/463105
[16:00:48] oh great, no patches for swat, if all train blockers are resolved, I'll move .23 from group 0 to group 1
[16:00:54] Oh, hey zeljkof.
[16:01:06] Timing. :-)
[16:01:29] James_F: sorry, just checked calendar, looked clean :)
[16:01:37] zeljkof: If you want to push out https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralAuth/+/463104 for wmf.23 and https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralAuth/+/463105 for wmf.22 in the train I'm happy to let you.
;-)
[16:01:54] (CR) jenkins-bot: Change secure opener mode to 640 [software/certcentral] - https://gerrit.wikimedia.org/r/458933 (owner: Alex Monk)
[16:02:03] (CR) Alex Monk: [C: 2] " vgutierrez, seeing as you've CR+1'd the bottom three on that big stack, maybe it's a good idea to consider merging those?" [software/certcentral] - https://gerrit.wikimedia.org/r/458936 (owner: Alex Monk)
[16:02:11] James_F: let me move the train, it should be shorter than the couple of backports you have :D
[16:02:40] James_F: sounds good? train now, swat in 10-20 minutes?
[16:03:03] Sure.
[16:03:56] (Merged) jenkins-bot: api: Make OSE a lot louder [software/certcentral] - https://gerrit.wikimedia.org/r/458936 (owner: Alex Monk)
[16:04:28] herron: does my last comment on https://phabricator.wikimedia.org/T41785 make sense, and is that a tolerable request?
[16:05:06] (CR) Alexandros Kosiaris: [C: -1] "I am guessing the caller (profile::icinga) needs to also be updated" [puppet] - https://gerrit.wikimedia.org/r/462835 (owner: Dzahn)
[16:05:30] andrewbogott: yeah makes sense and happy to rebuild the VMs. I guess we’ll have to move over the floating IP and DNS as well
[16:05:41] (CR) jenkins-bot: api: Make OSE a lot louder [software/certcentral] - https://gerrit.wikimedia.org/r/458936 (owner: Alex Monk)
[16:06:15] herron: yeah, it'll probably be easiest to just give them new IPs, I'm not sure if I can explicitly transfer an IP from one project to another
[16:06:22] thanks in advance for rebuilding
[16:06:33] I guess that's my fault
[16:06:33] sorry!
[16:06:36] ok, and no problem
[16:06:48] paravoid: no worries, it's a good idea
[16:06:59] it will be good to rebuild the hosts anyway with the new role/profile I’m working on so things like spamassassin aren’t left installed
[16:07:00] (PS1) Zfilipin: group1 wikis to 1.32.0-wmf.23 [mediawiki-config] - https://gerrit.wikimedia.org/r/463107
[16:07:02] (CR) Zfilipin: [C: 2] group1 wikis to 1.32.0-wmf.23 [mediawiki-config] - https://gerrit.wikimedia.org/r/463107 (owner: Zfilipin)
[16:07:58] !log downtimed ipsec alerts on cp[12]xxx for ulsfo outage
[16:08:03] (PS2) Mobrovac: RunSingleJob: Delay job execution while in read-only mode [mediawiki-config] - https://gerrit.wikimedia.org/r/463072 (https://phabricator.wikimedia.org/T204154)
[16:08:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:08:11] (Merged) jenkins-bot: group1 wikis to 1.32.0-wmf.23 [mediawiki-config] - https://gerrit.wikimedia.org/r/463107 (owner: Zfilipin)
[16:08:51] (CR) Mobrovac: RunSingleJob: Delay job execution while in read-only mode (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/463072 (https://phabricator.wikimedia.org/T204154) (owner: Mobrovac)
[16:08:57] (CR) jenkins-bot: group1 wikis to 1.32.0-wmf.23 [mediawiki-config] - https://gerrit.wikimedia.org/r/463107 (owner: Zfilipin)
[16:09:59] !log zfilipin@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.32.0-wmf.23
[16:10:54] !log zfilipin@deploy1001 Synchronized php: group1 wikis to 1.32.0-wmf.23 (duration: 00m 54s)
[16:10:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:11:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:11:56] (CR) Ppchelko: [C: 1] "LGTM let's see what Joe thinks" [mediawiki-config] - https://gerrit.wikimedia.org/r/463072 (https://phabricator.wikimedia.org/T204154) (owner: Mobrovac)
[16:13:31] James_F: I'm done, waiting to see if
the usual `Fatal error: entire web request took longer than 60 seconds and timed out` storm after deployment will calm in a few minutes [16:13:44] [{exception_id}] {exception_url} PHP Fatal Error from line 36 of /srv/mediawiki/php-1.32.0-wmf.23/extensions/3D/src/ThreeDThumbnailImage.php: Class undefined: HTML MediaWiki or an installed extension requires this class but it is not embedded directly [16:13:47] * Krinkle files task [16:14:03] zeljkof: Cool. [16:14:27] Krinkle: oh, didn't even notice that, is that a new blocker? [16:14:43] I'm using https://logstash.wikimedia.org/app/kibana#/dashboard/0a9ecdc0-b6dc-11e8-9d8f-dbc23b470465 [16:15:05] anything that isn't a smiley face is an error not currently tracked with a task. [16:15:16] Could be infrequent, so I'm double checking with mediawiki-errors plain to see if it is new [16:15:23] But the 3D one in particular I know is new [16:16:25] Krinkle: Broken by the recent changes? [16:16:28] not sure why that fatals, I thought PHP class names were not case sensitive [16:16:34] Yeah, it's definitely a regression [16:16:43] Oh, right, our class loader [16:16:58] We stopped supporting auto loading of non-canonical case. [16:17:11] So it works as long as something else loaded the class first. [16:17:12] yay [16:17:14] Krinkle: no train roll back required for that one right, seems like it happens rarely? [16:17:16] anyway, UBN [16:18:13] James_F: seems that the "60 second" storm has calmed, I guess it's safe for you to do the swat? [16:18:16] 10Operations, 10IRCecho: ircecho / icinga-wm crashlooping - https://phabricator.wikimedia.org/T205522 (10Volans) Full stacktrace: ``` Sep 26 13:25:02 einsteinium ircecho[39712]: ERROR:root:Exiting Sep 26 13:25:02 einsteinium ircecho[39712]: Traceback (most recent call last): Sep 26 13:25:02 einsteinium ircecho...
[16:18:31] 👍🏽 [16:18:45] !log test formatting sdd on ms-be2040 with crc=0 - T199198 [16:18:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:53] T199198: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198 [16:19:01] 10Operations, 10IRCecho: ircecho / icinga-wm crashlooping - https://phabricator.wikimedia.org/T205522 (10Krenair) Yeah, shinken-wm has been doing the same, @bd808 linked to a fix within his python3-ib3 library [16:20:30] 10Operations, 10IRCecho: ircecho / icinga-wm crashlooping - https://phabricator.wikimedia.org/T205522 (10Krenair) ``` I wonder if those characters made it fail is it written in python by chance ? Platonides, it is * Platonides suspects a UnicodeException there * bd808 is not... [16:20:59] bd808, Platonides, volans: ^ [16:22:33] Krenair: ack. thanks for pasting the irc chat bits there [16:24:02] I've pasted the stacktrace there if you want bd808, Krenair. Basically the library is not catching the decode errors [16:24:06] and that makes it crash [16:24:16] now we have a quick hack in place right now [16:24:32] that I made, super hacky, for which it should not restart but just skip the data, given that it doesn't use it [16:25:00] there has been no spam since then so cannot confirm 100% it's working yet [16:25:11] tomorrow we'll register the nickname [16:25:17] volans: look at https://github.com/bd808/python-ib3/blob/master/ib3/__init__.py#L35-L40 -- the exact syntax may be different in older versions of the irc library, but the LenientDecodingLineBuffer is the trick [16:25:21] Krinkle: Do we just need to pull in the Html class? [16:25:32] that will allow it to reconnect here with the +r [16:25:33] James_F: Need to use \Html instead of \HTML [16:25:42] Ha.
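The ircecho crash discussed above comes down to strict UTF-8 decoding of incoming IRC bytes. A minimal Python sketch of the difference between strict and lenient decoding is given here. Both buffer classes are hypothetical stand-ins written for illustration; they are not the `irc` library's `LenientDecodingLineBuffer`, nor the code in python-ib3 or ircecho.

```python
class StrictLineBuffer:
    """Hypothetical line buffer: splits on CRLF, decodes each line as strict UTF-8."""

    encoding = "utf-8"
    errors = "strict"  # any invalid byte raises UnicodeDecodeError

    def __init__(self):
        self._pending = b""

    def feed(self, data: bytes) -> None:
        """Append raw bytes received from the socket."""
        self._pending += data

    def lines(self):
        """Yield every complete line; keep the trailing partial line buffered."""
        *complete, self._pending = self._pending.split(b"\r\n")
        for raw in complete:
            yield raw.decode(self.encoding, self.errors)


class LenientLineBuffer(StrictLineBuffer):
    """Same buffer, but invalid bytes become U+FFFD instead of raising."""

    errors = "replace"


def read_all(buffer_cls, data: bytes):
    """Feed one chunk into a fresh buffer and return the decoded lines."""
    buf = buffer_cls()
    buf.feed(data)
    return list(buf.lines())


# A Latin-1 encoded "é" (0xE9) is not valid UTF-8 on its own.
SAMPLE = b"PRIVMSG #chan :caf\xe9\r\nPING :server\r\n"

if __name__ == "__main__":
    print(read_all(LenientLineBuffer, SAMPLE))
    try:
        read_all(StrictLineBuffer, SAMPLE)
    except UnicodeDecodeError as exc:
        print("strict buffer failed:", exc.reason)
```

With the strict buffer a single badly encoded message aborts processing, which is the failure mode a bot with no error handling around decoding would hit; the lenient variant keeps the connection alive at the cost of a replacement character in the text.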
[16:25:51] 10Operations, 10ops-codfw, 10fundraising-tech-ops, 10Patch-For-Review: move/setup/install frauth2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T204079 (10Papaul) frmon2001 is already on fasw-c-codfw:ge-[0|1]/0/16 so I move frauth2001 on fasw-c-codfw:ge-[0|1]/0/17 [16:25:51] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [16:26:10] 10Operations, 10ops-codfw, 10fundraising-tech-ops, 10Patch-For-Review: move/setup/install frauth2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T204079 (10Papaul) [16:26:37] bd808: I'm not touching ircecho, E_NOTIME and I think it should be rewritten from scratch and we should have only one bot codebase, not many [16:26:38] James_F: Assuming ed tested, it would've worked because some other code loaded it with the correct casing first. Once loaded, PHP loaded class registry is case insensitive (horrible), but the autoloader isn't. [16:26:55] But in some cases, this code is apparently the first code to trigger the load and then fails fatally [16:26:59] Lovely. 
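The behaviour Krinkle describes (a case-insensitive table of already-loaded classes combined with a case-sensitive autoloader) can be modelled in a few lines. This is a toy Python sketch under stated assumptions, not MediaWiki's autoloader: `ClassRegistry`, its method names, and the file path are all hypothetical.

```python
class ClassRegistry:
    """Toy model of a class loader.

    - `loaded` mimics the runtime class table: lookups ignore case.
    - `autoload_map` mimics an autoloader keyed by canonical spelling:
      lookups are exact-case only.
    """

    def __init__(self, autoload_map):
        self.autoload_map = dict(autoload_map)  # canonical name -> file
        self.loaded = {}                        # lowercased name -> file

    def resolve(self, name):
        key = name.lower()
        # Already loaded: any spelling works.
        if key in self.loaded:
            return self.loaded[key]
        # Not loaded yet: the autoloader only knows the canonical spelling.
        if name not in self.autoload_map:
            raise LookupError(f"Class undefined: {name}")
        self.loaded[key] = self.autoload_map[name]
        return self.loaded[key]


# Hypothetical path, for illustration only.
AUTOLOAD = {"Html": "includes/Html.php"}

if __name__ == "__main__":
    cold = ClassRegistry(AUTOLOAD)
    try:
        cold.resolve("HTML")           # first reference uses the wrong case
    except LookupError as exc:
        print(exc)                     # Class undefined: HTML

    warm = ClassRegistry(AUTOLOAD)
    warm.resolve("Html")               # something else loads it first
    print(warm.resolve("HTML"))        # now the wrong case resolves too
```

The second scenario is why such a bug can pass testing: whether the miscased reference fails depends on load order, not on the reference itself.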
[16:27:05] bd808: by tomorrow it should be able to auth to freenode and reconnect on restart [16:27:22] from then the urgency of the fix will be high, not UBN ;) [16:27:30] wait [16:27:33] there's a relevant xkcd here [16:27:39] https://xkcd.com/927/ [16:27:52] ofc [16:27:52] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [16:27:56] :) [16:28:16] I'm writing a fairy tale to the ops@ list, so stay tuned [16:29:21] I'll keep my eyes peeled [16:32:14] 10Operations, 10Mail, 10Wikimedia-Mailing-lists, 10Security: Sender email spoofing - https://phabricator.wikimedia.org/T160529 (10Tgr) Based on mailman-users posts like [[https://mail.python.org/pipermail/mailman-users/2017-January/081797.html|this]] and [[https://mail.python.org/pipermail/mailman-users/20... [16:34:40] !log jforrester@deploy1001 Synchronized php-1.32.0-wmf.23/extensions/CentralAuth/includes/CentralAuthHooks.php: SWAT deploy I65ae2f05e (duration: 00m 57s) [16:34:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:58] James_F: Wanna swat https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/ORES/+/463098/ as well? [16:35:07] Means we've got 0 regressions from wmf.22 :) [16:35:23] Sure.
[16:35:48] !log jforrester@deploy1001 Synchronized php-1.32.0-wmf.22/extensions/CentralAuth/includes/CentralAuthHooks.php: SWAT deploy I65ae2f05e (duration: 00m 56s) [16:35:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:40] !log shutting down all ulsfo servers for relocation [16:36:41] !log shutting down cr1/2-ulsfo for DC move [16:36:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:52] RECOVERY - Filesystem available is greater than filesystem size on ms-be2040 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2040&var-datasource=codfw%2520prometheus%252Fops [16:38:42] (03PS1) 10Andrew Bogott: Horizon: remove region config for some deleted projects [puppet] - 10https://gerrit.wikimedia.org/r/463109 [16:40:32] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 124, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:40:58] 10Operations, 10IRCecho: ircecho / icinga-wm crashlooping - https://phabricator.wikimedia.org/T205522 (10Volans) p:05Triage>03High [16:41:11] (03PS2) 10Ayounsi: cr1/2-ulsfo -> cr3/4-ulsfo renaming [dns] - 10https://gerrit.wikimedia.org/r/461228 (https://phabricator.wikimedia.org/T189552) [16:41:22] XioNoX: why? :) [16:41:39] ah, I misunderstood the change, nevermind! [16:43:03] 10Operations, 10Scoring-platform-team, 10Wikimedia-Incident: ORES overload incident, 2017-11-28 - https://phabricator.wikimedia.org/T181538 (10awight) [16:43:09] 10Operations, 10Scoring-platform-team, 10Wikimedia-Incident: Investigate overload condition, seems that we lose nodes - https://phabricator.wikimedia.org/T181634 (10awight) 05Open>03Resolved a:03awight Fixed on the new cluster. This was caused by out-of-memory. 
[16:43:36] (03CR) 10Ayounsi: [C: 032] cr1/2-ulsfo -> cr3/4-ulsfo renaming [dns] - 10https://gerrit.wikimedia.org/r/461228 (https://phabricator.wikimedia.org/T189552) (owner: 10Ayounsi) [16:44:07] 10Operations, 10ORES, 10Scap, 10Release-Engineering-Team (Watching / External), 10Scoring-platform-team (Current): ORES should use a git large file plugin for storing serialized binaries - https://phabricator.wikimedia.org/T171619 (10Halfak) [16:44:15] Hmm. I can't ssh into prod any more 'cos I was using bast4001. [16:44:19] Not ideal during a deploy. :-) [16:44:54] * James_F switches to 2001. [16:45:30] James_F: ulsfo (400x) is being relocated as we speak [16:45:47] paravoid: Yeah, I noticed. :-) [16:45:53] k :) [16:46:18] (03PS7) 10Andrew Bogott: keystone: Create top-level domain for each new project [puppet] - 10https://gerrit.wikimedia.org/r/375089 (https://phabricator.wikimedia.org/T162977) (owner: 10Alex Monk) [16:46:35] (03CR) 10Andrew Bogott: [C: 032] Horizon: remove region config for some deleted projects [puppet] - 10https://gerrit.wikimedia.org/r/463109 (owner: 10Andrew Bogott) [16:46:49] 10Operations, 10IRCecho: ircecho / icinga-wm crashlooping - https://phabricator.wikimedia.org/T205522 (10Volans) For now on einsteinium there is a super hacky quick fix that should keep it alive (no more spam was registered so far). Tomorrow we should be able to register the icinga-wm nick and let ircecho auth... [16:47:20] paravoid: Thankfully I hadn't pressed return on my deploy command when I lost connection. 
:-) [16:47:27] !log jforrester@deploy1001 Synchronized php-1.32.0-wmf.23/extensions/ORES/includes/SpecialORESModels.php: SWAT T205228 fix Iddb9c1f9e (duration: 00m 56s) [16:47:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:35] T205228: SpecialORESModels.php: PHP Notice: Undefined index: likelygood, likelybad, verylikelybad - https://phabricator.wikimedia.org/T205228 [16:47:49] (03PS1) 10Cwhite: icinga: enforce mode on nagios files [puppet] - 10https://gerrit.wikimedia.org/r/463112 (https://phabricator.wikimedia.org/T202782) [16:48:14] Krinkle: Deployed. [16:48:19] (03CR) 10jerkins-bot: [V: 04-1] icinga: enforce mode on nagios files [puppet] - 10https://gerrit.wikimedia.org/r/463112 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [16:48:56] Chrome's addressbar is driving me bad. Double-click the subdomain, end up selecting the protocol that appears in its place. [16:49:02] OK, all looks clear. [16:49:15] confirmed, ??? appears on https://he.wikipedia.org/wiki/Special:ORESModels and no errors [16:50:08] (03PS8) 10Andrew Bogott: keystone: Create top-level domain for each new project [puppet] - 10https://gerrit.wikimedia.org/r/375089 (https://phabricator.wikimedia.org/T162977) (owner: 10Alex Monk) [16:51:41] OK, declaring SWAT done. Conch released. [16:52:09] (Also, the canary check logstash link is hard-coded to the eqiad servers. Tut.) 
[16:53:13] (03CR) 10Cwhite: "> I'm still wondering if all this is really needed as it seems quite" [puppet] - 10https://gerrit.wikimedia.org/r/462793 (owner: 10Cwhite) [16:53:33] (03Abandoned) 10Cwhite: Revert "Revert "monitoring: set mode on host and service configs"" [puppet] - 10https://gerrit.wikimedia.org/r/462793 (owner: 10Cwhite) [16:59:01] 10Operations, 10IRCecho: ircecho / icinga-wm crashlooping - https://phabricator.wikimedia.org/T205522 (10Volans) In case you need to revert the hack before tomorrow: ``` ssh einsteinium.wikimedia.org sudo cp /home/volans/client.py /usr/lib/python2.7/dist-packages/irc/client.py # Set #wikimedia-operations (and... [17:05:42] Krinkle: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/3D/+/463114 [17:15:09] I'll do another quick deploy with ^. [17:17:42] k [17:21:01] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [17:22:55] !log jforrester@deploy1001 Synchronized php-1.32.0-wmf.23/extensions/3D/src/ThreeDThumbnailImage.php: Hot-deploy I5bb4b699a fix for T205554 train-blocker (duration: 00m 56s) [17:23:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:03] T205554: PHP Fatal Error "Class undefined: HTML" from ThreeDThumbnailImage.php - https://phabricator.wikimedia.org/T205554 [17:24:16] Krinkle: with that purge parse cache maintenance script I should be able to purge the cache in stages right? [17:24:21] All done. [17:25:22] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [17:25:32] addshore: Yes [17:25:42] PROBLEM - exim queue on mx2001 is CRITICAL: CRITICAL: 3067 mails in exim queue. 
[17:29:09] (03PS1) 10D3r1ck01: Enable Page-previews on Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463119 (https://phabricator.wikimedia.org/T203981) [17:31:34] looking at mx2001 [17:32:04] Krinkle: lovely, I'll do that a bit later [17:34:08] 3k messages in the queue for a single recipient [17:38:21] their server has been throwing 421s for ~48h, so queueing and should deliver after that. high volume though, looks mostly from wikidatawiki [17:41:11] *reads up* [17:41:38] Mx? Wikidata mail? [17:42:14] (03CR) 10Dzahn: "btw, my plan for the next step was to (manually) remove/purge all the icinga packages from icinga1001, delete all the files, and then run " [puppet] - 10https://gerrit.wikimedia.org/r/463112 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [17:43:45] 10Operations, 10Gadgets, 10MediaWiki-Cache, 10Performance-Team (Radar), and 2 others: Mcrouter periodically reports soft TKOs for mc[1,2]035 leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10elukey) @Krinkle one question that I still have (not sure if it was already answered... [17:44:14] _joe_: ack, will review tomorrow [17:45:30] (03CR) 10Dzahn: icinga: enforce mode on nagios files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/463112 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [17:46:33] Hi ops-team - Little ping from Analytics-team before deploying AQS [17:48:16] <_joe_> joal: need assistance or just a warning? 
[17:48:49] Just a warning thanks _joe_ :) I have elukey with me, so all good ;) [17:49:24] !log joal@deploy1001 Started deploy [analytics/aqs/deploy@39b909e]: Deploy Wikistats2 top-metrics updates [17:49:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:29] addshore: yes looks like a relatively high volume of mail from wikidata is going to a user but their email provider is currently down [17:51:28] (03CR) 10Dzahn: "> Will deploy together with https://gerrit.wikimedia.org/r/450228" [puppet] - 10https://gerrit.wikimedia.org/r/450314 (owner: 10Dzahn) [17:54:32] (03CR) 10Dzahn: [C: 04-1] "need to change the actual code as well, not just the comment.." [puppet] - 10https://gerrit.wikimedia.org/r/456312 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [18:01:22] !log joal@deploy1001 Finished deploy [analytics/aqs/deploy@39b909e]: Deploy Wikistats2 top-metrics updates (duration: 11m 58s) [18:01:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:52] RECOVERY - High lag on wdqs2003 is OK: (C)3600 ge (W)1200 ge 1167 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [18:29:36] 10Operations, 10Scoring-platform-team: Let the ORES application set log severity, not uWSGI - https://phabricator.wikimedia.org/T181546 (10awight) @Ladsgroup This might be of interest for your logging spree. 
[18:32:54] (03PS4) 10Cwhite: naggen2: restrict generated defines to valid options [puppet] - 10https://gerrit.wikimedia.org/r/462791 (https://phabricator.wikimedia.org/T202782) [18:33:53] (03PS2) 10Dzahn: systemd::sidekick: replace base_service::unit comment with systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/456312 (https://phabricator.wikimedia.org/T194724) [18:34:36] (03CR) 10jerkins-bot: [V: 04-1] systemd::sidekick: replace base_service::unit comment with systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/456312 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [18:34:39] (03CR) 10Cwhite: "Updated to take feedback into account. Decided to retain the whitelist for now." [puppet] - 10https://gerrit.wikimedia.org/r/462791 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [18:36:50] (03CR) 10Dzahn: "uhmm.. yea.. not sure how to solve this yet " Could not find resource 'Systemd::Service[parent-service]' for relationship on 'Systemd::Sid" [puppet] - 10https://gerrit.wikimedia.org/r/456312 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [18:44:24] (03PS2) 10Dzahn: cache::text: replace (commented) mwmaint1001 with mwmaint1002 [puppet] - 10https://gerrit.wikimedia.org/r/462036 (https://phabricator.wikimedia.org/T201343) [18:45:46] (03PS3) 10Dzahn: cache::text: replace (commented) mwmaint1001 with mwmaint1002 [puppet] - 10https://gerrit.wikimedia.org/r/462036 (https://phabricator.wikimedia.org/T201343) [18:46:16] 10Operations, 10Commons, 10MediaWiki-File-management, 10MediaWiki-Uploading, and 2 others: PHP Warning "Unable to delete stat cache" from file uploads - https://phabricator.wikimedia.org/T205567 (10Krinkle) [18:46:50] 10Operations, 10Commons, 10MediaWiki-File-management, 10MediaWiki-Uploading, and 3 others: PHP Warning "Unable to delete stat cache" from file uploads - https://phabricator.wikimedia.org/T205567 (10Krinkle) Tagging Operations/media-storage as well given this is a fairly unusual backend 
error. Might be due... [18:48:54] 10Operations, 10Commons, 10MediaWiki-File-management, 10MediaWiki-Uploading, and 3 others: PHP Warning "Unable to delete stat cache" from file uploads - https://phabricator.wikimedia.org/T205567 (10Krinkle) [18:55:09] (03PS1) 10GTirloni: cloudvirt102[0-2]: Add IPv6 address [puppet] - 10https://gerrit.wikimedia.org/r/463130 [18:56:04] (03CR) 10Smalyshev: cumin: added aliases for each wdqs clusters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/463088 (https://phabricator.wikimedia.org/T205542) (owner: 10Mathew.onipe) [18:59:26] (03CR) 10Smalyshev: [C: 04-1] Switch public cluster to Kafka event source (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/462907 (https://phabricator.wikimedia.org/T189458) (owner: 10Mathew.onipe) [19:00:04] Deploy window MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180926T1900) [19:01:54] (03CR) 10Smalyshev: "> Patch Set 3:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/461862 (https://phabricator.wikimedia.org/T202830) (owner: 10Smalyshev) [19:05:24] 10Operations, 10Commons, 10MediaWiki-File-management, 10MediaWiki-Uploading, and 3 others: PHP Warning "Unable to delete stat cache" from file uploads - https://phabricator.wikimedia.org/T205567 (10Tgr) That's thrown in response of memcache error. IIRC most BagOStuff subclasses do not have any usefuk error... 
[19:07:10] (03CR) 10GTirloni: "Thanks :)" [puppet] - 10https://gerrit.wikimedia.org/r/463130 (owner: 10GTirloni) [19:12:17] 10Operations, 10Traffic, 10Patch-For-Review: Merge cache_misc into cache_text functionally - https://phabricator.wikimedia.org/T164609 (10Nuria) [19:13:11] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [19:15:29] (03CR) 10Effie Mouzeli: [C: 032] Added zone wikipedia.gr [dns] - 10https://gerrit.wikimedia.org/r/462730 (owner: 10Effie Mouzeli) [19:15:46] (03PS2) 10Effie Mouzeli: Added zone wikipedia.gr [dns] - 10https://gerrit.wikimedia.org/r/462730 [19:16:25] (03CR) 10Andrew Bogott: [C: 031] cloudvirt102[0-2]: Add IPv6 address [puppet] - 10https://gerrit.wikimedia.org/r/463130 (owner: 10GTirloni) [19:16:58] (03CR) 10Paladox: Added zone wikipedia.gr (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/462730 (owner: 10Effie Mouzeli) [19:17:41] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [19:17:48] (03CR) 10GTirloni: [C: 032] cloudvirt102[0-2]: Add IPv6 address [puppet] - 10https://gerrit.wikimedia.org/r/463130 (owner: 10GTirloni) [19:20:56] (03CR) 10Faidon Liambotis: "Wouldn't a symlink to wikimedia.com be more appropriate given the rest of wikipedia.org's contents are not needed for this?" 
[dns] - 10https://gerrit.wikimedia.org/r/462730 (owner: 10Effie Mouzeli) [19:22:55] 10Operations, 10Analytics, 10Analytics-Wikistats, 10Traffic, 10Regression: [Regression] stats.wikipedia.org redirect no longer works ("Domain not served here") - https://phabricator.wikimedia.org/T126281 (10Nuria) 05Open>03Resolved [19:23:21] (03CR) 10Effie Mouzeli: [C: 032] "I used wikipedia.ee as an example which is already configured" [dns] - 10https://gerrit.wikimedia.org/r/462730 (owner: 10Effie Mouzeli) [19:25:04] !log manually installing python3-jinja2 on puppetmaster1001 to test naggen2 python3 upgrade [19:25:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:13] is some maintenance going on with bast4001.wikimedia.org ? Getting timeouts. Should I switch to another one? [19:25:26] SMalyshev: yes, ulsfo is being moved [19:25:51] SMalyshev: will be unavailable today, hopefully back tomorrow (but no promises), so please switch to the other bastions [19:26:01] ok, thanks, switching [19:26:30] yw! [19:26:33] there is also bast4002. 
currently we have 2 bastions there [19:26:43] I wouldn't rely on any 4xxx hosts today [19:28:17] yeah I'll avoid 4xxx for now [19:31:44] (03CR) 10Effie Mouzeli: [C: 032] Added zone wikipedia.gr (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/462730 (owner: 10Effie Mouzeli) [19:35:57] 10Operations, 10fundraising-tech-ops, 10netops: deploy new pfw config - https://phabricator.wikimedia.org/T205574 (10cwdent) [19:38:42] (03PS2) 10Dzahn: icinga::ircbot: move from module to profile [puppet] - 10https://gerrit.wikimedia.org/r/462835 [19:39:18] (03PS5) 10Cwhite: naggen2: restrict generated defines to valid options [puppet] - 10https://gerrit.wikimedia.org/r/462791 (https://phabricator.wikimedia.org/T202782) [19:39:20] (03PS1) 10Cwhite: naggen2: python3 and remove activerecord support [puppet] - 10https://gerrit.wikimedia.org/r/463133 (https://phabricator.wikimedia.org/T202782) [19:42:46] !log krinkle@deploy1001 Synchronized php-1.32.0-wmf.23/extensions/Gadgets/includes/SpecialGadgets.php: (no justification provided) (duration: 00m 57s) [19:42:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:19] (03CR) 10Krinkle: [C: 04-1] Beta: enable MobileFrontend and move some config in to labs settings (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462819 (https://phabricator.wikimedia.org/T205495) (owner: 10Imarlier) [19:44:27] (03PS2) 10Cwhite: icinga: enforce mode on nagios files [puppet] - 10https://gerrit.wikimedia.org/r/463112 (https://phabricator.wikimedia.org/T202782) [19:44:55] (03CR) 10Krinkle: [C: 04-1] Beta: enable MobileFrontend and move some config in to labs settings (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462819 (https://phabricator.wikimedia.org/T205495) (owner: 10Imarlier) [19:44:58] (03PS3) 10Dzahn: icinga::ircbot: move from module to profile [puppet] - 10https://gerrit.wikimedia.org/r/462835 [19:45:53] (03CR) 10Cwhite: [C: 031] icinga::ircbot: move from module to profile 
[puppet] - 10https://gerrit.wikimedia.org/r/462835 (owner: 10Dzahn) [19:46:30] (03PS4) 10Dzahn: icinga::ircbot: move from module to profile [puppet] - 10https://gerrit.wikimedia.org/r/462835 [19:47:48] (03CR) 10Imarlier: Beta: enable MobileFrontend and move some config in to labs settings (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462819 (https://phabricator.wikimedia.org/T205495) (owner: 10Imarlier) [19:47:50] (03PS4) 10Imarlier: Beta: enable MobileFrontend and move some config in to labs settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462819 (https://phabricator.wikimedia.org/T205495) [19:49:49] (03CR) 10Dzahn: "this is already the case on icinga1001 and would not influence einsteinium/tegmen. so looks good. though.. if it's already the case now on" [puppet] - 10https://gerrit.wikimedia.org/r/463112 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [19:52:19] 10Operations, 10DNS, 10Domains, 10Traffic, and 2 others: redirect wikipedia.gr to el.wikipedia.org - https://phabricator.wikimedia.org/T205077 (10jijiki) 05Open>03Resolved Please note that although we have enabled `*.wikipedia.gr` for misguided users, we suggest to not actively enable users to use it f... [19:54:58] !log icinga1001 - temp. disabling puppet, remove --purge all icinga packages, rm -rf /etc/nagios and /etc/icinga, let puppet recreate everything now that it should not mess with user/group on stretch (T202782) [19:55:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:06] T202782: upgrade icinga server to stretch and replace einsteinium - https://phabricator.wikimedia.org/T202782 [20:00:04] cscott, arlolra, subbu, bearND, halfak, and Amir1: #bothumor My software never has bugs. It just develops random features. Rise for Services – Parsoid / Citoid / Mobileapps / ORES / …. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180926T2000). 
[20:01:11] no parsoid deploy today [20:06:53] (03CR) 10Jdlrobson: [C: 04-1] Enable Page-previews on Wikivoyage (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463119 (https://phabricator.wikimedia.org/T203981) (owner: 10D3r1ck01) [20:09:13] PROBLEM - puppet last run on cloudvirt1021 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/rsyslog.d/10-puppet-agent.conf] [20:10:24] checking cloudvirt1021 [20:14:13] (03PS1) 10GTirloni: Add IPv6 for cloudvirt102[0-2].eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/463138 (https://phabricator.wikimedia.org/T205524) [20:14:13] RECOVERY - puppet last run on cloudvirt1021 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [20:16:17] (03CR) 10Andrew Bogott: [C: 031] "Reading those reverse ipv6 addresses makes my head hurt but this looks right." [dns] - 10https://gerrit.wikimedia.org/r/463138 (https://phabricator.wikimedia.org/T205524) (owner: 10GTirloni) [20:21:59] (03CR) 10GTirloni: [C: 032] Add IPv6 for cloudvirt102[0-2].eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/463138 (https://phabricator.wikimedia.org/T205524) (owner: 10GTirloni) [20:41:39] jouncebot: now [20:41:40] For the next 0 hour(s) and 18 minute(s): MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180926T1900) [20:41:41] For the next 0 hour(s) and 18 minute(s): Services – Parsoid / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180926T2000) [20:42:35] addshore: Nothing actually happening right now. 
[20:42:43] James_F: lovely :) [20:42:53] jouncebot: next [20:42:53] In 2 hour(s) and 17 minute(s): Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180926T2300) [20:43:11] I'm going to go ahead and start purging the parser cache for wikidata-intern using the maintenance script in a short while [20:43:13] ^^^ That's the next scheduled window, and it's only got one patch. [20:43:31] * addshore is just going to go for a day at a time and monitor resource usage etc [20:43:35] Right. CC Perf, SRE, RelEng. ;-) [20:44:02] Heh, indeed, not a cache epoch bump though ;) [20:44:48] (03CR) 10Krinkle: "Nice! ):" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/461229 (https://phabricator.wikimedia.org/T204154) (owner: 10Ppchelko) [20:45:34] Krinkle: ^^ fyi [20:46:57] (03CR) 10Krinkle: "The task mentioned 'retry-after'. I think using sleep would be fine, if we're worried about adding complexity to CP." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463072 (https://phabricator.wikimedia.org/T204154) (owner: 10Mobrovac) [20:48:22] PROBLEM - Filesystem available is greater than filesystem size on ms-be2040 is CRITICAL: cluster=swift device=/dev/sde1 fstype=xfs instance=ms-be2040:9100 job=node mountpoint=/srv/swift-storage/sde1 site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2040&var-datasource=codfw%2520prometheus%252Fops [20:51:50] addshore: what is "wikidata-intern" [20:52:27] did you mean test wikidata? [20:53:50] Hmm, in reference to what? [20:54:10] Also, purgeParserCache.php purges all parser caches? Not a specific wiki? [20:54:48] Aaah, sorry, that message before was autocorrect apparently... It was meant to say wikidatawiki .... [20:59:07] (03CR) 10Ppchelko: [C: 031] "retry-after and timeout are designed to fix 2 different problems."
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/463072 (https://phabricator.wikimedia.org/T204154) (owner: 10Mobrovac) [20:59:11] (03PS1) 10Ayounsi: Update mr1-ulsfo OOB IP with DR DIA [dns] - 10https://gerrit.wikimedia.org/r/463142 [20:59:11] PROBLEM - HTTPS-toolserver on www.toolserver.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [21:00:03] (03CR) 10Ayounsi: [C: 032] Update mr1-ulsfo OOB IP with DR DIA [dns] - 10https://gerrit.wikimedia.org/r/463142 (owner: 10Ayounsi) [21:00:24] ooooof [21:00:33] ERR_CONNECTION_REFUSED [21:00:55] (03CR) 10Cwhite: "> Is it ? I don't see anything in any of" [puppet] - 10https://gerrit.wikimedia.org/r/462600 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [21:01:37] Krinkle: what cache key is the cache epoch used in? [21:01:39] (^ about toolserver.org) [21:02:07] addshore: ResourceLoader module version hash, e.g. all load.php varnish urls and localStorage keys. [21:02:19] Also virtually anything else we can think of anywhere in MediaWiki that stores anything anywhere. [21:02:43] ... which is why I think it should be dismantled in favour of something that relates just to ParserCache. [21:03:18] !log tools-mail deleted frozen messages from exim queue [21:03:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:01] but if I use the PC purge script I have to purge all wikis, not just wikidatawiki? [21:04:17] why? [21:04:25] because that's what the script does :/ [21:04:32] I'm not following [21:04:53] I thought it's in core, so it goes via mwscript --wiki. [21:04:59] the only option in the script is expiredate or age, and whatever --wiki you run it on, it will purge the pcache for all wikis as far as I can see [21:05:03] or is it a custom WikimediaMaintenance script?
[21:05:09] the cron job just runs it with --wiki aawiki [21:05:15] right [21:05:33] /usr/local/bin/mwscript purgeParserCache.php --wiki=aawiki --age=1900800 --msleep 500 >/dev/null 2>&1 [21:06:50] addshore: bumping epoch at run time bypasses that [21:07:02] but yeah, it means the CLI won't work unless we do it for all. [21:07:25] hmmmm [21:07:39] I still think using CLI to purge >20day of pc for all wikis is better than bumping cache epoch. [21:07:50] but what cut off were you looking for? [21:07:55] 19th [21:08:10] looking to cut off 15 days [21:08:17] and happy to do it in a couple of chunks [21:08:21] so purge older than 15 days [21:08:35] current cut off being 22 days [21:09:10] I think bumping cache epoch for just wikidata would be fine if we do it at once (from the POV of everything that is not parser cache) [21:09:20] I don't know if it's fine for PC. [21:09:22] purging all cache older than 7 days (keeping caches since the 19th) [21:09:43] So our options are gradual purge of parser cache for all wikis, or instant purge of all caches within wikidata only. [21:09:49] yup [21:10:12] well, not all caches, but all values older than 15 days in all types of caches. [21:10:24] *older than 7 days [21:10:44] oh, I see. [21:10:51] cut off 15 from 22 [21:10:56] yes [21:11:46] but yes, batching seems safer to me. I'd even be happy to do it in day increments really over the course of the next 12 hours for example... [21:11:55] except I'll be sleeping for some of that, but you get my ideas :) [21:11:57] *idea [21:12:12] Yeah, we'd really have to know how parser cache performs typically to know the impact [21:12:18] neither of the options is obviously safe for me. [21:12:23] Can't you make wgCacheEpoch some smooth function? To slowly bump out things on wikidata [21:12:34] already covered that.
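The arithmetic behind the staged purge can be written down explicitly. The cron job's `--age=1900800` is 22 days in seconds, and the goal stated above is to end up at a 7-day cut off by removing one day at a time. The Python sketch below only computes the sequence of `--age` values and formats command strings; it does not run anything. The helper names are hypothetical, the `--wiki`, `--age` and `--msleep` options are copied from the cron line quoted above, and whether each stage keeps load acceptable is exactly the open question in the discussion.

```python
SECONDS_PER_DAY = 86_400


def staged_ages(current_days, target_days, step_days=1):
    """Return --age values (seconds) stepping from the current cut off down to the target.

    Each stage purges entries older than the returned age, so the values
    decrease until the target retention is reached.
    """
    if step_days <= 0:
        raise ValueError("step_days must be positive")
    if target_days >= current_days:
        raise ValueError("target must be shorter than the current retention")
    ages = []
    days = current_days - step_days
    while days > target_days:
        ages.append(days * SECONDS_PER_DAY)
        days -= step_days
    ages.append(target_days * SECONDS_PER_DAY)  # always finish exactly on the target
    return ages


def purge_commands(ages, wiki="aawiki", msleep=500):
    """Format one command line per stage (strings only, nothing is executed)."""
    return [
        f"mwscript purgeParserCache.php --wiki={wiki} --age={age} --msleep {msleep}"
        for age in ages
    ]


if __name__ == "__main__":
    for command in purge_commands(staged_ages(current_days=22, target_days=7)):
        print(command)
```

Going from 22 days to 7 days in one-day steps gives 15 stages, the first at `--age=1814400` (21 days) and the last at `--age=604800` (7 days).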
[21:12:51] wgCacheEpoch is used as a key (not just as a date) for lots of things that are not parser cache [21:13:00] :( [21:13:13] sounds like there should be a ticket for not using it in keys :P [21:13:16] it's become the nuclear wildcard of caches. [21:13:34] $wgInvalidateCacheOnLocalSettingsChange [21:13:50] heh [21:14:00] (actual variable name) [21:14:04] brb, rfc meeting [21:19:22] Right, well, I'm going to purge 1 day of the parser cache and see how it affects things. I imagine the last 7 days are probably the most volatile, but also the days that I don't need to touch [21:20:03] (03PS1) 10Herron: WIP: smarthost: create mail smarthost role/profile [puppet] - 10https://gerrit.wikimedia.org/r/463143 (https://phabricator.wikimedia.org/T41785) [21:20:38] !log restarting releases jenkins for updates [21:20:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:20:45] (03CR) 10jerkins-bot: [V: 04-1] WIP: smarthost: create mail smarthost role/profile [puppet] - 10https://gerrit.wikimedia.org/r/463143 (https://phabricator.wikimedia.org/T41785) (owner: 10Herron) [21:26:34] (03PS1) 10GTirloni: tools mail: write RBL check warning to file [puppet] - 10https://gerrit.wikimedia.org/r/463144 (https://phabricator.wikimedia.org/T202558) [21:26:52] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [21:28:01] 10Puppet, 10Cloud-Services, 10Patch-For-Review, 10cloud-services-team (Kanban): Ban spam arriving to my tools email - https://phabricator.wikimedia.org/T202558 (10GTirloni) I couldn't find any warning messages in the log files. It seems `warn message` will actually return the message to the SMTP client whi...
[21:29:02] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [21:37:59] (03PS1) 10BryanDavis: toolforge: Use valid docroot for www.toolserver.org [puppet] - 10https://gerrit.wikimedia.org/r/463146 [21:40:36] 10Operations, 10Cloud-Services, 10Mail, 10Patch-For-Review: Create a Cloud VPS SMTP smarthost - https://phabricator.wikimedia.org/T41785 (10herron) >>! In T41785#4616409, @Andrew wrote: > rebuild those VMs there, and then we can delete the project-smtp project. We'll wind up with new IPs and all over ther... [22:00:42] RECOVERY - HTTPS-toolserver on www.toolserver.org is OK: SSL OK - Certificate stable.toolserver.org valid until 2018-11-07 16:17:05 +0000 (expires in 41 days) [22:01:46] 10Operations, 10ops-ulsfo, 10Traffic, 10netops, 10Patch-For-Review: Rack/cable/configure ulsfo MX204 - https://phabricator.wikimedia.org/T189552 (10ayounsi) [22:02:00] * addshore just sent 1 more email about the wikidata parser cache purge. tldr, I'll do it tomorrow during eu daytime using the script [22:02:53] 10Operations, 10hardware-requests: Refresh (leased) restbase2001-2006 - https://phabricator.wikimedia.org/T205092 (10Eevans) NOTE: As discussed with @mark and @Fjalapeno earlier today, the current cluster is overly heterogeneous (Dell & HP of different specs, Samsung & Intel SSDs of varying size and number); I... 
[22:04:02] PROBLEM - HTTPS-toolserver on www.toolserver.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [22:04:12] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [22:04:29] (03CR) 10Dzahn: [C: 032] toolforge: Use valid docroot for www.toolserver.org [puppet] - 10https://gerrit.wikimedia.org/r/463146 (owner: 10BryanDavis) [22:06:31] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [22:07:31] RECOVERY - HTTPS-toolserver on www.toolserver.org is OK: SSL OK - Certificate stable.toolserver.org valid until 2018-11-07 16:17:05 +0000 (expires in 41 days) [22:20:33] the toolserver flapping alerts is me trying to fix it [22:21:00] got to the part where apache stays fine and just puppet has an issue. but better than puppet breaking apache :) [22:23:36] Hmm, wikibugs hasn't mentioned a task in here since ~12:00 SF time. Has there really been no Phab activity on SRE tickets since then? [22:24:06] I'll give it a poke [22:24:20] Ta. [22:24:50] 10Operations: test bot - https://phabricator.wikimedia.org/T205589 (10Krenair) [22:24:51] 10Operations: test bot - https://phabricator.wikimedia.org/T205589 (10Krenair) [22:24:58] wait why was that twice [22:25:10] 10Operations: test bot - https://phabricator.wikimedia.org/T205589 (10Krenair) twice? [22:25:21] okay... [22:25:32] 10Operations: test bot - https://phabricator.wikimedia.org/T205589 (10Krenair) 05Open>03Invalid [22:25:34] Hmm, creations appear twice and edits once? [22:25:42] well I did qmod -rj [22:25:51] It might have started the new one and then killed the old? [22:25:58] Plausibly. [22:26:00] idk [22:26:11] Anyway, seemingly working now? 
[22:26:13] yeah [22:26:22] Thanks. [22:27:22] ACKNOWLEDGEMENT - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 124, down: 1, dormant: 0, excluded: 0, unused: 0: Ayounsi ulsfo DC move https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:28:37] !log enable BGP to transits on cr3-ulsfo [22:28:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:30:52] (03CR) 10Ayounsi: [C: 032] Puppet, rename all instances of cr1/2-ulsfo to cr3/4 [puppet] - 10https://gerrit.wikimedia.org/r/461233 (https://phabricator.wikimedia.org/T189552) (owner: 10Ayounsi) [22:34:36] (03PS2) 10Ayounsi: Puppet, rename all instances of cr1/2-ulsfo to cr3/4 [puppet] - 10https://gerrit.wikimedia.org/r/461233 (https://phabricator.wikimedia.org/T189552) [22:35:19] (03CR) 10Ayounsi: [C: 032] Puppet, rename all instances of cr1/2-ulsfo to cr3/4 [puppet] - 10https://gerrit.wikimedia.org/r/461233 (https://phabricator.wikimedia.org/T189552) (owner: 10Ayounsi) [22:39:19] (03PS1) 10Dzahn: toolserver: add status.toolserver.org ServerAlias [puppet] - 10https://gerrit.wikimedia.org/r/463153 [22:42:18] (03CR) 10Dzahn: [C: 032] toolserver: add status.toolserver.org ServerAlias [puppet] - 10https://gerrit.wikimedia.org/r/463153 (owner: 10Dzahn) [22:42:56] PROBLEM - Router interfaces on cr1-eqsin is CRITICAL: CRITICAL: host 103.102.166.129, interfaces up: 73, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:44:56] RECOVERY - Router interfaces on cr1-eqsin is OK: OK: host 103.102.166.129, interfaces up: 75, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:45:06] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 26 probes of 315 (alerts on 25) - https://atlas.ripe.net/measurements/11645088/#!map 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [22:50:07] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 16 probes of 315 (alerts on 25) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [22:50:13] (03PS1) 10Dzahn: toolserver_legacy: add status.toolserver.org to LE cert [puppet] - 10https://gerrit.wikimedia.org/r/463156 [22:51:18] (03CR) 10Dzahn: [C: 032] toolserver_legacy: add status.toolserver.org to LE cert [puppet] - 10https://gerrit.wikimedia.org/r/463156 (owner: 10Dzahn) [22:51:38] (03PS1) 10BryanDavis: toolforge: Update toolserver redirect pages [puppet] - 10https://gerrit.wikimedia.org/r/463157 [22:54:19] (03CR) 10BryanDavis: toolforge: Update toolserver redirect pages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/463157 (owner: 10BryanDavis) [22:54:52] mutante: if you are not sick of toolserver_legacy yet ^ [22:57:41] oh, yea. that looks fine. doing [22:57:51] (03PS2) 10Dzahn: toolforge: Update toolserver redirect pages [puppet] - 10https://gerrit.wikimedia.org/r/463157 (owner: 10BryanDavis) [22:59:07] (03CR) 10Dzahn: [C: 032] toolforge: Update toolserver redirect pages [puppet] - 10https://gerrit.wikimedia.org/r/463157 (owner: 10BryanDavis) [22:59:16] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 65, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor My software never has bugs. It just develops random features. Rise for Evening SWAT (Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180926T2300). [23:00:04] jdlrobson: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:05] 10Operations, 10Cloud-Services, 10Mail, 10Patch-For-Review: Create a Cloud VPS SMTP smarthost - https://phabricator.wikimedia.org/T41785 (10Andrew) I created public DNS: ``` Andrews-MacBook-Pro-3:~ andrew$ dig +short mx-out01.cloudinfra.wmflabs.org 185.15.56.18 Andrews-MacBook-Pro-3:~ andrew$ dig +short m... [23:00:06] PROBLEM - IPsec on cp1084 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4021_v4, cp4021_v6, cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [23:00:07] PROBLEM - IPsec on cp1077 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp4027_v4, cp4027_v6, cp4028_v4, cp4028_v6, cp4029_v4, cp4029_v6, cp4030_v4, cp4030_v6, cp4031_v4, cp4031_v6, cp4032_v4, cp4032_v6 [23:00:07] PROBLEM - IPsec on cp1083 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp4027_v4, cp4027_v6, cp4028_v4, cp4028_v6, cp4029_v4, cp4029_v6, cp4030_v4, cp4030_v6, cp4031_v4, cp4031_v6, cp4032_v4, cp4032_v6 [23:00:16] PROBLEM - IPsec on cp1079 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp4027_v4, cp4027_v6, cp4028_v4, cp4028_v6, cp4029_v4, cp4029_v6, cp4030_v4, cp4030_v6, cp4031_v4, cp4031_v6, cp4032_v4, cp4032_v6 [23:00:16] PROBLEM - IPsec on cp2001 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp4027_v4, cp4027_v6, cp4028_v4, cp4028_v6, cp4029_v4, cp4029_v6, cp4030_v4, cp4030_v6, cp4031_v4, cp4031_v6, cp4032_v4, cp4032_v6 [23:00:16] PROBLEM - IPsec on cp2023 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp4027_v4, cp4027_v6, cp4028_v4, cp4028_v6, cp4029_v4, cp4029_v6, cp4030_v4, cp4030_v6, cp4031_v4, cp4031_v6, cp4032_v4, cp4032_v6 [23:00:17] PROBLEM - IPsec on cp2014 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp4021_v4, cp4021_v6, cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [23:00:17] PROBLEM - IPsec on cp1078 is CRITICAL: Strongswan 
CRITICAL - ok: 56 not-conn: cp4021_v4, cp4021_v6, cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [23:00:17] PROBLEM - IPsec on cp1081 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp4027_v4, cp4027_v6, cp4028_v4, cp4028_v6, cp4029_v4, cp4029_v6, cp4030_v4, cp4030_v6, cp4031_v4, cp4031_v6, cp4032_v4, cp4032_v6 [23:00:20] \o [23:00:26] PROBLEM - IPsec on cp2019 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp4027_v4, cp4027_v6, cp4028_v4, cp4028_v6, cp4029_v4, cp4029_v6, cp4030_v4, cp4030_v6, cp4031_v4, cp4031_v6, cp4032_v4, cp4032_v6 [23:00:26] PROBLEM - IPsec on cp1090 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4021_v4, cp4021_v6, cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [23:00:26] PROBLEM - IPsec on cp2007 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp4027_v4, cp4027_v6, cp4028_v4, cp4028_v6, cp4029_v4, cp4029_v6, cp4030_v4, cp4030_v6, cp4031_v4, cp4031_v6, cp4032_v4, cp4032_v6 [23:00:27] PROBLEM - IPsec on cp1088 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4021_v4, cp4021_v6, cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [23:00:27] PROBLEM - IPsec on cp1089 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp4027_v4, cp4027_v6, cp4028_v4, cp4028_v6, cp4029_v4, cp4029_v6, cp4030_v4, cp4030_v6, cp4031_v4, cp4031_v6, cp4032_v4, cp4032_v6 [23:00:27] PROBLEM - IPsec on cp1082 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4021_v4, cp4021_v6, cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [23:00:36] PROBLEM - IPsec on cp2022 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp4021_v4, cp4021_v6, cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [23:00:37] PROBLEM - IPsec on cp2026 is CRITICAL: Strongswan 
CRITICAL - ok: 52 not-conn: cp4021_v4, cp4021_v6, cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [23:00:46] PROBLEM - IPsec on cp1087 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp4027_v4, cp4027_v6, cp4028_v4, cp4028_v6, cp4029_v4, cp4029_v6, cp4030_v4, cp4030_v6, cp4031_v4, cp4031_v6, cp4032_v4, cp4032_v6 [23:00:46] PROBLEM - IPsec on cp1086 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4021_v4, cp4021_v6, cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [23:00:47] PROBLEM - IPsec on cp1085 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp4027_v4, cp4027_v6, cp4028_v4, cp4028_v6, cp4029_v4, cp4029_v6, cp4030_v4, cp4030_v6, cp4031_v4, cp4031_v6, cp4032_v4, cp4032_v6 [23:00:54] those are all ulsfo hosts.. i guess it's part of the move ? [23:00:56] PROBLEM - IPsec on cp2005 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp4021_v4, cp4021_v6, cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [23:00:56] PROBLEM - IPsec on cp2004 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp4027_v4, cp4027_v6, cp4028_v4, cp4028_v6, cp4029_v4, cp4029_v6, cp4030_v4, cp4030_v6, cp4031_v4, cp4031_v6, cp4032_v4, cp4032_v6 [23:00:56] PROBLEM - IPsec on cp2002 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp4021_v4, cp4021_v6, cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [23:00:57] PROBLEM - IPsec on cp2008 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp4021_v4, cp4021_v6, cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [23:00:57] PROBLEM - IPsec on cp2010 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp4027_v4, cp4027_v6, cp4028_v4, cp4028_v6, cp4029_v4, cp4029_v6, cp4030_v4, cp4030_v6, cp4031_v4, cp4031_v6, cp4032_v4, cp4032_v6 
[23:00:57] PROBLEM - IPsec on cp2024 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp4021_v4, cp4021_v6, cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [23:00:57] PROBLEM - IPsec on cp1075 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp4027_v4, cp4027_v6, cp4028_v4, cp4028_v6, cp4029_v4, cp4029_v6, cp4030_v4, cp4030_v6, cp4031_v4, cp4031_v6, cp4032_v4, cp4032_v6 [23:01:06] PROBLEM - IPsec on cp2013 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp4027_v4, cp4027_v6, cp4028_v4, cp4028_v6, cp4029_v4, cp4029_v6, cp4030_v4, cp4030_v6, cp4031_v4, cp4031_v6, cp4032_v4, cp4032_v6 [23:01:06] PROBLEM - IPsec on cp2011 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp4021_v4, cp4021_v6, cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [23:01:06] PROBLEM - IPsec on cp2020 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp4021_v4, cp4021_v6, cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [23:01:06] PROBLEM - IPsec on cp1076 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4021_v4, cp4021_v6, cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [23:01:06] PROBLEM - IPsec on cp2016 is CRITICAL: Strongswan CRITICAL - ok: 40 connecting: cp4031_v6 not-conn: cp4027_v4, cp4027_v6, cp4028_v4, cp4028_v6, cp4029_v4, cp4029_v6, cp4030_v4, cp4030_v6, cp4031_v4, cp4032_v4, cp4032_v6 [23:01:07] PROBLEM - IPsec on cp2017 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp4021_v4, cp4021_v6, cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [23:01:07] PROBLEM - IPsec on cp1080 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4021_v4, cp4021_v6, cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 
[23:01:12] 10Operations, 10Cloud-Services, 10Mail, 10Patch-For-Review: Create a Cloud VPS SMTP smarthost - https://phabricator.wikimedia.org/T41785 (10Krenair) I think we were going to do DNS directly under wmflabs.org, if the name is decided? [23:02:57] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 63, down: 3, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:03:38] mutante: yeah, the downtime expired, extending it [23:03:55] cool, thx [23:04:03] are swats cancelled today because of all that ^^^ ? [23:04:10] no [23:04:29] great :) any of the evening swatters able to help me out? [23:05:29] (03PS1) 10Ayounsi: Icinga, update mr1-ulsfo OOB IP [puppet] - 10https://gerrit.wikimedia.org/r/463159 [23:08:24] MaxSem: around? [23:08:35] meeting, sorry [23:10:36] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [23:10:50] thcipriani: don't suppose you are around? [23:11:38] * mutante merges that unmerged change [23:11:46] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. [23:12:00] * thcipriani stops mid-way through slinking out the door [23:12:06] I can SWAT [23:13:25] thank you thcipriani <3 [23:13:50] sure thing :) [23:18:30] jdlrobson: your change is live on mwdebug2002, check please [23:18:40] _joe_ hi, wondering when you get a chance could you review / merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/462770 please? :) [23:18:44] testing! [23:19:18] thcipriani follow up from the puppet merge on monday. [23:19:19] (03CR) 10Ayounsi: [C: 032] "https://puppet-compiler.wmflabs.org/compiler1002/12645/einsteinium.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/463159 (owner: 10Ayounsi) [23:21:19] LGTM thcipriani please sync! 
[23:21:24] * thcipriani does [23:23:20] bd808: the new logo and links are online btw [23:23:44] !log thcipriani@deploy1001 Synchronized php-1.32.0-wmf.23/extensions/WikimediaEvents/modules/all/ext.wikimediaEvents.readingDepth.js: SWAT: [[gerrit:463139|Ensure Minerva has initialised before loading and executing ReadingDepth]] T204144 (duration: 00m 57s) [23:23:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:23:52] ^ jdlrobson live now [23:23:52] T204144: ReadingDepth sometimes initialises before PageIssues leading to incorrect or missing ReadingDepth events - https://phabricator.wikimedia.org/T204144 [23:24:14] thanks mutante! The 404 handler seems to be not doing what I expected. I may try to figure that out [23:24:39] thanks thcipriani will monitor the graphs [23:24:42] 10Operations, 10Puppet, 10Wikimedia-Apache-configuration, 10Wikimedia-Language-setup, 10Wiki-Setup (Delete / Redirect): Redirect several wikis - https://phabricator.wikimedia.org/T169450 (10Liuxinyu970226) [23:24:49] jdlrobson: cool, sounds good :) [23:25:03] ACKNOWLEDGEMENT - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 65, down: 1, dormant: 0, excluded: 0, unused: 0: Ayounsi ulsfo dc move https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:25:03] ACKNOWLEDGEMENT - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 63, down: 3, dormant: 0, excluded: 0, unused: 0: Ayounsi ulsfo dc move https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:28:13] 10Operations, 10ops-ulsfo, 10Traffic, 10netops, 10Patch-For-Review: Rack/cable/configure ulsfo MX204 - https://phabricator.wikimedia.org/T189552 (10ayounsi) [23:30:14] (03CR) 10Dzahn: "19:47:12 Resolved violations:" [puppet] - 10https://gerrit.wikimedia.org/r/462835 (owner: 10Dzahn) [23:33:22] (03CR) 10Dzahn: 
"https://puppet-compiler.wmflabs.org/compiler1002/12646/einsteinium.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/462835 (owner: 10Dzahn) [23:33:57] (03CR) 10Dzahn: [C: 032] icinga::ircbot: move from module to profile [puppet] - 10https://gerrit.wikimedia.org/r/462835 (owner: 10Dzahn) [23:34:05] (03PS5) 10Dzahn: icinga::ircbot: move from module to profile [puppet] - 10https://gerrit.wikimedia.org/r/462835 [23:38:09] thanks thcipriani nothing on fire so we're all good here. Thanks for jumping in to help! [23:41:54] jdlrobson: yw, thanks for making sure nothing is on fire :) [23:56:52] PROBLEM - Juniper alarms on mr1-ulsfo is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 198.24.47.102 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [23:57:21] PROBLEM - Router interfaces on mr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.24.47.102 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down