[00:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor My software never has bugs. It just develops random features. Rise for Evening SWAT (Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181127T0000). [00:00:04] marlier, bmansurov, Jdlrobson, Zoranzoki21, and MaxSem: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [00:00:09] I'll SWAT! [00:00:10] here [00:00:27] Hi [00:00:55] (03CR) 10Niharika29: [C: 032] "SWAT." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475105 (https://phabricator.wikimedia.org/T209882) (owner: 10Bmansurov) [00:02:35] (03Merged) 10jenkins-bot: Labs: enable the reader trust survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475105 (https://phabricator.wikimedia.org/T209882) (owner: 10Bmansurov) [00:03:49] (03PS1) 10Papaul: DHCP: Add MAC address entries for sessionstore200[123] [puppet] - 10https://gerrit.wikimedia.org/r/475924 (https://phabricator.wikimedia.org/T209389) [00:03:53] !log niharika29@deploy1001 sync-file aborted: (no justification provided) (duration: 00m 03s) [00:03:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:04:33] (03CR) 10Niharika29: [C: 032] "SWAT." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/475367 (https://phabricator.wikimedia.org/T210171) (owner: 10Zoranzoki21) [00:05:06] !log niharika29@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: Enable trust survey on labs T209882 (duration: 00m 46s) [00:05:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:05:09] T209882: Quicksurvey for reader trust - https://phabricator.wikimedia.org/T209882 [00:05:11] bmansurov: Done^ [00:05:54] (03Merged) 10jenkins-bot: Delete 'Импортировано' namespace from ru.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475367 (https://phabricator.wikimedia.org/T210171) (owner: 10Zoranzoki21) [00:06:11] Niharika: Should I test my patch or it can be moved directly at production? [00:06:34] Zoranzoki21: I am not sure actually. Is the namespace empty? [00:06:44] Niharika: Yes [00:07:01] \o mine is a no op [00:07:14] Niharika: Thanks, but I don't see the change yet. Do I have to wait a little? [00:07:41] Zoranzoki21: Let's try to test it. The namespace should not show up in the search dropdown once it's deleted. [00:08:04] bmansurov: I synced the change so it should be showing up now...Which wiki are you testing on? [00:08:08] Niharika: Ok. It will be on mwdebug1002? [00:08:21] Niharika: I'm testing it here: https://en.wikipedia.beta.wmflabs.org/wiki/Book [00:08:27] Zoranzoki21: Yes, give me a minute. [00:09:03] bmansurov: Seems alright. Maybe caching issues? Give it a few minutes maybe? [00:09:14] Niharika: OK [00:09:53] Zoranzoki21: It's on mwdebug1002 now. [00:10:03] Hi I'm here, sorry for delay [00:10:37] Niharika: Looks ok [00:10:41] marlier: No worries. It'll take me a few minutes to get to you. [00:11:22] Zoranzoki21: Yeah, I see it's no longer in the search. Syncing it... [00:11:57] The beta-config change will be live there when the postmerge Jenkins job for 475105,6 at https://integration.wikimedia.org/zuul/ and it child jobs complete, which hasn't started yet. 
[00:12:47] bmansurov: ^ [00:12:52] Thanks Krinkle. [00:12:53] !log niharika29@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Delete a namespace from ruwikisource T210171 (duration: 00m 46s) [00:12:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:12:57] T210171: Delete 'Импортировано' namespace from ru.wikisource - https://phabricator.wikimedia.org/T210171 [00:13:03] Zoranzoki21: ^ Please test in prod. [00:13:12] Niharika: ok, thanks [00:13:24] (03PS3) 10Niharika29: Set wgMinervaSchemaMainMenuClickTrackingSampleRate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475882 (https://phabricator.wikimedia.org/T205008) (owner: 10Jdlrobson) [00:14:05] (03CR) 10Niharika29: [C: 032] "SWAT." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475882 (https://phabricator.wikimedia.org/T205008) (owner: 10Jdlrobson) [00:14:14] Niharika: Looks good at production [00:14:17] (03CR) 10jenkins-bot: Labs: enable the reader trust survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475105 (https://phabricator.wikimedia.org/T209882) (owner: 10Bmansurov) [00:14:19] (03CR) 10jenkins-bot: Delete 'Импортировано' namespace from ru.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475367 (https://phabricator.wikimedia.org/T210171) (owner: 10Zoranzoki21) [00:15:06] bmansurov: beta also auto-deploys changes from mediawiki (as opposed to op/mw-config), and it is currently doing an unrelated auto-deploy still, which take about 25min each, and has about 10min left. Then after that, your change will auto deploy after that and complete about 35min from now. 
[00:15:26] (03Merged) 10jenkins-bot: Set wgMinervaSchemaMainMenuClickTrackingSampleRate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475882 (https://phabricator.wikimedia.org/T205008) (owner: 10Jdlrobson) [00:15:40] This is not visible on Gerrit, but is shown at https://integration.wikimedia.org/ci/ under deployment-deploy01 [00:15:54] (where deployment = beta cluster, because names are hard) [00:16:06] Krinkle: I see. Do you think my change will work as is or do I have to change wg to wmg on line 408 of https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/475105/5/wmf-config/InitialiseSettings-labs.php? [00:17:01] bmansurov: wg should work as-is. wmg would not work, actually. 'wg' is for real configuration variables used by MW or MW extensions. 'wmg' is for fake configuration variables that we only use elsewhere within wmf-config, e.g. to control switches within CommonSettings.php [00:17:25] Krinkle: got it. Thanks! [00:17:27] !log niharika29@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Set wgMinervaSchemaMainMenuClickTrackingSampleRate T205008 (duration: 00m 46s) [00:17:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:17:31] jdlrobson: Done^ [00:17:31] T205008: AMC: wgMFSchemaMainMenuClickTrackingSampleRate should be set in production not MobileFrontend - https://phabricator.wikimedia.org/T205008 [00:17:54] (03PS5) 10Niharika29: wmf-config: Enable wgMFNoindexPages for 6 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473889 (https://phabricator.wikimedia.org/T206497) (owner: 10Imarlier) [00:17:54] Niharika: I have to sign off. I'll check the status tomorrow. Please don't abort the merge. [00:18:21] bmansurov: Don't worry about it. The patch is already merged and synced to beta. 
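The 'wg' versus 'wmg' distinction Krinkle describes at 00:17:01 can be sketched as a toy model. This is only an illustration under stated assumptions: the dictionary layout, the helper names (`common_settings`, `mediawiki_reads`) and the example setting names are all hypothetical and are not the real wmf-config or MediaWiki code.

```python
# Toy model only. Every name below is hypothetical and chosen for
# illustration; none of it is the actual wmf-config implementation.

def common_settings(settings):
    """Stand-in for CommonSettings.php.

    'wg' entries pass through unchanged. 'wmg' entries are switches that
    only this function knows how to interpret; they are not passed on.
    """
    result = {k: v for k, v in settings.items() if k.startswith("wg")}
    # A 'wmg' switch takes effect only because code here checks for it.
    if settings.get("wmgEnableExampleFeature"):
        result["wgExampleFeatureEnabled"] = True
    return result


def mediawiki_reads(final_settings, name):
    """Stand-in for MediaWiki or an extension reading a setting by its
    exact 'wg' name. Returns None when the setting is absent."""
    return final_settings.get(name)


initialise_settings = {
    "wgExampleSurveyEnabled": True,    # read directly by the software
    "wmgEnableExampleFeature": True,   # switch handled in common_settings
    "wmgExampleSurveyRate": 0.5,       # nothing checks this, so no effect
}

final = common_settings(initialise_settings)
```

In this model, renaming a real setting from a 'wg' prefix to a 'wmg' prefix would make it silently disappear, because nothing downstream looks it up under the new name.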
[00:18:31] Niharika: ok, thanks [00:18:39] (03CR) 10Niharika29: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473889 (https://phabricator.wikimedia.org/T206497) (owner: 10Imarlier) [00:19:14] 10Operations, 10Analytics, 10EventBus, 10WMF-JobQueue, and 5 others: Kafka eqiad.mediawiki.page-delete topic is empty - https://phabricator.wikimedia.org/T210451 (10Pchelolo) Not to be worried. We have all the failed events stored since 2018-04-18. If needed, I will fetch all the missing page deletes tomor... [00:19:48] (03Merged) 10jenkins-bot: wmf-config: Enable wgMFNoindexPages for 6 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473889 (https://phabricator.wikimedia.org/T206497) (owner: 10Imarlier) [00:20:12] (03CR) 10Dzahn: [C: 032] icinga::web: do not use PHP anymore [puppet] - 10https://gerrit.wikimedia.org/r/475901 (https://phabricator.wikimedia.org/T208257) (owner: 10Dzahn) [00:20:20] (03PS2) 10Dzahn: icinga::web: do not use PHP anymore [puppet] - 10https://gerrit.wikimedia.org/r/475901 (https://phabricator.wikimedia.org/T208257) [00:21:05] marlier: Your patch is on mwdebug1002. [00:21:27] (03CR) 10Niharika29: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475919 (https://phabricator.wikimedia.org/T208899) (owner: 10MaxSem) [00:21:33] testing [00:21:47] 10Operations, 10Analytics, 10EventBus, 10WMF-JobQueue, and 5 others: Kafka eqiad.mediawiki.page-delete topic is empty - https://phabricator.wikimedia.org/T210451 (10Ottomata) Oo, I just did the same, or, at least I copied the relevant files. They are on stat1004:/home/otto/eventbus-validation-logs0. Stas... 
[00:22:35] (03PS2) 10Niharika29: Enable SVGs in page language everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475919 (https://phabricator.wikimedia.org/T208899) (owner: 10MaxSem) [00:22:40] (03CR) 10Niharika29: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475919 (https://phabricator.wikimedia.org/T208899) (owner: 10MaxSem) [00:23:24] 10Operations, 10Analytics, 10EventBus, 10WMF-JobQueue, and 5 others: Kafka eqiad.mediawiki.page-delete topic is empty - https://phabricator.wikimedia.org/T210451 (10Smalyshev) I think I've extracted all I need from the DB tables for now, but I'll double-check and if anything is still missing I check the ex... [00:23:59] (03Merged) 10jenkins-bot: Enable SVGs in page language everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475919 (https://phabricator.wikimedia.org/T208899) (owner: 10MaxSem) [00:25:28] MaxSem: You're up on mwdebug1002 too. [00:25:48] Niharika: WFM [00:27:12] (03CR) 10jenkins-bot: Set wgMinervaSchemaMainMenuClickTrackingSampleRate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475882 (https://phabricator.wikimedia.org/T205008) (owner: 10Jdlrobson) [00:27:14] (03CR) 10jenkins-bot: wmf-config: Enable wgMFNoindexPages for 6 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473889 (https://phabricator.wikimedia.org/T206497) (owner: 10Imarlier) [00:27:16] (03CR) 10jenkins-bot: Enable SVGs in page language everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475919 (https://phabricator.wikimedia.org/T208899) (owner: 10MaxSem) [00:27:42] !log niharika29@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable SVGs in page language everywhere T208899 (duration: 00m 46s) [00:27:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:27:45] T208899: Rollout SVGs in page language - https://phabricator.wikimedia.org/T208899 [00:28:25] Confirmed working after the sync [00:28:32] Thanks, Niharika [00:28:37] (03PS1) 
10Dzahn: icinga: stop having separate configs for jessie/stretch [puppet] - 10https://gerrit.wikimedia.org/r/475927 (https://phabricator.wikimedia.org/T202782) [00:29:08] !log niharika29@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: Enable SVGs in page language everywhere T208899 (duration: 00m 45s) [00:29:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:29:24] (03PS2) 10Dzahn: icinga: stop having separate configs for jessie/stretch [puppet] - 10https://gerrit.wikimedia.org/r/475927 (https://phabricator.wikimedia.org/T202782) [00:29:33] Thank YOU, MaxSem! [00:30:37] Niharika: you said 1002? i'm not seeing what i should be... [00:30:57] marlier: Yep. mwdebug1002. [00:31:21] It should be there unless there's a special step involved here. [00:31:56] shouldn't be... [00:32:03] (03CR) 10Dzahn: [C: 032] "Compilation results for icinga1001.wikimedia.org: no change" [puppet] - 10https://gerrit.wikimedia.org/r/475927 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [00:32:29] marlier: I can try syncing again. [00:33:13] If you don't mind, I'd appreciate it. [00:33:39] marlier: Done. [00:35:05] marlier: Also double checked on the server to make sure your code is on mwdebug1002. [00:36:07] MaxSem: Do you think there could be a caching issue for https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/473889/? [00:36:20] It's not showing up on mwdebug1002 for Ian. [00:36:45] It could be caching, since if affects render (specifically, the page head for every page) [00:37:07] Niharika: touch and sync InitializeSettings [00:37:49] PROBLEM - puppet last run on analytics1030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:37:49] But honestly, this should be in InitializeSettings [00:38:27] *Initialise [00:38:40] MaxSem: left it where it was for consistency, since it shouldn't exist for more than 1-2 weeks at the outside. 
We'll move to where it ought to be once we know if it matters. [00:38:57] !log niharika29@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Fix weird caching issue maybe for Ian's patch (duration: 00m 46s) [00:38:59] MaxSem: Done. [00:38:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:39:01] didn't want to tackle refactoring the whole mobile.php file in order to run a quick experiment. [00:39:27] marlier: Try now? [00:39:41] testing [00:39:46] (03PS9) 10Dzahn: icinga: remove jessie support [puppet] - 10https://gerrit.wikimedia.org/r/473276 (https://phabricator.wikimedia.org/T202782) [00:40:12] I disagree that "temporary" tech debt should be allowed, especially debt as easily fixable as this one - there's nothing more permanent than temporary measures [00:40:31] (03CR) 10MaxSem: wmf-config: Enable wgMFNoindexPages for 6 wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473889 (https://phabricator.wikimedia.org/T206497) (owner: 10Imarlier) [00:41:13] Still nothing :-( [00:41:31] Probably best to back it out and I'll figure out why it's not actually loading as expected. [00:41:38] Sorry [00:42:01] marlier: No worries. Sorry it didn't work. I'll revert it for now. [00:42:12] (03PS1) 10Niharika29: Revert "wmf-config: Enable wgMFNoindexPages for 6 wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475928 [00:42:48] (03CR) 10Niharika29: [C: 032] "Created a revert as it did not work during Ian's testing in the SWAT window." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475928 (owner: 10Niharika29) [00:44:16] (03Merged) 10jenkins-bot: Revert "wmf-config: Enable wgMFNoindexPages for 6 wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475928 (owner: 10Niharika29) [00:44:27] (03CR) 10Dzahn: [C: 032] "Compilation results for icinga1001.wikimedia.org: no change https://puppet-compiler.wmflabs.org/compiler1002/13717/icinga1001.wikimedia." 
[puppet] - 10https://gerrit.wikimedia.org/r/473276 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [00:44:32] Thanks, sorry about that Niharika [00:44:48] No problem. [00:45:28] (03PS4) 10Dzahn: decom einsteinium remove from netboot and DHCP [puppet] - 10https://gerrit.wikimedia.org/r/474390 (https://phabricator.wikimedia.org/T209738) [00:46:11] (03PS1) 10Papaul: PARTMAN: Add sessionstore200[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/475929 (https://phabricator.wikimedia.org/T209389) [00:47:07] (03CR) 10Dzahn: [C: 032] PARTMAN: Add sessionstore200[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/475929 (https://phabricator.wikimedia.org/T209389) (owner: 10Papaul) [00:47:19] (03PS2) 10Dzahn: PARTMAN: Add sessionstore200[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/475929 (https://phabricator.wikimedia.org/T209389) (owner: 10Papaul) [00:47:22] Revert merged and fixed on mwdebug1002 too. [00:47:25] SWAT done. [00:48:50] 10Operations, 10ops-codfw, 10Services (watching): rack/setup/install restbase201[3-8].codfw.wmnet - https://phabricator.wikimedia.org/T209615 (10Papaul) [00:50:24] (03CR) 10Dzahn: [C: 032] decom einsteinium remove from netboot and DHCP [puppet] - 10https://gerrit.wikimedia.org/r/474390 (https://phabricator.wikimedia.org/T209738) (owner: 10Dzahn) [00:53:04] (03PS5) 10Dzahn: decom einsteinium remove from netboot and DHCP [puppet] - 10https://gerrit.wikimedia.org/r/474390 (https://phabricator.wikimedia.org/T209738) [00:53:50] (03CR) 10jenkins-bot: Revert "wmf-config: Enable wgMFNoindexPages for 6 wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475928 (owner: 10Niharika29) [00:54:56] (03CR) 10Dzahn: [C: 032] decom einsteinium remove from netboot and DHCP [puppet] - 10https://gerrit.wikimedia.org/r/474390 (https://phabricator.wikimedia.org/T209738) (owner: 10Dzahn) [00:59:04] 10Operations, 10Icinga, 10decommission, 10monitoring: decom einsteinium - https://phabricator.wikimedia.org/T209738 (10Dzahn) a:05Dzahn>03None 
[00:59:11] 10Operations, 10Icinga, 10decommission, 10monitoring: decom einsteinium - https://phabricator.wikimedia.org/T209738 (10Dzahn) 05stalled>03Open [00:59:15] 10Operations, 10monitoring, 10Patch-For-Review: upgrade icinga server to stretch and replace einsteinium - https://phabricator.wikimedia.org/T202782 (10Dzahn) [01:00:08] thank you for the swat Niharika :) [01:00:46] (03PS2) 10Dzahn: DHCP: Add MAC address entries for sessionstore200[123] [puppet] - 10https://gerrit.wikimedia.org/r/475924 (https://phabricator.wikimedia.org/T209389) (owner: 10Papaul) [01:01:15] (03CR) 10Dzahn: [C: 032] DHCP: Add MAC address entries for sessionstore200[123] [puppet] - 10https://gerrit.wikimedia.org/r/475924 (https://phabricator.wikimedia.org/T209389) (owner: 10Papaul) [01:03:18] RECOVERY - puppet last run on analytics1030 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [01:06:52] I see what the problem was [01:07:09] the variable isn't $wgDBName, it's $wgDBname [01:08:59] (03CR) 10Legoktm: "See inline comment about why it didn't work. In general, Max is right - this should have used InitialiseSettings.php, which, incidentally " (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473889 (https://phabricator.wikimedia.org/T206497) (owner: 10Imarlier) [01:09:10] * legoktm hugs MaxSem [01:09:26] * MaxSem hugs back [01:11:17] 10Operations, 10ops-codfw, 10Core Platform Team (Session Management Service (CDP2)), 10Core Platform Team Backlog (Watching / External), and 2 others: rack/setup/install sessionstore200[123].codfw.wmnet - https://phabricator.wikimedia.org/T209389 (10Papaul) ` papaul@asw-b-codfw> show interfaces ge-4/0/1 d... [01:24:13] (03PS1) 10Faidon Liambotis: Initial commit of quotervwr [software] - 10https://gerrit.wikimedia.org/r/475933 [01:33:46] PROBLEM - puppet last run on db1072 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [01:59:38] RECOVERY - puppet last run on db1072 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [02:04:11] (03CR) 10Mathew.onipe: elasticsearch: cookbook for multi-cluster services rolling restart (035 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/467964 (https://phabricator.wikimedia.org/T207919) (owner: 10Mathew.onipe) [02:31:41] 10Operations, 10ops-codfw, 10Core Platform Team (Session Management Service (CDP2)), 10Core Platform Team Backlog (Watching / External), and 2 others: rack/setup/install sessionstore200[123].codfw.wmnet - https://phabricator.wikimedia.org/T209389 (10Papaul) [02:51:14] PROBLEM - SSH db1105.mgmt on db1105.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:57:26] (03PS25) 10Mathew.onipe: elasticsearch: cookbook for multi-cluster services rolling restart [cookbooks] - 10https://gerrit.wikimedia.org/r/467964 (https://phabricator.wikimedia.org/T207919) [03:05:09] (03PS1) 10Papaul: DNS: Add mgmt and production DNS entries for restbase201[3-8] [dns] - 10https://gerrit.wikimedia.org/r/475939 (https://phabricator.wikimedia.org/T209615) [03:10:24] 10Operations, 10ops-codfw: rack/setup/install elastic201[6-9], elastic202[0-9] and elastic203[0-3] - https://phabricator.wikimedia.org/T210450 (10Papaul) [03:34:58] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 883.72 seconds [03:48:01] (03PS1) 10Mathew.onipe: regex: add new elastic20[37-54] to their rows [puppet] - 10https://gerrit.wikimedia.org/r/475942 (https://phabricator.wikimedia.org/T210265) [03:51:04] PROBLEM - puppet last run on prometheus1004 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [03:51:14] RECOVERY - SSH db1105.mgmt on db1105.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) [04:10:57] (03PS2) 10Mathew.onipe: regex.yaml: add new elastic2037-elastic2054 to their rows [puppet] - 10https://gerrit.wikimedia.org/r/475942 (https://phabricator.wikimedia.org/T210265) [04:11:00] (03PS1) 10Mathew.onipe: cirrus.yaml: add new elastic2037-elastic2054 to existing clusters [puppet] - 10https://gerrit.wikimedia.org/r/475944 (https://phabricator.wikimedia.org/T210265) [04:15:20] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 252.40 seconds [04:16:58] RECOVERY - puppet last run on prometheus1004 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [04:29:03] (03PS1) 10Chad: Revert "Remove unblockself rights everywhere" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475947 [05:03:50] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [05:04:10] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [05:06:34] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [05:07:26] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [05:26:31] 10Operations, 10Deployments, 10Operations-Software-Development: Make failures on foreachwiki more obvious the deployer - https://phabricator.wikimedia.org/T210474 (10dbarratt) [05:26:48] 10Operations, 10Deployments, 10Operations-Software-Development: Make failures on foreachwiki more obvious the deployer - https://phabricator.wikimedia.org/T210474 (10dbarratt) [06:05:42] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 53, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:06:24] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 122, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:10:43] (03CR) 10MusikAnimal: [C: 04-1] "Not sure if you saw T210192. I see removing unblockself as an emergency measure that doesn't have to be permanent. 
I've no opposition to d" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475947 (owner: 10Chad) [06:12:03] (03PS1) 10Marostegui: db-eqiad.php: Depool db1090:3312 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475950 [06:13:37] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1090:3312 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475950 (owner: 10Marostegui) [06:14:41] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1090:3312 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475950 (owner: 10Marostegui) [06:15:00] !log Stop mysql on db1095:3312 to get it recloned [06:15:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:16:03] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1090:3312 (duration: 00m 49s) [06:16:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:16:15] !log Stop mysql on db1090:3312 to clone db1095:3312 [06:16:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:22:16] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1090:3312 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475950 (owner: 10Marostegui) [06:26:20] (03PS1) 10Marostegui: mariadb: Provision pc1007-pc1010 [puppet] - 10https://gerrit.wikimedia.org/r/475951 (https://phabricator.wikimedia.org/T208383) [06:26:53] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Provision pc1007-pc1010 [puppet] - 10https://gerrit.wikimedia.org/r/475951 (https://phabricator.wikimedia.org/T208383) (owner: 10Marostegui) [06:29:01] (03PS2) 10Marostegui: mariadb: Provision pc1007-pc1010 [puppet] - 10https://gerrit.wikimedia.org/r/475951 (https://phabricator.wikimedia.org/T208383) [06:30:10] 10Operations, 10ops-eqiad, 10DBA: rack/setup/install pc1007-pc1010 - https://phabricator.wikimedia.org/T207258 (10Marostegui) pc1008, pc1009 and pc1010 look good! 
[06:33:42] 10Operations, 10ops-eqiad, 10DBA: rack/setup/install pc1007-pc1010 - https://phabricator.wikimedia.org/T207258 (10Marostegui) @Cmjohnson Actually pc1008 needs to get the RAID rebuilt - it has strip size 64. The other two pc1009 and pc1010 are ok and have 256. [06:34:01] 10Operations, 10ops-eqiad, 10DBA: rack/setup/install pc1007-pc1010 - https://phabricator.wikimedia.org/T207258 (10Marostegui) [06:35:41] (03PS3) 10Marostegui: mariadb: Provision pc1007-pc1010 [puppet] - 10https://gerrit.wikimedia.org/r/475951 (https://phabricator.wikimedia.org/T208383) [07:16:46] !log Start MySQL on db1095:3312 after recloning it [07:16:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:16:58] !log Start MySQL on db1090:3312 after recloning db1095:3312 [07:17:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:49] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1090:3312" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475953 [07:21:08] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1090:3312" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475953 (owner: 10Marostegui) [07:22:12] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1090:3312" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475953 (owner: 10Marostegui) [07:23:16] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1090:3312 after cloning db1095:3312 (duration: 00m 46s) [07:23:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:42] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1090:3312" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475953 (owner: 10Marostegui) [07:35:47] <_joe_> !log depooling mw1261 for benchmarking, T206341 [07:35:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:50] T206341: Evaluate scalability and performance of PHP7 compared to HHVM - 
https://phabricator.wikimedia.org/T206341 [07:46:20] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 124, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:46:48] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 55, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:49:37] (03CR) 10Filippo Giunchedi: "Swift produces a significant amount of logs (~1-4G/day compressed) I think this should wait until we have more space in eqiad/codfw for lo" [puppet] - 10https://gerrit.wikimedia.org/r/475898 (https://phabricator.wikimedia.org/T63780) (owner: 10Herron) [07:55:16] (03PS1) 10Muehlenhoff: Add AAAA record for labmon1002 [dns] - 10https://gerrit.wikimedia.org/r/475954 [07:55:57] (03PS1) 10Vgutierrez: certcentral: Provide TLS certificates for apt.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/475955 (https://phabricator.wikimedia.org/T207050) [07:55:59] (03PS1) 10Vgutierrez: installserver: Deploy certcentral managed TLS certificate [puppet] - 10https://gerrit.wikimedia.org/r/475956 (https://phabricator.wikimedia.org/T207050) [07:56:35] (03CR) 10jerkins-bot: [V: 04-1] certcentral: Provide TLS certificates for apt.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/475955 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [07:57:04] (03CR) 10jerkins-bot: [V: 04-1] installserver: Deploy certcentral managed TLS certificate [puppet] - 10https://gerrit.wikimedia.org/r/475956 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [07:57:12] (03PS2) 10Vgutierrez: certcentral: Provide TLS certificates for apt.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/475955 (https://phabricator.wikimedia.org/T207050) [07:57:24] (03PS2) 10Vgutierrez: installserver: Deploy certcentral managed TLS certificate 
[puppet] - 10https://gerrit.wikimedia.org/r/475956 (https://phabricator.wikimedia.org/T207050) [07:57:26] (03CR) 10Filippo Giunchedi: "Too bad we can't match patterns on rsyslog lookup :( good enough for now!" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/475879 (https://phabricator.wikimedia.org/T210455) (owner: 10Herron) [07:59:03] (03CR) 10Vgutierrez: [C: 032] certcentral: Provide TLS certificates for apt.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/475955 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [08:03:03] (03CR) 10Filippo Giunchedi: [C: 031] Add AAAA record for labmon1002 [dns] - 10https://gerrit.wikimedia.org/r/475954 (owner: 10Muehlenhoff) [08:08:04] (03PS1) 10Muehlenhoff: Remove IRCDStats collector [puppet] - 10https://gerrit.wikimedia.org/r/475957 (https://phabricator.wikimedia.org/T183454) [08:15:21] (03CR) 10Vgutierrez: [C: 032] installserver: Deploy certcentral managed TLS certificate [puppet] - 10https://gerrit.wikimedia.org/r/475956 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [08:15:41] !log more weight to new ms-be hosts in codfw - T209395 [08:15:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:44] T209395: rack/setup/install new ms-be servers ms-be204[4-9] ,ms-be2050 - https://phabricator.wikimedia.org/T209395 [08:21:38] (03CR) 10Alexandros Kosiaris: [C: 04-1] "I generally dislike local/peer authentication due to the fact that it tightly couples the software talking to the database with the databa" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/475093 (https://phabricator.wikimedia.org/T206639) (owner: 10Mathew.onipe) [08:28:20] (03PS1) 10Vgutierrez: install_server: Use certcentral managed TLS certificate for apt.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/475963 [08:29:50] (03CR) 10Alexandros Kosiaris: [C: 032] "PCC at https://puppet-compiler.wmflabs.org/compiler1002/13708/ is nominal and expected. 
It touches a bit more services than usual, I 'll b" [puppet] - 10https://gerrit.wikimedia.org/r/426016 (https://phabricator.wikimedia.org/T192102) (owner: 10Alexandros Kosiaris) [08:29:57] (03PS3) 10Alexandros Kosiaris: service::uwsgi: Stop using --autoload in uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/426016 (https://phabricator.wikimedia.org/T192102) [08:34:46] (03PS1) 10Elukey: druid: direct metrics events to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/475964 [08:37:37] (03CR) 10Elukey: [C: 032] druid: direct metrics events to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/475964 (owner: 10Elukey) [08:39:44] (03PS2) 10Vgutierrez: install_server: Use certcentral managed TLS certificate for apt.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/475963 [08:39:46] (03PS1) 10Vgutierrez: install_server: Remove old LE puppetization [puppet] - 10https://gerrit.wikimedia.org/r/475965 (https://phabricator.wikimedia.org/T207050) [08:41:19] !log Use a TLS certificate managed by certcentral in apt.wm.o - T207050 [08:41:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:23] T207050: Migrate most standard public TLS certificates to CertCentral issuance - https://phabricator.wikimedia.org/T207050 [08:41:32] (03CR) 10Vgutierrez: [C: 032] install_server: Use certcentral managed TLS certificate for apt.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/475963 (owner: 10Vgutierrez) [08:41:44] (03PS3) 10Vgutierrez: install_server: Use certcentral managed TLS certificate for apt.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/475963 [08:42:31] 10Operations, 10ops-codfw, 10netops: codfw row B recable and add QFX - https://phabricator.wikimedia.org/T210456 (10Peachey88) [08:43:29] !log roll restart of all druid daemons on druid100[1-6] for openjdk-8 upgrades [08:43:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:52] openssl s_client -servername apt.wikimedia.org -connect 
apt.wikimedia.org:443 2>/dev/null | openssl x509 -noout -dates [08:47:52] notBefore=Nov 27 07:00:56 2018 GMT [08:47:56] looking good ^^ [08:52:08] (03PS1) 10Muehlenhoff: Absent Redis Diamond collector on Redis slaves [puppet] - 10https://gerrit.wikimedia.org/r/475967 (https://phabricator.wikimedia.org/T183454) [08:56:22] (03PS1) 10Vgutierrez: certcentral: Provide dhparam.pem when deploying certificates [puppet] - 10https://gerrit.wikimedia.org/r/475968 (https://phabricator.wikimedia.org/T207050) [08:57:09] (03CR) 10jerkins-bot: [V: 04-1] certcentral: Provide dhparam.pem when deploying certificates [puppet] - 10https://gerrit.wikimedia.org/r/475968 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [08:59:05] (03PS1) 10Muehlenhoff: Absent Redis Diamond collector on Redis masters [puppet] - 10https://gerrit.wikimedia.org/r/475969 (https://phabricator.wikimedia.org/T183454) [09:00:19] !log executing schema change on db2035 for s2 (T85757) [09:00:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:24] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [09:04:24] (03CR) 10DCausse: [C: 04-1] "cross cluster search config must be setup on every elastic instances" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475750 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [09:05:44] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [09:08:57] (03PS2) 10Vgutierrez: certcentral: Handle common requirements for certcentral clients [puppet] - 10https://gerrit.wikimedia.org/r/475968 (https://phabricator.wikimedia.org/T207050) [09:09:14] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds.
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [09:09:49] (03CR) 10jerkins-bot: [V: 04-1] certcentral: Handle common requirements for certcentral clients [puppet] - 10https://gerrit.wikimedia.org/r/475968 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [09:10:39] oh.. our current letsencrypt module isn't kosher [09:14:01] (03CR) 10Banyek: [C: 031] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/475951 (https://phabricator.wikimedia.org/T208383) (owner: 10Marostegui) [09:14:15] (03CR) 10Marostegui: [C: 032] mariadb: Provision pc1007-pc1010 [puppet] - 10https://gerrit.wikimedia.org/r/475951 (https://phabricator.wikimedia.org/T208383) (owner: 10Marostegui) [09:14:22] (03PS4) 10Marostegui: mariadb: Provision pc1007-pc1010 [puppet] - 10https://gerrit.wikimedia.org/r/475951 (https://phabricator.wikimedia.org/T208383) [09:15:53] I didn't logged but codfw slave lag is expected [09:16:22] ACKNOWLEDGEMENT - MariaDB Slave Lag: s2 on db2056 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 379.24 seconds Banyek T85757 [09:17:27] banyek: why not silencing all the codfw hosts before the alter? 
[09:18:23] I have no reason, it was a mistake [09:18:28] PROBLEM - MariaDB Slave Lag: s2 on db2088 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 391.34 seconds [09:18:44] Sure no worries, just asking if there was an specific reason not to ;) [09:19:07] ACKNOWLEDGEMENT - MariaDB Slave Lag: s2 on db2088 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 391.34 seconds Banyek T85757 [09:19:51] there was no lag on s6, it just wasn't in my mind it will happen now, on next iteration I'll fix this [09:20:03] banyek: but these wikis are bigger [09:20:32] "a good priest learns until the grave" [09:20:40] RECOVERY - MariaDB Slave Lag: s2 on db2088 is OK: OK slave_sql_lag Replication lag: 0.20 seconds [09:24:34] (03PS1) 10Marostegui: db-eqiad.php: Depool pc1004 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475971 (https://phabricator.wikimedia.org/T208383) [09:29:46] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool pc1004 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475971 (https://phabricator.wikimedia.org/T208383) (owner: 10Marostegui) [09:32:50] (03CR) 10Filippo Giunchedi: [C: 031] Remove IRCDStats collector [puppet] - 10https://gerrit.wikimedia.org/r/475957 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [09:33:40] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool pc1004 - T208383 (duration: 00m 46s) [09:33:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:44] T208383: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] - https://phabricator.wikimedia.org/T208383 [09:33:46] (03Merged) 10jenkins-bot: db-eqiad.php: Depool pc1004 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475971 (https://phabricator.wikimedia.org/T208383) (owner: 10Marostegui) [09:33:48] (03CR) 10Filippo Giunchedi: [C: 031] Absent Redis Diamond collector on Redis slaves [puppet] - 10https://gerrit.wikimedia.org/r/475967 
(https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [09:34:54] (03CR) 10Filippo Giunchedi: [C: 031] Absent Redis Diamond collector on Redis masters [puppet] - 10https://gerrit.wikimedia.org/r/475969 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [09:35:19] !log Stop MySQL on pc1004 to clone pc1010 - T208383 [09:35:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:46] (03PS3) 10Vgutierrez: certcentral: Handle common requirements for certcentral clients [puppet] - 10https://gerrit.wikimedia.org/r/475968 (https://phabricator.wikimedia.org/T207050) [09:38:10] _joe_: good morning. docker-pkg received some long standing hacks I had locally :) https://gerrit.wikimedia.org/r/#/q/project:operations/docker-images/docker-pkg+is:open [09:38:31] <_joe_> hashar: I kinda saw, I'll take a look this morning :) [09:38:46] <_joe_> I'm running benchmarks on php7, so I have dead time :P [09:38:46] PROBLEM - MariaDB Slave IO: pc1 on pc2007 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@pc1004.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on pc1004.eqiad.wmnet (111 Connection refused) [09:38:50] ^ that is me [09:38:52] specially, one to allow dumping log to stdout (--info) and one that logs docker build output at info/error level [09:38:53] PROBLEM - MariaDB Slave IO: pc1 on pc2010 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@pc1004.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on pc1004.eqiad.wmnet (111 Connection refused) [09:39:15] ^ and that [09:39:18] the build log desperately lacked the commands output :) [09:40:33] _joe_: some of my patches lack tests, I gave it a try yesterday evening but eventually gave up :( And probably the README could use some additions [09:40:56] (03CR) 
10jenkins-bot: db-eqiad.php: Depool pc1004 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475971 (https://phabricator.wikimedia.org/T208383) (owner: 10Marostegui) [09:41:35] <_joe_> hashar: well if it's just logging, no tests is kinda ok, but for others, expect -1s :P [09:42:21] (03CR) 10Filippo Giunchedi: "Is the idea to keep maintaining the package ourselves or switch to >=2.7 once Debian has the updated package? I'm asking because if it is " (031 comment) [debs/docker-distribution] (debian/stretch-wikimedia) - 10https://gerrit.wikimedia.org/r/475792 (https://phabricator.wikimedia.org/T210071) (owner: 10Fsero) [09:42:25] <_joe_> also TIL urls for the old ui don't work in the new one [09:43:13] (03CR) 10Giuseppe Lavagetto: [C: 032] tox: add 'venv' to run any command in a venv [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/475479 (owner: 10Hashar) [09:43:35] (03CR) 10Giuseppe Lavagetto: [C: 032] Fix invalid escape sequence in a regular expression [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/475480 (owner: 10Hashar) [09:44:46] (03Merged) 10jenkins-bot: Fix invalid escape sequence in a regular expression [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/475480 (owner: 10Hashar) [09:44:48] (03Merged) 10jenkins-bot: tox: add 'venv' to run any command in a venv [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/475479 (owner: 10Hashar) [09:45:16] (03CR) 10jenkins-bot: Fix invalid escape sequence in a regular expression [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/475480 (owner: 10Hashar) [09:45:39] (03CR) 10jenkins-bot: tox: add 'venv' to run any command in a venv [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/475479 (owner: 10Hashar) [09:47:28] 10Operations, 10Deployments: Make failures on foreachwiki more obvious the deployer - https://phabricator.wikimedia.org/T210474 (10Volans) [09:49:28] (03PS4) 10Vgutierrez: certcentral: Handle common requirements for certcentral clients 
[puppet] - 10https://gerrit.wikimedia.org/r/475968 (https://phabricator.wikimedia.org/T207050) [09:49:30] (03PS2) 10Vgutierrez: install_server: Remove old LE puppetization [puppet] - 10https://gerrit.wikimedia.org/r/475965 (https://phabricator.wikimedia.org/T207050) [09:50:04] (03PS3) 10Vgutierrez: install_server: Remove old LE puppetization [puppet] - 10https://gerrit.wikimedia.org/r/475965 (https://phabricator.wikimedia.org/T207050) [09:51:29] 10Operations, 10SRE-Access-Requests: access request for Jeena Huneidi (deployment, conint-admins, contint-docker) - https://phabricator.wikimedia.org/T210027 (10jcrespo) [09:53:41] 10Operations, 10Analytics, 10Performance-Team, 10Traffic: Only serve debug HTTP headers when x-wikimedia-debug is present - https://phabricator.wikimedia.org/T210484 (10Gilles) [09:53:47] (03CR) 10Gehel: [C: 04-1] "Looks reasonable in principle." [puppet] - 10https://gerrit.wikimedia.org/r/475942 (https://phabricator.wikimedia.org/T210265) (owner: 10Mathew.onipe) [09:54:58] (03PS1) 10Odder: Add localised logos for the Minangkabau Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475974 [09:57:31] !log depooling db1076 due a schema change (T85757) [09:57:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:35] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [09:58:25] (03PS5) 10Vgutierrez: certcentral: Handle common requirements for certcentral clients [puppet] - 10https://gerrit.wikimedia.org/r/475968 (https://phabricator.wikimedia.org/T207050) [09:58:27] (03PS4) 10Vgutierrez: install_server: Remove old LE puppetization [puppet] - 10https://gerrit.wikimedia.org/r/475965 (https://phabricator.wikimedia.org/T207050) [09:59:02] (03CR) 10Banyek: [C: 032] mariadb: depool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475740 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek) [09:59:56] (03PS1) 10Marostegui: filtered_tables.txt: Remove 
ss_total_views column [puppet] - 10https://gerrit.wikimedia.org/r/475975 [10:00:39] (03CR) 10jerkins-bot: [V: 04-1] filtered_tables.txt: Remove ss_total_views column [puppet] - 10https://gerrit.wikimedia.org/r/475975 (owner: 10Marostegui) [10:00:47] (03PS3) 10Banyek: mariadb: depool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475740 (https://phabricator.wikimedia.org/T85757) [10:00:51] (03CR) 10Banyek: [V: 032 C: 032] mariadb: depool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475740 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek) [10:02:40] !log banyek@deploy1001 Synchronized wmf-config/db-eqiad.php: T85757: depool db1076 (duration: 00m 46s) [10:02:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:43] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [10:04:06] (03PS2) 10Marostegui: filtered_tables.txt: Remove ss_total_views column [puppet] - 10https://gerrit.wikimedia.org/r/475975 (https://phabricator.wikimedia.org/T86339) [10:04:56] (03PS4) 10Filippo Giunchedi: WIP rsyslog: udp input json_lines shim [puppet] - 10https://gerrit.wikimedia.org/r/475352 (https://phabricator.wikimedia.org/T205851) [10:06:27] _joe_: for docker-pkg tests, I could use a hand. 
My mind is puzzled by the various wrapping class ( DockerImage ImageFSM etc) [10:06:42] !log executing schema change on db1076 (T85757) [10:06:43] (03PS2) 10Odder: Add localised logos for the Minangkabau Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475974 (https://phabricator.wikimedia.org/T210387) [10:06:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:35] (03PS1) 10Elukey: Add analytics-admins to analytics-tools hosts [puppet] - 10https://gerrit.wikimedia.org/r/475976 [10:07:37] (03CR) 10jenkins-bot: mariadb: depool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475740 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek) [10:07:41] <_joe_> hashar: sure [10:08:10] (03CR) 10Giuseppe Lavagetto: [C: 032] tests: replace deprecated assertEquals [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/475734 (owner: 10Hashar) [10:08:28] (03CR) 10Elukey: [C: 032] Add analytics-admins to analytics-tools hosts [puppet] - 10https://gerrit.wikimedia.org/r/475976 (owner: 10Elukey) [10:08:39] (03CR) 10jenkins-bot: tests: replace deprecated assertEquals [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/475734 (owner: 10Hashar) [10:09:36] (03CR) 10Filippo Giunchedi: "> Patch Set 3:" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/475352 (https://phabricator.wikimedia.org/T205851) (owner: 10Filippo Giunchedi) [10:12:58] (03PS2) 10Gilles: Add HTTP/2 priorities test to speed tests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475345 (https://phabricator.wikimedia.org/T210141) [10:13:08] !log repooling db1076 after schema change (T85757) [10:13:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:12] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [10:13:24] (03PS1) 10Banyek: Revert "mariadb: depool db1076" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475977 [10:13:58] (03CR) 
10jerkins-bot: [V: 04-1] Add HTTP/2 priorities test to speed tests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475345 (https://phabricator.wikimedia.org/T210141) (owner: 10Gilles) [10:15:50] (03CR) 10Banyek: [C: 032] Revert "mariadb: depool db1076" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475977 (owner: 10Banyek) [10:16:00] (03CR) 10Banyek: [V: 032 C: 032] Revert "mariadb: depool db1076" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475977 (owner: 10Banyek) [10:16:53] (03Merged) 10jenkins-bot: Revert "mariadb: depool db1076" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475977 (owner: 10Banyek) [10:17:44] (03CR) 10Amire80: [C: 031] Add localised logos for the Minangkabau Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475974 (https://phabricator.wikimedia.org/T210387) (owner: 10Odder) [10:18:12] !log banyek@deploy1001 Synchronized wmf-config/db-eqiad.php: T85757: repool db1076 (duration: 00m 45s) [10:18:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:17] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [10:19:24] (03PS1) 10Vgutierrez: tendril: sslcert::dhparam needs to be included especifically now [puppet] - 10https://gerrit.wikimedia.org/r/475978 (https://phabricator.wikimedia.org/T207050) [10:19:26] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I think we should go ahead and make --pull the default behaviour, and bump versions accordingly to the introduction of a breaking change w" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/475843 (https://phabricator.wikimedia.org/T210438) (owner: 10Hashar) [10:19:28] (03PS1) 10Vgutierrez: librenms: sslcert::dhparam needs to be included especifically now [puppet] - 10https://gerrit.wikimedia.org/r/475979 (https://phabricator.wikimedia.org/T207050) [10:19:30] (03PS1) 10Vgutierrez: netbox: sslcert::dhparam needs to be included especifically now [puppet] - 
10https://gerrit.wikimedia.org/r/475980 (https://phabricator.wikimedia.org/T207050) [10:19:32] (03PS1) 10Vgutierrez: archiva: sslcert::dhparam needs to be included especifically now [puppet] - 10https://gerrit.wikimedia.org/r/475981 (https://phabricator.wikimedia.org/T207050) [10:19:35] (03PS3) 10Gilles: Add HTTP/2 priorities test to speed tests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475345 (https://phabricator.wikimedia.org/T210141) [10:19:48] <_joe_> vgutierrez: s/^es/s/g :P [10:19:54] !log depooling db1090 due a schema change (T85757) [10:19:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:14] (03CR) 10jerkins-bot: [V: 04-1] tendril: sslcert::dhparam needs to be included especifically now [puppet] - 10https://gerrit.wikimedia.org/r/475978 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [10:20:21] (03CR) 10Banyek: [C: 032] mariadb: depool db1090:3312 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475741 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek) [10:20:33] PROBLEM - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is CRITICAL: 523 ge 130 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [10:20:35] (03CR) 10Banyek: [V: 032 C: 032] mariadb: depool db1090:3312 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475741 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek) [10:20:48] (03CR) 10jerkins-bot: [V: 04-1] librenms: sslcert::dhparam needs to be included especifically now [puppet] - 10https://gerrit.wikimedia.org/r/475979 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [10:20:57] (03CR) 10jenkins-bot: Revert "mariadb: depool db1076" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475977 (owner: 10Banyek) [10:21:45] (03CR) 10Fsero: "> Patch Set 1:" (031 comment) [debs/docker-distribution] (debian/stretch-wikimedia) - 10https://gerrit.wikimedia.org/r/475792 
(https://phabricator.wikimedia.org/T210071) (owner: 10Fsero) [10:22:34] (03CR) 10jerkins-bot: [V: 04-1] archiva: sslcert::dhparam needs to be included especifically now [puppet] - 10https://gerrit.wikimedia.org/r/475981 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [10:23:36] (03PS6) 10Banyek: mariadb: depool db1090:3312 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475741 (https://phabricator.wikimedia.org/T85757) [10:23:40] (03CR) 10Banyek: [V: 032 C: 032] mariadb: depool db1090:3312 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475741 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek) [10:24:18] (03PS3) 10Alexandros Kosiaris: Make restbase active/active [puppet] - 10https://gerrit.wikimedia.org/r/467742 [10:24:25] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Make restbase active/active [puppet] - 10https://gerrit.wikimedia.org/r/467742 (owner: 10Alexandros Kosiaris) [10:24:30] (03PS7) 10Muehlenhoff: Script to generate service principals/keytabs (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/470566 [10:24:42] (03Merged) 10jenkins-bot: mariadb: depool db1090:3312 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475741 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek) [10:26:13] !log banyek@deploy1001 Synchronized wmf-config/db-eqiad.php: T85757: depool db1090 (duration: 00m 46s) [10:26:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:16] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [10:26:24] 10Operations: Audit "misc" cluster hosts - https://phabricator.wikimedia.org/T210486 (10fgiunchedi) [10:26:29] (03PS15) 10Arturo Borrero Gonzalez: openstack: rearrange repos, packages and pinnings [puppet] - 10https://gerrit.wikimedia.org/r/475326 (https://phabricator.wikimedia.org/T209948) [10:29:40] (03PS4) 10Gilles: Add HTTP/2 priorities test to speed tests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475345 
(https://phabricator.wikimedia.org/T210141) [10:30:31] !log T209948 schedule 2h icinga downtime in all WMCS hw servers [10:30:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:34] T209948: CloudVPS: puppet organization for openstack server/client packages, repos and pinning - https://phabricator.wikimedia.org/T209948 [10:32:23] (03PS1) 10Elukey: Allow analytics-admins to restart daemons with systemctl [puppet] - 10https://gerrit.wikimedia.org/r/475984 [10:32:35] (03CR) 10Gilles: [C: 032] Add HTTP/2 priorities test to speed tests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475345 (https://phabricator.wikimedia.org/T210141) (owner: 10Gilles) [10:33:54] (03Merged) 10jenkins-bot: Add HTTP/2 priorities test to speed tests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475345 (https://phabricator.wikimedia.org/T210141) (owner: 10Gilles) [10:34:09] (03CR) 10jenkins-bot: mariadb: depool db1090:3312 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475741 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek) [10:34:11] (03CR) 10jenkins-bot: Add HTTP/2 priorities test to speed tests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475345 (https://phabricator.wikimedia.org/T210141) (owner: 10Gilles) [10:36:17] !log T209948 disable puppet in all WMCS hw servers [10:36:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:20] T209948: CloudVPS: puppet organization for openstack server/client packages, repos and pinning - https://phabricator.wikimedia.org/T209948 [10:36:54] (03CR) 10Arturo Borrero Gonzalez: [C: 032] openstack: rearrange repos, packages and pinnings [puppet] - 10https://gerrit.wikimedia.org/r/475326 (https://phabricator.wikimedia.org/T209948) (owner: 10Arturo Borrero Gonzalez) [10:41:07] (03CR) 10Muehlenhoff: [C: 032] Add AAAA record for labmon1002 [dns] - 10https://gerrit.wikimedia.org/r/475954 (owner: 10Muehlenhoff) [10:42:00] !log gilles@deploy1001 Synchronized 
docroot/wikipedia.org/speed-tests/http2priorities: T210141 HTTP/2 prioritie speed test (duration: 00m 47s) [10:42:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:04] T210141: Test our production stack's HTTP/2 priority support - https://phabricator.wikimedia.org/T210141 [10:42:52] (03PS1) 10Jcrespo: admin: Add access to Jeena Huneidi to the production cluster [puppet] - 10https://gerrit.wikimedia.org/r/475986 (https://phabricator.wikimedia.org/T210027) [10:43:34] (03PS2) 10Jcrespo: admin: Add Jeena Huneidi access to the production cluster [puppet] - 10https://gerrit.wikimedia.org/r/475986 (https://phabricator.wikimedia.org/T210027) [10:45:38] (03CR) 10Giuseppe Lavagetto: [C: 032] cli: allow INFO logging to stdout [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/475758 (owner: 10Hashar) [10:46:24] (03Merged) 10jenkins-bot: cli: allow INFO logging to stdout [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/475758 (owner: 10Hashar) [10:46:41] _joe_: thanks :) [10:46:51] (03CR) 10jenkins-bot: cli: allow INFO logging to stdout [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/475758 (owner: 10Hashar) [10:49:51] (03CR) 10Jcrespo: "Please verify the information is correct and double check the public ssh key is correct." 
[puppet] - 10https://gerrit.wikimedia.org/r/475986 (https://phabricator.wikimedia.org/T210027) (owner: 10Jcrespo) [10:51:36] !log repooling db1090 due a schema change (T85757) [10:51:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:39] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [10:51:46] !log repooling db1090 after schema change (T85757) [10:51:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:02] (03PS1) 10Banyek: Revert "mariadb: depool db1090:3312" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475988 [10:53:35] (03CR) 10Banyek: [C: 032] Revert "mariadb: depool db1090:3312" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475988 (owner: 10Banyek) [10:53:45] (03CR) 10Banyek: [V: 032 C: 032] Revert "mariadb: depool db1090:3312" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475988 (owner: 10Banyek) [10:54:49] (03PS3) 10Muehlenhoff: Remove Diamond from DB roles [puppet] - 10https://gerrit.wikimedia.org/r/467264 (https://phabricator.wikimedia.org/T183454) [10:55:24] !log banyek@deploy1001 Synchronized wmf-config/db-eqiad.php: T85757: repool db1090 (duration: 00m 46s) [10:55:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:25] (03CR) 10Muehlenhoff: [C: 032] Remove Diamond from DB roles [puppet] - 10https://gerrit.wikimedia.org/r/467264 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [10:59:04] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I like the idea (and thanks for taking the time to do it!), but I'd like to see my comments addressed." 
(033 comments) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/475779 (owner: 10Hashar) [10:59:35] !log executing schema change on db1095 (T85757) [10:59:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:38] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [11:00:29] (host not in service no depooling needed) [11:00:48] (03CR) 10jenkins-bot: Revert "mariadb: depool db1090:3312" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475988 (owner: 10Banyek) [11:01:56] (03CR) 10Giuseppe Lavagetto: [C: 032] tox: allow passing options to pytest environments [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/475885 (owner: 10Hashar) [11:05:24] 10Operations, 10SRE-Access-Requests: Requesting access to Jupyter notebook / analytics-privatedata-users for jgleeson - https://phabricator.wikimedia.org/T208432 (10jcrespo) [11:06:30] PROBLEM - Check systemd state on db2071 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:07:22] (03CR) 10Volans: "Code works and looks mostly ok apart some complex import-based logic. See comments inline." (038 comments) [software] - 10https://gerrit.wikimedia.org/r/475933 (owner: 10Faidon Liambotis) [11:07:38] RECOVERY - Check systemd state on db2071 is OK: OK - running: The system is fully operational [11:09:24] PROBLEM - Check systemd state on db2082 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[11:09:54] (03PS1) 10Arturo Borrero Gonzalez: openstack: relax dependency on virtual-mysql-client [puppet] - 10https://gerrit.wikimedia.org/r/475990 (https://phabricator.wikimedia.org/T209948) [11:10:36] RECOVERY - Check systemd state on db2082 is OK: OK - running: The system is fully operational [11:11:46] (03CR) 10DCausse: [C: 04-1] elasticsearch: configure LVS endpoint for new codfw clusters (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/475753 (https://phabricator.wikimedia.org/T207195) (owner: 10Gehel) [11:12:43] (03PS2) 10Hashar: tox: allow passing options to pytest environments [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/475885 [11:13:12] 10Operations, 10SRE-Access-Requests: Requesting access to Jupyter notebook / analytics-privatedata-users for jgleeson - https://phabricator.wikimedia.org/T208432 (10jcrespo) This now can proceed after the 3 business days wait, @jgleeson please be around on Friday to test the access. As far as I can see, WMF LD... [11:13:24] 10Operations, 10SRE-Access-Requests: Requesting access to Jupyter notebook / analytics-privatedata-users for jgleeson - https://phabricator.wikimedia.org/T208432 (10jcrespo) a:03jcrespo [11:14:08] PROBLEM - Check systemd state on db1061 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:14:21] (03CR) 10Hashar: "Gerrit rejected CI request to get the change merged. That is because the repository in Gerrit does not allow content merge and tox.ini got" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/475885 (owner: 10Hashar) [11:14:55] (03PS2) 10Arturo Borrero Gonzalez: openstack: relax dependency on virtual-mysql-client [puppet] - 10https://gerrit.wikimedia.org/r/475990 (https://phabricator.wikimedia.org/T209948) [11:15:03] (03CR) 10DCausse: [C: 04-1] "and should you add something in conftool-data/service/services.yaml as well ?" 
[puppet] - 10https://gerrit.wikimedia.org/r/475753 (https://phabricator.wikimedia.org/T207195) (owner: 10Gehel) [11:15:14] PROBLEM - Check systemd state on db1116 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:15:37] (03CR) 10Alex Monk: [C: 04-1] "Per discussion on IRC, change the commit message to indicate we're not sure why this is here but it was included in the old puppetisation." [puppet] - 10https://gerrit.wikimedia.org/r/475968 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [11:16:24] RECOVERY - Check systemd state on db1116 is OK: OK - running: The system is fully operational [11:16:30] RECOVERY - Check systemd state on db1061 is OK: OK - running: The system is fully operational [11:18:41] (03CR) 10Arturo Borrero Gonzalez: [C: 032] "Compilation is correct: https://puppet-compiler.wmflabs.org/compiler1002/13722/" [puppet] - 10https://gerrit.wikimedia.org/r/475990 (https://phabricator.wikimedia.org/T209948) (owner: 10Arturo Borrero Gonzalez) [11:22:49] (03PS2) 10Fsero: buster package modified to customize it for WMF and for build 2.7 [debs/docker-distribution] (debian/stretch-wikimedia) - 10https://gerrit.wikimedia.org/r/475792 (https://phabricator.wikimedia.org/T210071) [11:23:22] PROBLEM - Check systemd state on db1102 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:23:26] PROBLEM - puppet last run on db1102 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. 
Failed resources (up to 3 shown): Package[diamond],Package[python-diamond] [11:23:40] (03PS1) 10Arturo Borrero Gonzalez: openstack: serverpackages: fix path to ubuntu cloud repo key [puppet] - 10https://gerrit.wikimedia.org/r/475991 (https://phabricator.wikimedia.org/T209948) [11:24:19] (03CR) 10Arturo Borrero Gonzalez: [C: 032] openstack: serverpackages: fix path to ubuntu cloud repo key [puppet] - 10https://gerrit.wikimedia.org/r/475991 (https://phabricator.wikimedia.org/T209948) (owner: 10Arturo Borrero Gonzalez) [11:24:32] RECOVERY - Check systemd state on db1102 is OK: OK - running: The system is fully operational [11:28:36] RECOVERY - puppet last run on db1102 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:28:55] !log depooling db1103 due a schema change (T85757) [11:29:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:01] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [11:29:07] (03PS3) 10Banyek: mariadb: depool db1103 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475742 (https://phabricator.wikimedia.org/T85757) [11:30:21] (03CR) 10Banyek: [C: 032] mariadb: depool db1103 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475742 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek) [11:30:39] (03CR) 10Banyek: [V: 032 C: 032] mariadb: depool db1103 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475742 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek) [11:30:59] (03PS2) 10Gehel: elasticsearch: configure LVS endpoint for new codfw clusters [puppet] - 10https://gerrit.wikimedia.org/r/475753 (https://phabricator.wikimedia.org/T207195) [11:32:09] !log banyek@deploy1001 Synchronized wmf-config/db-eqiad.php: T85757: depool db1103 (duration: 00m 46s) [11:32:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:14] (03PS3) 10Gehel: elasticsearch: configure LVS endpoint for new codfw 
clusters [puppet] - 10https://gerrit.wikimedia.org/r/475753 (https://phabricator.wikimedia.org/T207195) [11:32:39] !log executing schema change on db1103:3312 (T85757) [11:32:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:25] (03CR) 10Gehel: "> Patch Set 1:" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/475753 (https://phabricator.wikimedia.org/T207195) (owner: 10Gehel) [11:34:40] (03PS1) 10Ema: cp1008: point varnish-fe to ATS host [puppet] - 10https://gerrit.wikimedia.org/r/475992 (https://phabricator.wikimedia.org/T210141) [11:34:59] (03CR) 10jenkins-bot: mariadb: depool db1103 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475742 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek) [11:37:47] (03CR) 10Ema: [C: 032] cp1008: point varnish-fe to ATS host [puppet] - 10https://gerrit.wikimedia.org/r/475992 (https://phabricator.wikimedia.org/T210141) (owner: 10Ema) [11:43:44] (03PS1) 10Jcrespo: admin: Add Mvolz to ldap-only users [puppet] - 10https://gerrit.wikimedia.org/r/475995 (https://phabricator.wikimedia.org/T209901) [11:50:28] !log repooling db1103 after schema change (T85757) [11:50:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:31] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [11:50:38] (03PS1) 10Banyek: Revert "mariadb: depool db1103" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475997 [11:54:29] (03CR) 10Banyek: [C: 032] Revert "mariadb: depool db1103" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475997 (owner: 10Banyek) [11:54:31] (03CR) 10Banyek: [V: 032 C: 032] Revert "mariadb: depool db1103" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475997 (owner: 10Banyek) [11:56:12] !log banyek@deploy1001 Synchronized wmf-config/db-eqiad.php: T85757: repool db1103 (duration: 00m 46s) [11:56:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:16] 
T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [11:57:08] (03PS1) 10Jcrespo: admin: Clarify what the absent group does [puppet] - 10https://gerrit.wikimedia.org/r/475998 [11:59:48] (03CR) 10Muehlenhoff: [C: 031] "Looks good, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/475998 (owner: 10Jcrespo) [12:00:05] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: It is that lovely time of the day again! You are hereby commanded to deploy European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181127T1200). [12:00:05] odder and mobrovac: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:24] thnx jouncebot but i already have a tshirt [12:01:47] (03PS2) 10Jcrespo: admin: Add Mvolz to ldap-only users [puppet] - 10https://gerrit.wikimedia.org/r/475995 (https://phabricator.wikimedia.org/T209901) [12:02:39] (03CR) 10jenkins-bot: Revert "mariadb: depool db1103" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475997 (owner: 10Banyek) [12:03:00] (03CR) 10Giuseppe Lavagetto: "If you want to add conftool data and the lvs ip in one go, you also need to have servers pooled in the cluster, or pybal will find the ser" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/475753 (https://phabricator.wikimedia.org/T207195) (owner: 10Gehel) [12:03:16] is somebody swatting today or what? [12:03:18] hashar: ? 
[12:03:54] (03CR) 10Jcrespo: [C: 032] admin: Add Mvolz to ldap-only users [puppet] - 10https://gerrit.wikimedia.org/r/475995 (https://phabricator.wikimedia.org/T209901) (owner: 10Jcrespo) [12:06:12] (03PS2) 10Jcrespo: admin: Clarify what the absent group does [puppet] - 10https://gerrit.wikimedia.org/r/475998 [12:06:20] (03CR) 10Faidon Liambotis: Initial commit of quotervwr (038 comments) [software] - 10https://gerrit.wikimedia.org/r/475933 (owner: 10Faidon Liambotis) [12:06:25] (03PS2) 10Faidon Liambotis: Initial commit of quotereviewer [software] - 10https://gerrit.wikimedia.org/r/475933 [12:06:58] ok, i'm proceeding with my own changes as they cannot be tested anyway [12:07:23] mobrovac: oops, I forgot about swat today :) [12:07:32] but feel free to deploy your changes [12:07:41] banyek: i saw you pooling and depooling servers, are you done? can i take over deployments? [12:07:43] kk thnz zeljkof [12:07:46] odder: around for swat? [12:08:09] sure, you can proceed, I grab some food, and continue after the train finished [12:08:14] (03CR) 10Jcrespo: [C: 032] admin: Clarify what the absent group does [puppet] - 10https://gerrit.wikimedia.org/r/475998 (owner: 10Jcrespo) [12:08:16] moborovac^ [12:08:28] kk thnx banyek [12:08:52] (03PS3) 10Mobrovac: RunSingleJob: Check that JobExecutor has been loaded [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474885 (https://phabricator.wikimedia.org/T208922) [12:09:05] mobrovac: let me know when you're done, looks like there is just one more patch (logos) to deploy [12:09:15] or feel free to deploy it yourself :) [12:09:27] k will ping you zeljkof :) [12:09:38] oh the config change? 
[12:09:42] yeah i can do that one too [12:09:55] but odder hasn't replied [12:10:01] so let's wait if they are around [12:10:05] if not, we should skip it [12:12:14] Mornin' everyone [12:12:43] Just throwing https://gerrit.wikimedia.org/r/#/c/475974/ out there [12:12:55] If people are deploying things in this window [12:12:58] hmmmmm [12:12:59] (03PS1) 10Gilles: Add variant of HTTP/2 priorities test pointing to upload.* [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476000 (https://phabricator.wikimedia.org/T210141) [12:13:04] zeljkof: Your branch is ahead of 'origin/wmf/1.33.0-wmf.4' by 5 commits [12:13:09] do we know why? [12:13:12] (03PS2) 10Gilles: Add variant of HTTP/2 priorities test pointing to upload.* [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476000 (https://phabricator.wikimedia.org/T210141) [12:13:22] ah we do [12:13:28] sec patches [12:13:33] k [12:13:46] mobrovac: hashar is on train, but yes, probably security patches [12:13:56] odder: yup, we'll get to yours in a bit [12:14:08] no worries mobrovac, got here a little late [12:14:33] let me know when you guys are done with the SWAT window, I have a tiny thing to deploy https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/476000/ [12:15:40] k gilles, ETA 20 mins due to jenkins being slow [12:15:43] (03PS1) 10Muehlenhoff: Remove Diamond on additional DB roles [puppet] - 10https://gerrit.wikimedia.org/r/476001 [12:15:51] what else is new? 
:p [12:16:02] :D [12:16:20] (03CR) 10jerkins-bot: [V: 04-1] Remove Diamond on additional DB roles [puppet] - 10https://gerrit.wikimedia.org/r/476001 (owner: 10Muehlenhoff) [12:16:51] (03CR) 10Mobrovac: [C: 032] "1 + 1 + 1 == 2 :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474885 (https://phabricator.wikimedia.org/T208922) (owner: 10Mobrovac) [12:17:57] (03Merged) 10jenkins-bot: RunSingleJob: Check that JobExecutor has been loaded [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474885 (https://phabricator.wikimedia.org/T208922) (owner: 10Mobrovac) [12:19:42] (03PS3) 10Mobrovac: Add localised logos for the Minangkabau Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475974 (https://phabricator.wikimedia.org/T210387) (owner: 10Odder) [12:19:54] !log mobrovac@deploy1001 Synchronized rpc/RunSingleJob.php: RunSingleJob: Check that JobExecutor has been loaded - T208922 (duration: 00m 47s) [12:19:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:59] T208922: PHP Fatal Error: Class undefined: JobExecutor (jobrunners try to run labswiki jobs) - https://phabricator.wikimedia.org/T208922 [12:20:35] zeljkof: jenkins seems to be blocked on my core patch, so i'll proceed with odder's while waiting for it [12:20:41] odder: yours is next [12:21:04] (03CR) 10Mobrovac: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475974 (https://phabricator.wikimedia.org/T210387) (owner: 10Odder) [12:22:10] (03Merged) 10jenkins-bot: Add localised logos for the Minangkabau Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475974 (https://phabricator.wikimedia.org/T210387) (owner: 10Odder) [12:22:57] odder: is it safe to sync the files one by one? is there a preferred order? [12:23:03] i don't want to do a full scap for this [12:23:17] you can sync-file the folder [12:23:43] and it should do the right thing (tm) [12:23:54] ha! 
[12:24:00] i like the use of "should" there [12:24:04] :) [12:24:07] ok let's try that then [12:24:16] I've used it twice, from my limited experience it works fine [12:25:14] mobrovac: I'm not sure, the only change from the standard way that I know is that you have to purge the URLs in Varnish [12:25:18] https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#Purging [12:25:49] !log mobrovac@deploy1001 Synchronized static/images/project-logos: Add localised logos for the Minangkabau Wikipedia - T210387 (duration: 00m 47s) [12:25:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:53] T210387: Change Minang Wikipedia logo - https://phabricator.wikimedia.org/T210387 [12:26:25] odder: have the url that needs purging handy? [12:26:46] i just need the domain really [12:27:50] the purge has to be done on enwiki [12:28:05] static is shared [12:28:25] across all domains. that doc explains it above the command [12:28:55] (03PS1) 10Arturo Borrero Gonzalez: toolforge: introduce bastion systemd-based resource control [puppet] - 10https://gerrit.wikimedia.org/r/476003 (https://phabricator.wikimedia.org/T210098) [12:29:00] echo "https://en.wikipedia.org/static/images/project-logos/minwiki.png" | mwscript purgeList.php [12:29:03] and so on [12:29:05] done [12:29:08] odder: please check [12:29:13] (03CR) 10jenkins-bot: RunSingleJob: Check that JobExecutor has been loaded [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474885 (https://phabricator.wikimedia.org/T208922) (owner: 10Mobrovac) [12:29:15] (03CR) 10jenkins-bot: Add localised logos for the Minangkabau Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475974 (https://phabricator.wikimedia.org/T210387) (owner: 10Odder) [12:29:27] mobrovac: Everything looks in perfect order, thanks very much [12:29:30] gilles: you can go ahead with your config change, i'm still waiting on jenking for my core patch ... 
[12:29:35] thanks [12:29:40] (03CR) 10Gilles: [C: 032] Add variant of HTTP/2 priorities test pointing to upload.* [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476000 (https://phabricator.wikimedia.org/T210141) (owner: 10Gilles) [12:30:43] (03Merged) 10jenkins-bot: Add variant of HTTP/2 priorities test pointing to upload.* [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476000 (https://phabricator.wikimedia.org/T210141) (owner: 10Gilles) [12:31:02] (03PS3) 10Jcrespo: admin: Add Jeena Huneidi access to the production cluster [puppet] - 10https://gerrit.wikimedia.org/r/475986 (https://phabricator.wikimedia.org/T210027) [12:31:04] (03PS1) 10Jcrespo: admin: Add jgleeson access to production cluster [puppet] - 10https://gerrit.wikimedia.org/r/476004 (https://phabricator.wikimedia.org/T208432) [12:32:35] (03PS1) 10Ema: cp1008: move hiera settings to cache::canary role [puppet] - 10https://gerrit.wikimedia.org/r/476005 [12:32:55] !log gilles@deploy1001 Synchronized docroot/wikipedia.org/speed-tests/http2priorities/upload.wikimedia.org.html: T210141 Add variant of HTTP/2 priorities test pointing to upload (duration: 00m 46s) [12:32:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:58] T210141: Test our production stack's HTTP/2 priority support - https://phabricator.wikimedia.org/T210141 [12:33:37] (03CR) 10Jcrespo: "Please verify the information here is correct and the ssh key corresponds to Jack Gleeson" [puppet] - 10https://gerrit.wikimedia.org/r/476004 (https://phabricator.wikimedia.org/T208432) (owner: 10Jcrespo) [12:33:40] (03CR) 10Arturo Borrero Gonzalez: [C: 032] toolforge: introduce bastion systemd-based resource control [puppet] - 10https://gerrit.wikimedia.org/r/476003 (https://phabricator.wikimedia.org/T210098) (owner: 10Arturo Borrero Gonzalez) [12:34:30] zeljkof: the verification failed due to an unrelated error, do i have your ok to force a V+2 ? 
[12:34:37] https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/475893/ [12:34:53] mobrovac: I'm looking [12:34:55] the vendor test failed which has nothing to do with the change [12:35:14] hashar: do you have opinions? :) ^ [12:35:15] (03PS1) 10Elukey: profile::hive::client: fix typo in parameter [puppet] - 10https://gerrit.wikimedia.org/r/476006 [12:35:21] gilles: i assume you are done? [12:35:36] mobrovac: yes, thanks [12:35:40] k [12:36:13] mobrovac: yes, looks like npm package cache problem [12:36:19] k [12:36:21] thnx [12:36:32] if you think the commit is fine, merge it [12:36:51] it is, and it's needed asap :) [12:37:03] (03CR) 10Elukey: [C: 032] profile::hive::client: fix typo in parameter [puppet] - 10https://gerrit.wikimedia.org/r/476006 (owner: 10Elukey) [12:37:10] (03PS2) 10Elukey: profile::hive::client: fix typo in parameter [puppet] - 10https://gerrit.wikimedia.org/r/476006 [12:37:13] (03CR) 10Elukey: [V: 032 C: 032] profile::hive::client: fix typo in parameter [puppet] - 10https://gerrit.wikimedia.org/r/476006 (owner: 10Elukey) [12:37:55] (03PS2) 10Muehlenhoff: Remove IRCDStats collector [puppet] - 10https://gerrit.wikimedia.org/r/475957 (https://phabricator.wikimedia.org/T183454) [12:39:26] (03PS2) 10Ema: cp1008: move hiera settings to cache::canary role [puppet] - 10https://gerrit.wikimedia.org/r/476005 [12:39:46] !log mobrovac@deploy1001 Synchronized php-1.33.0-wmf.4/includes/page/WikiPage.php: Convert $archivedRevisionCount to integer - T210013 T210451 (duration: 00m 47s) [12:39:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:51] T210013: EventBus extension started emitting rev_count as a string - https://phabricator.wikimedia.org/T210013 [12:39:52] T210451: Kafka eqiad.mediawiki.page-delete topic is empty - https://phabricator.wikimedia.org/T210451 [12:39:58] ok i'm done [12:40:00] swat is done too [12:41:57] 10Operations, 10Analytics, 10EventBus, 10WMF-JobQueue, and 5 others: Kafka 
eqiad.mediawiki.page-delete topic is empty - https://phabricator.wikimedia.org/T210451 (10mobrovac) The fix has been deployed, delete events should start flowing again, so resolving. Let's reopen the ticket if that does not occur. [12:42:01] (03CR) 10jenkins-bot: Add variant of HTTP/2 priorities test pointing to upload.* [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476000 (https://phabricator.wikimedia.org/T210141) (owner: 10Gilles) [12:42:03] 10Operations, 10Analytics, 10EventBus, 10WMF-JobQueue, and 5 others: Kafka eqiad.mediawiki.page-delete topic is empty - https://phabricator.wikimedia.org/T210451 (10mobrovac) 05Open>03Resolved a:03mobrovac [12:42:21] (03CR) 10Muehlenhoff: [C: 032] Remove IRCDStats collector [puppet] - 10https://gerrit.wikimedia.org/r/475957 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [12:42:52] (03PS1) 10Arturo Borrero Gonzalez: toolforge: resource control: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/476007 (https://phabricator.wikimedia.org/T210098) [12:44:13] (03PS2) 10Arturo Borrero Gonzalez: toolforge: resource control: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/476007 (https://phabricator.wikimedia.org/T210098) [12:44:45] (03PS1) 10Giuseppe Lavagetto: puppet-merge: allow only showing diffs without merging [puppet] - 10https://gerrit.wikimedia.org/r/476008 [12:45:15] 10Operations, 10monitoring, 10Patch-For-Review, 10User-fgiunchedi: Deprovision Diamond collectors no longer in use - https://phabricator.wikimedia.org/T183454 (10MoritzMuehlenhoff) [12:45:19] (03CR) 10Arturo Borrero Gonzalez: [C: 032] toolforge: resource control: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/476007 (https://phabricator.wikimedia.org/T210098) (owner: 10Arturo Borrero Gonzalez) [12:55:59] (03CR) 10Thiemo Kreuz (WMDE): [C: 031] Make AdvancedSearch default on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475759 (https://phabricator.wikimedia.org/T207639) (owner: 10WMDE-Fisch) [12:56:39] 
(03PS1) 10GTirloni: prometheus: collect DRBD stats on WMCS storage servers [puppet] - 10https://gerrit.wikimedia.org/r/476009 (https://phabricator.wikimedia.org/T208446) [12:56:51] 10Operations, 10monitoring, 10User-CDanis: graph server temperature metrics - https://phabricator.wikimedia.org/T209863 (10CDanis) Late last week I figured out scraping aggregated data from Prometheus as a CSV and fed that into Plotly: https://plot.ly/~cdanis-wmf/1/#/ Still to do: - Join with node_hwmon... [12:59:24] zeljkof: sorry back from lunch [12:59:52] hashar: no problem, looks like everything is fine [13:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181127T1300) [13:00:15] zeljkof: the build fails due to some npmjs.org checksum mismatch [13:00:21] known bug somewhere :/ [13:01:22] (03CR) 10GTirloni: [C: 032] prometheus: collect DRBD stats on WMCS storage servers [puppet] - 10https://gerrit.wikimedia.org/r/476009 (https://phabricator.wikimedia.org/T208446) (owner: 10GTirloni) [13:01:28] (03PS2) 10GTirloni: prometheus: collect DRBD stats on WMCS storage servers [puppet] - 10https://gerrit.wikimedia.org/r/476009 (https://phabricator.wikimedia.org/T208446) [13:08:39] (03CR) 10Ema: [C: 04-1] "We need https://gerrit.wikimedia.org/r/c/operations/puppet/+/475500 to be merged first, otherwise cache::text::nodes defined in common/cac" [puppet] - 10https://gerrit.wikimedia.org/r/476005 (owner: 10Ema) [13:09:10] !log proton deploying 9efc07238f8dee62385799791718f4d57f754073 [13:09:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:21] !log depooling db1105 due a schema change (T85757) [13:14:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:24] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [13:14:26] (03CR) 10Banyek: [C: 032] mariadb: depool db1105 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/475743 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek) [13:14:44] (03CR) 10Banyek: [V: 032 C: 032] mariadb: depool db1105 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475743 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek) [13:14:58] (03PS1) 10Elukey: hue: add kerberos config support [puppet/cdh] - 10https://gerrit.wikimedia.org/r/476011 [13:15:13] !log pmiazga@deploy1001 Started deploy [proton/deploy@9efc072]: Proton: Rewrite Queue to promise-way flow (T204055) [13:15:17] (03CR) 10jerkins-bot: [V: 04-1] hue: add kerberos config support [puppet/cdh] - 10https://gerrit.wikimedia.org/r/476011 (owner: 10Elukey) [13:15:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:20] T204055: Consistently represent asynchronous code execution - https://phabricator.wikimedia.org/T204055 [13:17:00] (03PS5) 10Banyek: mariadb: depool db1105 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475743 (https://phabricator.wikimedia.org/T85757) [13:17:05] (03CR) 10Banyek: [V: 032 C: 032] mariadb: depool db1105 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475743 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek) [13:17:12] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4fullscreenrefresh=1morgId=1 [13:17:56] !log pmiazga@deploy1001 Finished deploy [proton/deploy@9efc072]: Proton: Rewrite Queue to promise-way flow (T204055) (duration: 02m 43s) [13:17:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:10] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3fullscreenrefresh=1morgId=1 [13:18:22] !log banyek@deploy1001 Synchronized wmf-config/db-eqiad.php: T85757: depool 
db1105 (duration: 00m 46s) [13:18:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:42] !log executing schema change on db1105 (T85757) [13:18:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:05] not sure what's going on with the ulsfo/upload alert, but ulsfo is still depooled in DNS from circuit maintenance windows earlier today [13:19:21] (03PS2) 10Elukey: hue: add kerberos config support [puppet/cdh] - 10https://gerrit.wikimedia.org/r/476011 [13:20:36] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/compiler1002/13729/ - a wonderful noop" [puppet/cdh] - 10https://gerrit.wikimedia.org/r/476011 (owner: 10Elukey) [13:20:50] (03CR) 10Elukey: [V: 032 C: 032] hue: add kerberos config support [puppet/cdh] - 10https://gerrit.wikimedia.org/r/476011 (owner: 10Elukey) [13:20:53] (03CR) 10jenkins-bot: mariadb: depool db1105 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475743 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek) [13:22:15] (03PS1) 10Arturo Borrero Gonzalez: toolforge: bastion: split slice config for root/users [puppet] - 10https://gerrit.wikimedia.org/r/476012 (https://phabricator.wikimedia.org/T210098) [13:22:56] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3fullscreenrefresh=1morgId=1 [13:23:13] (03CR) 10Arturo Borrero Gonzalez: [C: 032] toolforge: bastion: split slice config for root/users [puppet] - 10https://gerrit.wikimedia.org/r/476012 (https://phabricator.wikimedia.org/T210098) (owner: 10Arturo Borrero Gonzalez) [13:28:21] 10Operations, 10Icinga, 10decommission, 10monitoring: decom einsteinium - https://phabricator.wikimedia.org/T209738 (10GTirloni) Just a heads up that einsteinium is still running Icinga and contacting servers: ` Nov 27 13:25:11 labstore1004 nrpe[18896]: Host 208.80.155.119 is not allowed to talk to us! 
N... [13:28:51] (03PS1) 10Elukey: profile::hue: add support for Kerberos [puppet] - 10https://gerrit.wikimedia.org/r/476013 [13:29:36] RECOVERY - MariaDB Slave IO: pc1 on pc2010 is OK: OK slave_io_state Slave_IO_Running: Yes [13:29:56] RECOVERY - MariaDB Slave IO: pc1 on pc2007 is OK: OK slave_io_state Slave_IO_Running: Yes [13:30:31] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/compiler1002/13730/ - noop" [puppet] - 10https://gerrit.wikimedia.org/r/476013 (owner: 10Elukey) [13:32:20] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3fullscreenrefresh=1morgId=1 [13:32:30] (03PS2) 10Vgutierrez: tendril: sslcert::dhparam needs to be included especifically now [puppet] - 10https://gerrit.wikimedia.org/r/475978 (https://phabricator.wikimedia.org/T207050) [13:32:32] (03PS2) 10Vgutierrez: librenms: sslcert::dhparam needs to be included especifically now [puppet] - 10https://gerrit.wikimedia.org/r/475979 (https://phabricator.wikimedia.org/T207050) [13:32:34] (03PS2) 10Vgutierrez: netbox: sslcert::dhparam needs to be included especifically now [puppet] - 10https://gerrit.wikimedia.org/r/475980 (https://phabricator.wikimedia.org/T207050) [13:32:36] (03PS2) 10Vgutierrez: archiva: sslcert::dhparam needs to be included especifically now [puppet] - 10https://gerrit.wikimedia.org/r/475981 (https://phabricator.wikimedia.org/T207050) [13:33:40] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4fullscreenrefresh=1morgId=1 [13:34:38] !log Change pc2007 and pc2010 to replicate from pc1010 instead of from pc1004 - T208383 [13:34:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:41] T208383: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] - https://phabricator.wikimedia.org/T208383 [13:40:06] 10Operations, 10ops-eqiad, 10DBA: rack/setup/install pc1007-pc1010 - https://phabricator.wikimedia.org/T207258 (10Cmjohnson) @marostegui it's fixed! Sorry about that [13:40:50] (03CR) 10Vgutierrez: [C: 031] "pcc looks happy and shows the expected changes: https://puppet-compiler.wmflabs.org/compiler1002/13731/" [puppet] - 10https://gerrit.wikimedia.org/r/475978 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [13:41:27] 10Operations, 10ops-eqiad, 10DBA: rack/setup/install pc1007-pc1010 - https://phabricator.wikimedia.org/T207258 (10Marostegui) Thanks for the fast response @Cmjohnson! Will you re-install it or should I? Thanks! [13:41:45] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install backup2001 - https://phabricator.wikimedia.org/T196477 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by banyek on cumin1001.eqiad.wmnet for hosts: ` ['backup2001.codfw.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/... [13:42:11] its not me [13:42:59] (03CR) 10Vgutierrez: [C: 031] "pcc is happy and shows the expected changes: https://puppet-compiler.wmflabs.org/compiler1002/13732/" [puppet] - 10https://gerrit.wikimedia.org/r/475979 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [13:43:36] banyek: what do you mean is not you? [13:43:56] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install backup2001 - https://phabricator.wikimedia.org/T196477 (10Banyek) >>! 
In T196477#4777495, @ops-monitoring-bot wrote: > Script wmf-auto-reimage was launched by banyek on cumin1001.eqiad.wmnet for hosts: > ` > ['backup2001.codfw.wmnet'] > ` > Th... [13:44:03] 14:41 `Script wmf-auto-reimage was launched by banyek on cumin1001.eqiad.wmnet` [13:44:09] this is not me [13:44:58] (03PS8) 10Muehlenhoff: Script to generate service principals/keytabs (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/470566 [13:46:55] (03CR) 10Filippo Giunchedi: [C: 031] "> Patch Set 1:" (031 comment) [debs/docker-distribution] (debian/stretch-wikimedia) - 10https://gerrit.wikimedia.org/r/475792 (https://phabricator.wikimedia.org/T210071) (owner: 10Fsero) [13:49:06] 10Operations, 10Puppet, 10Patch-For-Review: uwsgi::app sorts config keys, but the .ini file behavior depends on order - https://phabricator.wikimedia.org/T191648 (10akosiaris) [13:49:12] 10Operations, 10Puppet, 10Patch-For-Review: deprecate and remove --autoload in uwsgi puppet class - https://phabricator.wikimedia.org/T192102 (10akosiaris) 05Open>03Resolved a:03akosiaris Finally resolved [13:49:40] 10Operations, 10Puppet, 10Patch-For-Review: uwsgi::app sorts config keys, but the .ini file behavior depends on order - https://phabricator.wikimedia.org/T191648 (10akosiaris) 05Open>03Resolved a:03akosiaris Child task resolved, resolving this as well [13:50:21] (03PS1) 10Marostegui: pc1010: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/476014 (https://phabricator.wikimedia.org/T208383) [13:50:32] (03CR) 10Amire80: [C: 031] Add shnwiki to InterwikiSortOrders.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473510 (https://phabricator.wikimedia.org/T206777) (owner: 10Reedy) [13:50:36] 10Operations, 10ops-eqiad, 10DBA: rack/setup/install pc1007-pc1010 - https://phabricator.wikimedia.org/T207258 (10Cmjohnson) @marostegui if you don't mind can you do the reinstall. 
Thanks [13:52:09] (03PS1) 10Marostegui: db-eqiad.php: Pool pc1010 in pc1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476015 (https://phabricator.wikimedia.org/T208383) [13:52:38] (03CR) 10Marostegui: [C: 032] pc1010: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/476014 (https://phabricator.wikimedia.org/T208383) (owner: 10Marostegui) [13:55:12] jouncebot: next [13:55:13] In 0 hour(s) and 4 minute(s): MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181127T1400) [13:56:00] (03CR) 10Alex Monk: "per latest discussion on IRC, the OCSP stapler will need this" [puppet] - 10https://gerrit.wikimedia.org/r/475968 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [13:56:02] (03CR) 10Banyek: [C: 031] "banyek@cumin1001:~ $ host 10.64.32.72" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476015 (https://phabricator.wikimedia.org/T208383) (owner: 10Marostegui) [13:56:05] 10Operations, 10ops-eqiad, 10DBA: rack/setup/install pc1007-pc1010 - https://phabricator.wikimedia.org/T207258 (10Marostegui) >>! In T207258#4777536, @Cmjohnson wrote: > @marostegui if you don't mind can you do the reinstall. Thanks Will do! Thank you! [13:56:28] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Pool pc1010 in pc1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476015 (https://phabricator.wikimedia.org/T208383) (owner: 10Marostegui) [13:56:40] (03PS4) 10Gehel: elasticsearch: configure LVS endpoint for new codfw clusters [puppet] - 10https://gerrit.wikimedia.org/r/475753 (https://phabricator.wikimedia.org/T207195) [13:57:24] 10Operations, 10ops-eqiad, 10DBA: rack/setup/install pc1007-pc1010 - https://phabricator.wikimedia.org/T207258 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` pc1008.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201811271357_m... 
[13:57:36] (03CR) 10Gehel: elasticsearch: configure LVS endpoint for new codfw clusters (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/475753 (https://phabricator.wikimedia.org/T207195) (owner: 10Gehel) [13:58:00] (03Merged) 10jenkins-bot: db-eqiad.php: Pool pc1010 in pc1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476015 (https://phabricator.wikimedia.org/T208383) (owner: 10Marostegui) [13:59:04] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Pool pc1010 in pc1 - T208383 (duration: 00m 46s) [13:59:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:08] T208383: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] - https://phabricator.wikimedia.org/T208383 [14:00:04] hashar: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki train - European version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181127T1400). [14:00:26] !log repooling db1105 due a schema change (T85757) [14:00:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:29] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [14:00:50] (03PS1) 10GTirloni: wmcs: Exclude tracefs from check_disk [puppet] - 10https://gerrit.wikimedia.org/r/476016 (https://phabricator.wikimedia.org/T208465) [14:00:55] (03PS1) 10Banyek: Revert "mariadb: depool db1105" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476017 [14:01:11] (03PS2) 10GTirloni: wmcs: Exclude tracefs from check_disk [puppet] - 10https://gerrit.wikimedia.org/r/476016 (https://phabricator.wikimedia.org/T208465) [14:01:13] o/ hashar [14:01:41] hashar: Yesterday we deployed https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/475105/ but for some reason I don't see the change. Can you help find the issue? 
[14:01:53] (03CR) 10GTirloni: [C: 032] wmcs: Exclude tracefs from check_disk [puppet] - 10https://gerrit.wikimedia.org/r/476016 (https://phabricator.wikimedia.org/T208465) (owner: 10GTirloni) [14:02:07] hashar: I suspect it has to do something with CommonSettings-labs and InitialiseSettings-labs PHP files [14:02:10] (03CR) 10Banyek: [C: 032] Revert "mariadb: depool db1105" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476017 (owner: 10Banyek) [14:02:29] (03CR) 10Banyek: [V: 032 C: 032] Revert "mariadb: depool db1105" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476017 (owner: 10Banyek) [14:02:43] hashar: do I need to move the config from InitialiseSettings-labs to CommonSettings-labs? [14:02:44] (03PS2) 10Banyek: Revert "mariadb: depool db1105" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476017 [14:02:47] (03CR) 10Banyek: [V: 032 C: 032] Revert "mariadb: depool db1105" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476017 (owner: 10Banyek) [14:04:15] (03PS1) 10Ema: ATS: log X-Cache-Status and X-Client-IP [puppet] - 10https://gerrit.wikimedia.org/r/476018 (https://phabricator.wikimedia.org/T204225) [14:04:31] !log Cutting 1.33.0-wmf.6 branches | T206660 [14:04:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:35] T206660: 1.33.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T206660 [14:04:36] think you might need to pass it in via a wmg variable bmansurov [14:05:21] Krenair: hey, according to Timo yesterday, it wouldn't work. Do you think it will work? 
[14:05:24] (03PS3) 10Vgutierrez: netbox: sslcert::dhparam needs to be included especifically now [puppet] - 10https://gerrit.wikimedia.org/r/475980 (https://phabricator.wikimedia.org/T207050) [14:05:26] (03PS3) 10Vgutierrez: archiva: sslcert::dhparam needs to be included especifically now [puppet] - 10https://gerrit.wikimedia.org/r/475981 (https://phabricator.wikimedia.org/T207050) [14:05:28] (03PS1) 10Vgutierrez: sslcert: Avoid /etc/ssl/dhparam.pem redeclaration [puppet] - 10https://gerrit.wikimedia.org/r/476020 (https://phabricator.wikimedia.org/T207050) [14:05:34] 10Puppet, 10Beta-Cluster-Infrastructure: Puppet fail on deployment-mediawiki-07, missing private hiera variable - https://phabricator.wikimedia.org/T210497 (10fgiunchedi) [14:05:46] not sure, it's been a while [14:06:04] Krenair: ok [14:06:41] I wasn't aware of the train and I merged a repooling at 14:00 [14:07:19] (03CR) 10Ema: [C: 032] ATS: log X-Cache-Status and X-Client-IP [puppet] - 10https://gerrit.wikimedia.org/r/476018 (https://phabricator.wikimedia.org/T204225) (owner: 10Ema) [14:07:27] I can hold the deploy until the train finishes, but I wanted to point out that there's a change in the db config [14:08:36] (03CR) 10jenkins-bot: db-eqiad.php: Pool pc1010 in pc1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476015 (https://phabricator.wikimedia.org/T208383) (owner: 10Marostegui) [14:08:39] (03CR) 10jenkins-bot: Revert "mariadb: depool db1105" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476017 (owner: 10Banyek) [14:10:09] 10Operations, 10DBA, 10Patch-For-Review, 10User-Banyek: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] - https://phabricator.wikimedia.org/T208383 (10Marostegui) pc1010 has been pooled into pc1 - T208383#4777571 [14:10:18] (03PS1) 10Andrew Bogott: Horizon: move more projects to eqiad1-r [puppet] - 10https://gerrit.wikimedia.org/r/476022 (https://phabricator.wikimedia.org/T204745) [14:10:32] 10Operations, 10DBA, 10Patch-For-Review, 
10User-Banyek: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] - https://phabricator.wikimedia.org/T208383 (10Marostegui) [14:11:38] (03Abandoned) 10Vgutierrez: sslcert: Avoid /etc/ssl/dhparam.pem redeclaration [puppet] - 10https://gerrit.wikimedia.org/r/476020 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [14:11:59] (03CR) 10Andrew Bogott: [C: 032] Horizon: move more projects to eqiad1-r [puppet] - 10https://gerrit.wikimedia.org/r/476022 (https://phabricator.wikimedia.org/T204745) (owner: 10Andrew Bogott) [14:13:31] (03Abandoned) 10Vgutierrez: librenms: sslcert::dhparam needs to be included especifically now [puppet] - 10https://gerrit.wikimedia.org/r/475979 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [14:13:36] (03Abandoned) 10Vgutierrez: netbox: sslcert::dhparam needs to be included especifically now [puppet] - 10https://gerrit.wikimedia.org/r/475980 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [14:13:46] si [14:13:53] gah, not here [14:15:08] (03PS4) 10Vgutierrez: archiva: sslcert::dhparam needs to be included especifically now [puppet] - 10https://gerrit.wikimedia.org/r/475981 (https://phabricator.wikimedia.org/T207050) [14:15:10] (03PS1) 10Vgutierrez: netmon: sslcert::dhparam needs to be included especifically now [puppet] - 10https://gerrit.wikimedia.org/r/476025 (https://phabricator.wikimedia.org/T207050) [14:15:29] carrying a long-tail of commits in the same branch is a PITA :( sorry about the noise [14:16:58] (03CR) 10jerkins-bot: [V: 04-1] netmon: sslcert::dhparam needs to be included especifically now [puppet] - 10https://gerrit.wikimedia.org/r/476025 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [14:17:01] 10Operations, 10ops-eqiad, 10DBA: rack/setup/install pc1007-pc1010 - https://phabricator.wikimedia.org/T207258 (10Marostegui) Thanks @Cmjohnson it looks good now! 
` RAID Level : Primary-5, Secondary-0, RAID Level Qualifier-3 Size : 4.364 TB Sector Size : 512 Is VD emulated... [14:17:50] 10Operations, 10ops-eqiad, 10DBA: rack/setup/install pc1007-pc1010 - https://phabricator.wikimedia.org/T207258 (10Marostegui) [14:17:59] (03PS2) 10Vgutierrez: netmon: sslcert::dhparam needs to be included especifically now [puppet] - 10https://gerrit.wikimedia.org/r/476025 (https://phabricator.wikimedia.org/T207050) [14:18:29] (03CR) 10Vgutierrez: [C: 031] "pcc looks happy and shows the expected changes: https://puppet-compiler.wmflabs.org/compiler1002/13736/" [puppet] - 10https://gerrit.wikimedia.org/r/475981 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [14:20:09] 10Operations, 10hardware-requests: Procure logstash hardware in eqiad - https://phabricator.wikimedia.org/T210498 (10fgiunchedi) [14:20:43] RECOVERY - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is OK: (C)130 ge (W)110 ge 108.3 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1panelId=2fullscreen [14:20:47] 10Operations, 10Wikimedia-Logstash, 10User-fgiunchedi, 10User-herron: Logstash hardware expansion - https://phabricator.wikimedia.org/T203169 (10fgiunchedi) Procurement for eqiad hw is at {T210498} (Phase 2) [14:21:22] (03CR) 10Vgutierrez: [C: 031] "pcc is happy and shows the expected changes: https://puppet-compiler.wmflabs.org/compiler1002/13737/" [puppet] - 10https://gerrit.wikimedia.org/r/476025 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [14:21:39] (03CR) 10Vgutierrez: [C: 032] certcentral: Handle common requirements for certcentral clients [puppet] - 10https://gerrit.wikimedia.org/r/475968 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [14:21:48] (03PS6) 10Vgutierrez: certcentral: Handle common requirements for certcentral clients [puppet] - 10https://gerrit.wikimedia.org/r/475968 (https://phabricator.wikimedia.org/T207050) [14:22:55] Krenair: If I make some changes to 
a labs config file, would you be able to deploy it outside the SWAT window? [14:23:16] I'm not sure how to debug the issue otherwise. [14:23:41] bmansurov, I have not been a deployer for almost two years now [14:23:41] (03CR) 10Vgutierrez: [C: 032] install_server: Remove old LE puppetization [puppet] - 10https://gerrit.wikimedia.org/r/475965 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [14:23:49] (03PS5) 10Vgutierrez: install_server: Remove old LE puppetization [puppet] - 10https://gerrit.wikimedia.org/r/475965 (https://phabricator.wikimedia.org/T207050) [14:23:57] Krenair: oh, didn't realize. [14:24:08] 10Operations, 10ops-eqiad, 10DBA: rack/setup/install pc1007-pc1010 - https://phabricator.wikimedia.org/T207258 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['pc1008.eqiad.wmnet'] ` and were **ALL** successful. [14:28:24] 10Puppet, 10Beta-Cluster-Infrastructure: Puppet fail on deployment-mediawiki-07, missing private hiera variable - https://phabricator.wikimedia.org/T210497 (10fgiunchedi) Temporarily unblocked/fixed by adding said variables to "project puppet" hiera config in Horizon [14:29:20] 10Operations, 10Traffic, 10Patch-For-Review: Migrate most standard public TLS certificates to CertCentral issuance - https://phabricator.wikimedia.org/T207050 (10Vgutierrez) [14:30:39] (03CR) 10Ottomata: [C: 031] Allow analytics-admins to restart daemons with systemctl [puppet] - 10https://gerrit.wikimedia.org/r/475984 (owner: 10Elukey) [14:31:02] (03CR) 10Ottomata: [C: 031] hue: add kerberos config support [puppet/cdh] - 10https://gerrit.wikimedia.org/r/476011 (owner: 10Elukey) [14:32:17] (03PS1) 10Marostegui: db1095: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/476026 [14:32:19] (03PS1) 10Bmansurov: Labs: display reader trust survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476027 (https://phabricator.wikimedia.org/T209882) [14:32:53] (03CR) 10Marostegui: [C: 032] db1095: Enable notifications [puppet] - 
10https://gerrit.wikimedia.org/r/476026 (owner: 10Marostegui) [14:33:44] !log scap prep 1.33.0-wmf.6 | T206660 [14:33:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:47] T206660: 1.33.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T206660 [14:40:30] !log Applied security patches for 1.33.0-wmf.6 | T206660 [14:40:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:34] T206660: 1.33.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T206660 [14:41:35] (03PS1) 10Hashar: Group0 to 1.33.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476028 (https://phabricator.wikimedia.org/T206660) [14:41:38] 10Operations, 10ops-eqiad, 10DBA: rack/setup/install pc1007-pc1010 - https://phabricator.wikimedia.org/T207258 (10Cmjohnson) Dell ticket information for pc1007 You have successfully submitted request SR983104667. [14:42:21] (03PS1) 10Filippo Giunchedi: hieradata: add kafka_shipper::kafka_brokers variable to deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/476029 [14:44:34] 10Operations, 10Deployments: Make failures on foreachwiki more obvious the deployer - https://phabricator.wikimedia.org/T210474 (10Anomie) > Perhaps the failures could be gathered and displayed at the end? What's a "failure"? As mentioned at T209674#4754656, it seems unlikely that the generic `foreachwiki` s... 
[14:46:58] (03PS9) 10Muehlenhoff: Script to generate service principals/keytabs (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/470566 [14:48:15] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install backup2001 - https://phabricator.wikimedia.org/T196477 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['backup2001.codfw.wmnet'] ` Of which those **FAILED**: ` ['backup2001.codfw.wmnet'] ` [14:52:18] !log hashar@deploy1001 Pruned MediaWiki: 1.32.0-wmf.26 (duration: 09m 48s) [14:53:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:09] PROBLEM - Check systemd state on ms-be2031 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:56:29] RECOVERY - Check systemd state on ms-be2031 is OK: OK - running: The system is fully operational [15:02:32] !log hashar@deploy1001 Started scap: testwiki to php-1.33.0-wmf.6 and rebuild l10n cache | T206660 [15:02:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:35] T206660: 1.33.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T206660 [15:03:44] (03CR) 10Alexandros Kosiaris: [C: 031] puppet-merge: allow only showing diffs without merging [puppet] - 10https://gerrit.wikimedia.org/r/476008 (owner: 10Giuseppe Lavagetto) [15:04:53] (03PS1) 10Alexandros Kosiaris: Set backup2001's 10G interface in DHCP/PXE [puppet] - 10https://gerrit.wikimedia.org/r/476032 (https://phabricator.wikimedia.org/T196478) [15:06:09] (03CR) 10Alexandros Kosiaris: [C: 032] Set backup2001's 10G interface in DHCP/PXE [puppet] - 10https://gerrit.wikimedia.org/r/476032 (https://phabricator.wikimedia.org/T196478) (owner: 10Alexandros Kosiaris) [15:21:27] (03PS4) 10Banyek: mariadb: productionize dbproxy1015 and dbproxy1016 [puppet] - 10https://gerrit.wikimedia.org/r/473546 (https://phabricator.wikimedia.org/T202367) [15:21:53] !log create graphoid namespace on kubernetes eqiad, codfw, staging clusters T203091 [15:21:56] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:57] T203091: Move Graphoid to Kubernetes via the deployment pipeline - https://phabricator.wikimedia.org/T203091 [15:25:10] (03PS1) 10Elukey: hive: fix typo and allow principal in database config [puppet/cdh] - 10https://gerrit.wikimedia.org/r/476035 [15:25:58] 10Operations, 10DBA, 10JADE, 10TechCom-RFC, and 2 others: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10Halfak) The second bullet point ("filter RC for edits which are ORES-nondamaging and JADE-damaging") seems like a product propos... [15:27:13] (03PS2) 10Elukey: hive: fix typo and allow principal in database config [puppet/cdh] - 10https://gerrit.wikimedia.org/r/476035 [15:28:07] (03CR) 10Elukey: [C: 032] hive: fix typo and allow principal in database config [puppet/cdh] - 10https://gerrit.wikimedia.org/r/476035 (owner: 10Elukey) [15:30:46] (03PS1) 10Elukey: profile::hive::client: fix typo and update cdh module [puppet] - 10https://gerrit.wikimedia.org/r/476036 [15:31:34] (03PS10) 10Muehlenhoff: Script to generate service principals/keytabs (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/470566 [15:35:14] !log hashar@deploy1001 Finished scap: testwiki to php-1.33.0-wmf.6 and rebuild l10n cache | T206660 (duration: 32m 42s) [15:35:15] !log einsteinium - stopped icinga, stopped nsca, stopped rsyncd, killall -u icinga, killall -u nagios ... 
T209738 [15:35:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:17] T206660: 1.33.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T206660 [15:35:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:20] T209738: decom einsteinium - https://phabricator.wikimedia.org/T209738 [15:37:12] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/compiler1002/13738/" [puppet] - 10https://gerrit.wikimedia.org/r/476036 (owner: 10Elukey) [15:37:22] mutante: best to also remove the icinga packages I'd say [15:38:19] (03PS1) 10Papaul: DHCP: Add MAC address entries for restbase201[3-8] [puppet] - 10https://gerrit.wikimedia.org/r/476038 (https://phabricator.wikimedia.org/T209615) [15:38:54] (03PS4) 10Jcrespo: admin: Add Jeena Huneidi access to the production cluster [puppet] - 10https://gerrit.wikimedia.org/r/475986 (https://phabricator.wikimedia.org/T210027) [15:38:56] (03PS2) 10Jcrespo: admin: Add jgleeson access to production cluster [puppet] - 10https://gerrit.wikimedia.org/r/476004 (https://phabricator.wikimedia.org/T208432) [15:38:58] (03PS1) 10Jcrespo: admin: Add Ryan Steinberg and Joe Wass access to production cluster [puppet] - 10https://gerrit.wikimedia.org/r/476039 (https://phabricator.wikimedia.org/T209298) [15:39:49] !log rebooting kafkamon2001 for kernel security update [15:39:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:31] !log einsteinium - removed icinga package [15:41:31] PROBLEM - puppet last run on icinga1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:41:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:11] moritzm: ok, done! and that timing above made me almost think i ran it on the wrong host.. 
it just added backup2001 though [15:43:46] :-) [15:44:13] !log rebooting kafkamon1001 for kernel security update [15:44:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:15] (03CR) 10Volans: "I didn't test PS2 but looks good to me. +1 for me apart the open discussion about the extract_text_from_pdf() functions, for which I don'" (035 comments) [software] - 10https://gerrit.wikimedia.org/r/475933 (owner: 10Faidon Liambotis) [15:46:21] (03PS1) 10Papaul: PARTMAN: Add restbase201[3-8] [puppet] - 10https://gerrit.wikimedia.org/r/476040 (https://phabricator.wikimedia.org/T209615) [15:46:31] RECOVERY - puppet last run on icinga1001 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [15:46:37] so long einsteinium! [15:46:39] (03CR) 10Hashar: [C: 032] Group0 to 1.33.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476028 (https://phabricator.wikimedia.org/T206660) (owner: 10Hashar) [15:47:43] (03Merged) 10jenkins-bot: Group0 to 1.33.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476028 (https://phabricator.wikimedia.org/T206660) (owner: 10Hashar) [15:48:45] (03CR) 10jenkins-bot: Group0 to 1.33.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476028 (https://phabricator.wikimedia.org/T206660) (owner: 10Hashar) [15:51:47] (03PS3) 10Cwhite: mw_rc_irc: remove diamond::collector resource and collector script [puppet] - 10https://gerrit.wikimedia.org/r/475010 (https://phabricator.wikimedia.org/T183454) [15:52:21] (03CR) 10jerkins-bot: [V: 04-1] mw_rc_irc: remove diamond::collector resource and collector script [puppet] - 10https://gerrit.wikimedia.org/r/475010 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [15:52:31] (03CR) 10Jforrester: [C: 04-2] "Reverts to patches from the security team should only be merged by their explicit consent. Putting in a C-2 just to make sure." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/475947 (owner: 10Chad) [15:53:16] lets roll it [15:53:25] !log rebooting planet2001 for kernel security update [15:53:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:31] (03PS1) 10Ottomata: Use proper Kafka cluster for sending EventLogging errors to logstash [puppet] - 10https://gerrit.wikimedia.org/r/476043 (https://phabricator.wikimedia.org/T205437) [15:53:33] almost [15:54:10] there is a patch for Revert "mariadb: depool db1105" by banyek [15:54:11] (03CR) 10Faidon Liambotis: Initial commit of quotereviewer (033 comments) [software] - 10https://gerrit.wikimedia.org/r/475933 (owner: 10Faidon Liambotis) [15:54:32] (03PS3) 10Faidon Liambotis: Initial commit of quotereviewer [software] - 10https://gerrit.wikimedia.org/r/475933 [15:54:41] (03CR) 10Ottomata: [C: 032] Use proper Kafka cluster for sending EventLogging errors to logstash [puppet] - 10https://gerrit.wikimedia.org/r/476043 (https://phabricator.wikimedia.org/T205437) (owner: 10Ottomata) [15:56:21] (03PS1) 10Arturo Borrero Gonzalez: openstack: clientpackages: introduce support for mitaka/stretch [puppet] - 10https://gerrit.wikimedia.org/r/476044 (https://phabricator.wikimedia.org/T209948) [15:56:38] (03Abandoned) 10Cwhite: mw_rc_irc: remove diamond::collector resource and collector script [puppet] - 10https://gerrit.wikimedia.org/r/475010 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [15:57:07] (checking with dbas) [15:57:27] (03PS2) 10Arturo Borrero Gonzalez: openstack: clientpackages: introduce support for mitaka/stretch [puppet] - 10https://gerrit.wikimedia.org/r/476044 (https://phabricator.wikimedia.org/T209948) [15:58:38] (03CR) 10Arturo Borrero Gonzalez: [C: 032] openstack: clientpackages: introduce support for mitaka/stretch [puppet] - 10https://gerrit.wikimedia.org/r/476044 (https://phabricator.wikimedia.org/T209948) (owner: 10Arturo Borrero Gonzalez) [15:59:02] (03CR) 10Cwhite: [C: 031] Absent 
Redis Diamond collector on Redis slaves [puppet] - 10https://gerrit.wikimedia.org/r/475967 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [15:59:05] (03CR) 10Ottomata: [C: 031] "+1 to the idea, we should ask a logstash person if this makes sense as a tag" [puppet] - 10https://gerrit.wikimedia.org/r/475322 (https://phabricator.wikimedia.org/T205437) (owner: 10Phuedx) [15:59:42] (03CR) 10Cwhite: [C: 031] Absent Redis Diamond collector on Redis masters [puppet] - 10https://gerrit.wikimedia.org/r/475969 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [16:04:58] o/ didn't see that i'd dc'ed [16:04:59] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1105 (duration: 00m 53s) [16:05:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:33] PROBLEM - Check systemd state on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused [16:05:43] PROBLEM - Disk space on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused [16:05:57] PROBLEM - dhclient process on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused [16:06:09] PROBLEM - MD RAID on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused [16:06:13] PROBLEM - configured eth on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused [16:06:17] PROBLEM - DPKG on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused [16:06:59] hmm elukey ottomata ^ ? [16:07:24] akosiaris: oom party probably, somebody misusing the host :( [16:07:25] seems like nagios-nrpe-server crashed.. 
looking if that's it [16:07:43] [Tue Nov 27 16:04:08 2018] nrpe[14021]: segfault at 1 ip 000055be5969ce91 sp 00007ffd77d74d70 error 6 in nrpe[55be59696000+e000] [16:07:44] nope [16:07:45] it's out of disk [16:07:55] sigh [16:07:59] RECOVERY - Host ms-be2047.mgmt is UP: PING WARNING - Packet loss = 61%, RTA = 42.39 ms [16:08:00] -bash: fork: Cannot allocate memory [16:08:00] hit it with the hammer [16:08:03] WHO DID IT?! [16:08:12] get out the blame cannons [16:08:25] I don't see anything about a recent oom [16:08:36] most recent is on Nov 8 [16:08:36] akosiaris: we can't log in -bash: fork: Cannot allocate memory [16:08:41] I did [16:08:52] i got the message but was logged in anyways [16:09:01] me too, and I don't see anything filled up [16:09:05] ah wait I can not see the memory increased to 100% [16:09:07] maybe it was temp? [16:09:20] (03PS2) 10Bmansurov: Labs: display reader trust survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476027 (https://phabricator.wikimedia.org/T209882) [16:09:22] it's otto btw the user of the worst culprit CPU wise [16:09:23] :P [16:09:29] wut no way [16:09:31] ? [16:09:36] Job for nagios-nrpe-server.service failed because of unavailable resources or another system error. [16:09:41] ^ this is why the alerts then [16:09:49] !log rebooting planet1001 for kernel security update [16:09:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:27] something about toree and spark and jupyter running in java since Oct 18 ? [16:10:27] PROBLEM - puppet last run on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused [16:10:28] jupyterhub crashed ? [16:10:42] (03CR) 10Ottomata: [C: 031] Add the Hadoop worker nodes' racking awareness config [puppet] - 10https://gerrit.wikimedia.org/r/474904 (https://phabricator.wikimedia.org/T209929) (owner: 10Elukey) [16:10:52] akosiaris: jupyter loves to keep kernels running [16:10:55] godog: yt? 
i had a question about tags vs fields in logstash and was wondering if you could help/knew who to ask [16:10:56] would quit them if i could log in :p [16:11:03] ok lemme kill that one [16:11:08] plz kill away [16:11:19] ah i think i'm in [16:11:24] ah great [16:11:28] I can't call kill [16:11:32] cause no memory [16:11:39] shutting down via UI [16:11:43] PROBLEM - Check the NTP synchronisation status of timesyncd on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused [16:11:51] logging out to give back space :p [16:11:53] phuedx: hi! sure what's up? it is mainly me and herron looking after logstash atm [16:12:08] ok maybe that fixed it.. it just released 3-4 GB [16:12:40] godog: context: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/475322/ [16:12:49] also.. i had run 'apt-get clean' to free a little space [16:12:52] there is another by a user called dsaez that has 18G of RSS and another with 12G RSS [16:12:52] phuedx: looking at your tag change more, i see that code, message and raw event are all being pulled out into a top level field [16:12:58] maybe we should just do the same with event.schema ? [16:12:59] akosiaris: systemctl start nagios-nrpe-server ? 
[16:13:57] godog: i'd like it to be trivial to filter eventlogging validation errors sent to logstash by eventlogging schema [16:13:59] RECOVERY - Disk space on notebook1003 is OK: DISK OK [16:14:03] !log hashar@deploy1001 rebuilt and synchronized wikiversions files: group0 to 1.33.0-wmf.6 [16:14:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:11] RECOVERY - dhclient process on notebook1003 is OK: PROCS OK: 0 processes with command name dhclient [16:14:14] ok fixed, /me logging out, this is a ticking time bomb ofc [16:14:21] but from what I gather it's always like that [16:14:25] RECOVERY - MD RAID on notebook1003 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [16:14:29] RECOVERY - configured eth on notebook1003 is OK: OK - interfaces up [16:14:31] godog: afaict, we can do that by either pulling out the event.schema property out into a top-level field for the log line or by making it a tag [16:14:33] RECOVERY - DPKG on notebook1003 is OK: All packages OK [16:14:57] RECOVERY - Check systemd state on notebook1003 is OK: OK - running: The system is fully operational [16:15:07] godog: i was wondering if there's guidelines as to whether to use tags vs fields [16:15:16] ottomata: i guess it makes sense to be consistent [16:15:36] phuedx: ack, yeah I reckon a field would be better as tags as used now are mostly metadata for the message as it goes through the logstash pipelines [16:15:37] RECOVERY - puppet last run on notebook1003 is OK: OK: Puppet is currently enabled, last run 28 minutes ago with 0 failures [16:15:55] phuedx: I'm not aware of guidelines but for sure would be nice to have some, I'll comment on the review [16:16:13] godog: thanks! 
that's very helpful [16:16:25] ottomata, godog: i'll update the change to pull out the schema into a top-level field [16:17:14] !log einsteinium - apt-get remove --purge icinga nsca; apt-get autoremove ; apt-get remove --purge icinga-doc icinga-common icinga-cgi-bin icinga-cgi; apt-get remove --purge monitoring-plugin* ; rm /etc/rsync.d/frag-icinga* T209738 [16:17:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:18] T209738: decom einsteinium - https://phabricator.wikimedia.org/T209738 [16:17:56] (03PS2) 10Bstorm: toolforge: add qpdf, unpaper, and pngquant to exec nodes [puppet] - 10https://gerrit.wikimedia.org/r/475233 (https://phabricator.wikimedia.org/T204422) [16:20:09] PROBLEM - Host ms-be2047.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:20:55] (03CR) 10Filippo Giunchedi: "I think a field makes more sense in this case: tags are as used now are mostly metadata for the message as it goes through the logstash pi" [puppet] - 10https://gerrit.wikimedia.org/r/475322 (https://phabricator.wikimedia.org/T205437) (owner: 10Phuedx) [16:22:33] (03PS3) 10Phuedx: eventlogging/logstash: Make schema a field [puppet] - 10https://gerrit.wikimedia.org/r/475322 (https://phabricator.wikimedia.org/T205437) [16:23:11] (03PS4) 10Phuedx: eventlogging/logstash: Make schema a field [puppet] - 10https://gerrit.wikimedia.org/r/475322 (https://phabricator.wikimedia.org/T205437) [16:23:27] phuedx: let's do it [16:23:36] ottomata: done! 
[16:23:58] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/475322 (https://phabricator.wikimedia.org/T205437) (owner: 10Phuedx) [16:24:22] godog: thanks again :) [16:24:51] Hmm phuedx [16:24:59] actually godog [16:25:01] what happens here [16:25:12] there is already a top level 'schema' field in the log [16:25:24] for these [16:25:32] it is always schema: EventError [16:25:37] event.schema is the schema that caused the error [16:26:01] not really sure what this rename => config does [16:26:37] ottomata: if there's already a schema field rename will overwrite it iirc [16:27:04] godog: i'm looking at one of these in kibana now [16:27:09] and I don't see any top level fields 'renamed' [16:27:10] now [16:27:18] e.g. i don't see a 'code' at top level anywhere [16:28:26] ottomata: i've just noticed that too [16:28:47] it looks like the mutate filter isn't taking effect [16:28:58] yeah doesn't look like eventlogging_EventError is ever in 'tags' so the first if guard never matches [16:30:01] i.e. https://logstash.wikimedia.org/goto/c4a8b7c6cd8721cbc8dd5296dce22db3 [16:30:13] well eventlogging_EventError is being removed from tags [16:30:19] i think mutate is working [16:30:23] it is adding level => ERROR [16:30:33] godog, ottomata: is it a codec problem? [16:30:43] ottomata: ah! indeed you are right so that's expected [16:31:04] the eventlogging_EventError is added elsewhere and then removed by the filter afaict [16:31:26] trying to see where eventlogging_EventError is added as a tag [16:31:28] but i don't see that [16:31:34] unless the kafka input does it implicitly by the topic [16:31:56] ottomata: ^ that's exactly what it does [16:31:57] (iirc) [16:32:44] phedenskog: i think it is a codec problem....maybe [16:32:51] i see that "message" in logstash is a raw string [16:32:59] so I think it's not parsing it as json [16:33:48] ottomata: wait... i'm not actually sure about that any more. 
i'm looking at the kafka input template and i don't see it adding the topic to the tags [16:34:39] ottomata: wait... i got confused. the topic is added as a tag. sorry for the flip flop [16:34:55] (03CR) 10Volans: [C: 031] "LGTM" (033 comments) [software] - 10https://gerrit.wikimedia.org/r/475933 (owner: 10Faidon Liambotis) [16:39:16] ottomata, godog: new change is up to change the codec for the input. i'm not sure how to test it :/ [16:39:31] phuedx: in standup, will help test shortly [16:39:32] (03PS5) 10Phuedx: eventlogging/logstash: Make schema a field [puppet] - 10https://gerrit.wikimedia.org/r/475322 (https://phabricator.wikimedia.org/T205437) [16:39:34] (03PS1) 10Phuedx: eventlogging/logstash: Events are encoded as JSON [puppet] - 10https://gerrit.wikimedia.org/r/476050 (https://phabricator.wikimedia.org/T205437) [16:41:53] RECOVERY - Check the NTP synchronisation status of timesyncd on notebook1003 is OK: OK: synced at Tue 2018-11-27 16:41:52 UTC. [16:42:16] 10Operations, 10ops-codfw, 10netops: codfw row A recable and add QFX - https://phabricator.wikimedia.org/T210447 (10akosiaris) > ores2001, 2*ganeti, 15*mw > cc @akosiaris to know what specific actions need to be taken for Ores and Ganeti for ores2001, nothing is really required aside from some downtime in i... [16:43:22] 10Operations, 10ops-codfw, 10netops: codfw row D recable and add QFX - https://phabricator.wikimedia.org/T210467 (10akosiaris) > @akosiaris for ores2008 Just schedule downtime in icinga and do whatever actions are required. The service will happily keep chugging along on the other 8 hosts in eqiad. [16:48:43] that worked phuedx [16:49:34] phuedx: actually, why are we pulling out these event.* fields into top level at all? [16:49:38] can we not just search on event.schema ? 
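A rough sketch of the codec problem discussed above, assuming the error events are JSON objects with a top-level `schema` of `EventError` and a nested `event` object. The payload values and the `lookup` helper are hypothetical and only illustrate why `event.schema` is not searchable while the payload is kept as one raw string:

```python
import json

# Hypothetical EventError payload; field names follow the discussion above,
# the values are made up for illustration.
raw_message = json.dumps({
    "schema": "EventError",
    "event": {
        "schema": "UniversalLanguageSelector",
        "code": "validation",
        "message": "example validation failure",
    },
})


def lookup(event, dotted_path):
    """Return the value at a dotted path, or None if any step is missing."""
    current = event
    for part in dotted_path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


# Without JSON decoding the whole payload is one opaque string field.
unparsed = {"message": raw_message}
# With JSON decoding the payload's keys become fields of the event.
parsed = json.loads(raw_message)

print(lookup(unparsed, "event.schema"))
print(lookup(parsed, "event.schema"))
```

The first lookup finds nothing because the nested keys only exist inside the string; the second finds the schema name once the payload has been decoded.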
[16:50:07] 10Operations, 10decommission, 10Continuous-Integration-Infrastructure (shipyard), 10Patch-For-Review, and 2 others: Remove labnodepool1001.eqiad.wmnet - https://phabricator.wikimedia.org/T209642 (10Andrew) a:03Andrew [16:50:32] maybe not [16:51:27] ya you can! [16:51:32] event.schema:UniversalLanguageSelector [16:53:03] (03CR) 10Ottomata: [C: 032] eventlogging/logstash: Events are encoded as JSON [puppet] - 10https://gerrit.wikimedia.org/r/476050 (https://phabricator.wikimedia.org/T205437) (owner: 10Phuedx) [16:53:22] ottomata: yeah! i think we can drop my second change [16:53:36] just wondering why we are pulling out those fields at all [16:53:38] maybe we should stop doing that [16:53:41] maybe just keep message [16:53:42] but that's it [16:53:56] (03CR) 10Phuedx: [C: 04-1] "Ottomata points out that we can just search for the event.schema field." [puppet] - 10https://gerrit.wikimedia.org/r/475322 (https://phabricator.wikimedia.org/T205437) (owner: 10Phuedx) [16:54:32] oh phuedx maybe it's for elastic search reasons? [16:54:40] maybe it's easier to search and cache top level fields? [16:54:57] godog: any idea there? [16:55:53] ottomata: good question, I don't know tbh e.g. if es treats field names specially [16:56:24] any idea who we should ask? [16:57:14] my guess would be ebernhardson / gehel / dcausse [16:57:19] what about field names? [16:57:41] ebernhardson: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/475322/3/modules/role/files/logstash/filter-eventlogging.conf [16:57:53] unsure why we are renaming sub level fields to top level ones [16:57:58] this is ELK stuff [16:58:14] is there some elasticsearch reason why we wouldn't want to leave the fields in the event subobject? [16:58:41] ottomata: no, elastic will query sub fields just fine. There are weird edge cases, but if you are querying a single field it doesn't matter [17:00:05] godog and _joe_: How many deployers does it take to do Puppet SWAT(Max 6 patches) deploy? 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181127T1700). [17:00:05] No GERRIT patches in the queue for this window AFAICS. [17:00:30] ottomata: the only possibility would be if `event` was marked no-index, but i don't think we've marked anything no-index in logstash yet [17:00:57] (03PS1) 10Alexandros Kosiaris: Add graphoid kubernetes tokens [puppet] - 10https://gerrit.wikimedia.org/r/476052 (https://phabricator.wikimedia.org/T203091) [17:01:31] ebernhardson, ottomata: there do seem to be limitations in kibana's query support for nested objects though [17:01:48] oh ya? [17:01:52] there may still be good reason to pull out the event.schema property and index it [17:01:52] phuedx: how so? Everything should be dotted notation [17:02:17] ottomata: i can't seem to get the "event.schema:Foo" search in kibana to fail, it just returns all loglines :/ [17:02:29] phuedx: like when you send a doc {"foo": {"bar": "baz"}} what elasticsearch actually indexes is "foo.bar": "baz". The other structure is lost [17:02:41] https://www.elastic.co/guide/en/kibana/current/nested-objects.html [17:02:56] phuedx: that's a different kind of nested object [17:03:05] phuedx: those are for child documents [17:03:13] * phuedx is now lost ;) [17:03:37] phuedx: my search works, but i don't know how to share it with you [17:03:44] phuedx: when you index {"foo": {"bar": "baz"}} and "foo" is a child document, then you index {"foo": "doc 1234567"}, and {"bar": "baz"} separately [17:03:45] i just have event.schema:UniversalLanguageSelector in the search box [17:03:56] phuedx: but we don't use child documents in logstash, so you get {"foo.bar": "baz"} [17:04:07] 10Operations, 10Wikidata, 10Wikidata-Query-Service: WDQS puppet/hiera configs are too distributed - https://phabricator.wikimedia.org/T210431 (10jcrespo) Your ticket seems reasonable, however I am unsure if you need something from #Operations explicitly at the moment- AFAIK, nothing in the puppet style polic...
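Editor's sketch, not part of the channel log: the flattening ebernhardson describes above can be modelled in a few lines of Python. This is only a toy illustration of the dotted-key idea, assuming plain nested dicts and no child documents; the `flatten` helper is hypothetical and is not how Elasticsearch or Logstash implement indexing.

```python
def flatten(doc, prefix=""):
    """Toy model: turn nested dicts into a flat dict with dotted keys."""
    flat = {}
    for key, value in doc.items():
        # Build the dotted path, e.g. "event" + "schema" -> "event.schema".
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into sub-objects; the nesting itself is not kept.
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


if __name__ == "__main__":
    print(flatten({"foo": {"bar": "baz"}}))
```

Under this model a search such as event.schema:UniversalLanguageSelector addresses the flattened key directly, which is consistent with the point made in the channel that the field does not need to be renamed to the top level to be searchable.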
[17:04:17] ottomata: have you tried it with a schema that doesn't exist? [17:04:44] phuedx: i get no results found [17:04:49] huh [17:04:55] event.schema:Nonya [17:05:06] i don't have any other filters in the search though [17:05:09] just that [17:05:29] * phuedx facepalms [17:05:29] remember that kibana's default combiner is 'OR', so if you want to combine things you need 'AND' between them [17:05:49] i missed the and ;) [17:05:53] ^ what ebernhardson said [17:06:14] i suppose or/and is annoying to deal with. I typically use + and - [17:06:20] (03CR) 10Alexandros Kosiaris: [C: 032] Add graphoid kubernetes tokens [puppet] - 10https://gerrit.wikimedia.org/r/476052 (https://phabricator.wikimedia.org/T203091) (owner: 10Alexandros Kosiaris) [17:06:26] ottomata: it might be prudent to investigate what fields we're pulling out and why but i'm calling the task done [17:06:41] ok, since we are resurrecting now, i'm going to go ahead and do what I think is best: [17:06:47] i'm only going to pull out message and 'host' [17:06:48] that's it [17:06:53] code and rawEvent will stay in event. [17:06:54] ya ok? [17:07:25] i can't see why we'd remove this eventloggingEventError tag anyway [17:07:30] i'm going to leave it in too [17:07:34] (03Abandoned) 10Phuedx: eventlogging/logstash: Make schema a field [puppet] - 10https://gerrit.wikimedia.org/r/475322 (https://phabricator.wikimedia.org/T205437) (owner: 10Phuedx) [17:07:55] ottomata: to share logstash searches, click 'Share' in top right. On the right side it says 'Link' and there is a 'Short URL' button. Click the short url, then copy it and paste wherever [17:08:09] ottomata: +1 that all seems sensible to me [17:08:21] k [17:08:31] ah that worked thanks ebernhardson [17:09:40] (03CR) 10Greg Grossmeier: [C: 031] "Thanks Jaime!"
[puppet] - 10https://gerrit.wikimedia.org/r/475986 (https://phabricator.wikimedia.org/T210027) (owner: 10Jcrespo) [17:09:49] (03PS1) 10Ottomata: Only rename event.message and recvFrom for eventlogging_EventError logstash [puppet] - 10https://gerrit.wikimedia.org/r/476057 (https://phabricator.wikimedia.org/T205437) [17:10:22] (03PS5) 10Jcrespo: admin: Add Jeena Huneidi access to the production cluster [puppet] - 10https://gerrit.wikimedia.org/r/475986 (https://phabricator.wikimedia.org/T210027) [17:11:33] * godog mumbles something about wishing to have tests for logstash configuration [17:12:01] (03CR) 10Jcrespo: [C: 032] admin: Add Jeena Huneidi access to the production cluster [puppet] - 10https://gerrit.wikimedia.org/r/475986 (https://phabricator.wikimedia.org/T210027) (owner: 10Jcrespo) [17:12:28] phuedx: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/476057/ [17:12:32] (03CR) 10Phuedx: [C: 031] Only rename event.message and recvFrom for eventlogging_EventError logstash [puppet] - 10https://gerrit.wikimedia.org/r/476057 (https://phabricator.wikimedia.org/T205437) (owner: 10Ottomata) [17:12:50] (03CR) 10Ottomata: [C: 032] Only rename event.message and recvFrom for eventlogging_EventError logstash [puppet] - 10https://gerrit.wikimedia.org/r/476057 (https://phabricator.wikimedia.org/T205437) (owner: 10Ottomata) [17:12:54] (03PS2) 10Ottomata: Only rename event.message and recvFrom for eventlogging_EventError logstash [puppet] - 10https://gerrit.wikimedia.org/r/476057 (https://phabricator.wikimedia.org/T205437) [17:12:54] (03CR) 10Ottomata: [V: 032 C: 032] Only rename event.message and recvFrom for eventlogging_EventError logstash [puppet] - 10https://gerrit.wikimedia.org/r/476057 (https://phabricator.wikimedia.org/T205437) (owner: 10Ottomata) [17:13:21] (03CR) 10Phuedx: [C: 031] "Everything else that we /might/ need when looking at EL validation errors in logstash are already encoded in the event." 
[puppet] - 10https://gerrit.wikimedia.org/r/476057 (https://phabricator.wikimedia.org/T205437) (owner: 10Ottomata) [17:13:28] (03PS6) 10Jcrespo: admin: Add Jeena Huneidi access to the production cluster [puppet] - 10https://gerrit.wikimedia.org/r/475986 (https://phabricator.wikimedia.org/T210027) [17:13:40] ottomata: i'm gonna resolve https://phabricator.wikimedia.org/T205437 [17:13:43] come, on ottomata that is cheating :-) [17:13:44] thanks for your help :) [17:14:12] jynus: what is CHeating?! [17:14:26] the rebase + verify ? :p [17:14:28] * jynus notices I was first in the deploying queue [17:14:31] ohhh [17:14:33] haha [17:14:36] i'm in and out tho [17:14:36] just kidding [17:14:37] proceed! [17:15:15] this is great, thanks phuedx for looking into it [17:15:58] ottomata: massive +1 and thanks for helping finish it off. i would never have spotted the wrong kafka broker problem [17:16:09] it's cleaner now too :) [17:16:18] indeed! [17:16:22] https://logstash.wikimedia.org/goto/bda91f37481ae4970ee21e11810d49d3 [17:16:23] :) [17:16:29] I can't access _security channel again. Can anybody invite me please.. [17:19:47] RECOVERY - Host ms-be2047.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.94 ms [17:21:29] (03PS1) 10Imarlier: config: move wgMFNoindexPages to InitialiseSettings-labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476060 (https://phabricator.wikimedia.org/T206497) [17:23:41] !log T207377 icinga downtime labnet1001 [17:23:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:44] T207377: Reboot WMCS servers for L1TF - https://phabricator.wikimedia.org/T207377 [17:24:09] ottomata: did you apply those changes locally on logstash1007 and they'll eventually be overridden on the puppet run? 
[17:25:20] !log T209517 icinga downtime labsdb1005 [17:25:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:23] T209517: Upgrade/reboot labsdb* servers - https://phabricator.wikimedia.org/T209517 [17:25:29] cc bstorm_ banyek ^^^ [17:25:45] Ah, thanks :) [17:25:54] * banyek braces himself [17:25:56] I was just going to get that [17:26:17] phuedx: i merged the puppet changes [17:26:19] downtime labsdb1004 mysql replication too, in case it alerts [17:26:23] you can abandon your change [17:26:40] (03PS1) 10Ayounsi: Revert "Disable traffic to ulsfo for providers maintenance" [dns] - 10https://gerrit.wikimedia.org/r/476062 [17:27:05] (03PS2) 10Ayounsi: Revert "Disable traffic to ulsfo for providers maintenance" [dns] - 10https://gerrit.wikimedia.org/r/476062 [17:27:14] 10Operations, 10DBA, 10JADE, 10TechCom-RFC, and 2 others: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10awight) >>! In T200297#4777873, @Halfak wrote: > The second bullet point ("filter RC for edits which are ORES-nondamaging and JA... [17:27:24] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: access request for Jeena Huneidi (deployment, conint-admins, contint-docker) - https://phabricator.wikimedia.org/T210027 (10jcrespo) ` Notice: /Stage[main]/Admin/Admin::Hashuser[jhuneidi]/Admin::User[jhuneidi]/User[jhuneidi]/ensure: created ` Please p...
[17:27:32] no monitoring of replication, as we don't want to get paged often due to user large transactions, so ignore my comment [17:27:42] but it is an ok advice in most cases [17:28:09] (03CR) 10BBlack: [C: 031] Revert "Disable traffic to ulsfo for providers maintenance" [dns] - 10https://gerrit.wikimedia.org/r/476062 (owner: 10Ayounsi) [17:28:47] Ok sure [17:30:10] (03CR) 10Ayounsi: [C: 032] Revert "Disable traffic to ulsfo for providers maintenance" [dns] - 10https://gerrit.wikimedia.org/r/476062 (owner: 10Ayounsi) [17:30:29] !log repool uslfo [17:30:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:49] !log T209517 icinga downtime labsdb1004 [17:30:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:52] T209517: Upgrade/reboot labsdb* servers - https://phabricator.wikimedia.org/T209517 [17:58:34] (03Merged) 10jenkins-bot: Initial commit of quotereviewer [software] - 10https://gerrit.wikimedia.org/r/475933 (owner: 10Faidon Liambotis) [17:58:40] Pchelolo: yes, an alarm in http codes is sufficient, we might want to be extra good and do a ratio so as not to have false alarms when volume increases like http 400s/http200s (makes sense?) we do not really need to do this now [17:59:19] 10Operations, 10monitoring: Icinga downtime script should fail on the passive hosts - https://phabricator.wikimedia.org/T210380 (10Dzahn) a:03Dzahn [18:00:06] cscott, arlolra, subbu, halfak, and Amir1: Your horoscope predicts another unfortunate Services – Graphoid / Parsoid / Citoid / ORES deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181127T1800). [18:00:30] halfak|Lunch: awight Is it okay if I deploy something for ores? 
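Editor's sketch, not part of the channel log: the ratio-based alert nuria suggests above (HTTP 400s relative to HTTP 200s, so that a plain rise in traffic does not trigger a false alarm) could look roughly like this. The function names and the `max_ratio` threshold of 0.01 are assumptions made for illustration only; they do not come from the discussion or from any existing Grafana or Icinga configuration.

```python
def error_ratio(count_400, count_200):
    """Ratio of HTTP 400 responses to HTTP 200 responses in one window."""
    if count_200 == 0:
        # No successful traffic: any 400 at all counts as an unbounded ratio.
        return float("inf") if count_400 > 0 else 0.0
    return count_400 / count_200


def should_alert(count_400, count_200, max_ratio=0.01):
    """Alert on the ratio rather than the raw count of 400s.

    max_ratio is a hypothetical threshold chosen for this example.
    """
    return error_ratio(count_400, count_200) > max_ratio


if __name__ == "__main__":
    # Same number of 400s, very different traffic volume.
    print(should_alert(50, 1_000))
    print(should_alert(50, 100_000))
```

As Pchelolo notes in the reply that follows, an alert on the raw count is enough when the target is zero 400 errors, so the ratio only matters for checks where some baseline of errors is expected.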
[18:01:13] nuria: this particular alert should not false-alarm cause we're actually aiming for 0 400 errors [18:02:18] (03PS2) 10Muehlenhoff: Remove Diamond on additional DB roles [puppet] - 10https://gerrit.wikimedia.org/r/476001 (https://phabricator.wikimedia.org/T183454) [18:02:29] (03PS4) 10Bstorm: toolforge: add qpdf, unpaper, and pngquant to exec nodes [puppet] - 10https://gerrit.wikimedia.org/r/475233 (https://phabricator.wikimedia.org/T204422) [18:03:56] !log thcipriani@deploy1001 rebuilt and synchronized wikiversions files: no op sync wikiversions to test scap 3.8.9-1 [18:03:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:02] (03CR) 10Bstorm: [C: 032] toolforge: add qpdf, unpaper, and pngquant to exec nodes [puppet] - 10https://gerrit.wikimedia.org/r/475233 (https://phabricator.wikimedia.org/T204422) (owner: 10Bstorm) [18:04:47] 10Operations, 10Analytics, 10EventBus, 10WMF-JobQueue, and 6 others: Kafka eqiad.mediawiki.page-delete topic is empty - https://phabricator.wikimedia.org/T210451 (10Smalyshev) Yep, seeing the events in grafana now, so I think it's all good now. Thanks! [18:05:00] (03CR) 10Muehlenhoff: [C: 031] "Looks good." [puppet] - 10https://gerrit.wikimedia.org/r/475984 (owner: 10Elukey) [18:06:28] 10Operations, 10Analytics, 10EventBus, 10WMF-JobQueue, and 6 others: Kafka eqiad.mediawiki.page-delete topic is empty - https://phabricator.wikimedia.org/T210451 (10Pchelolo) Do you need the events for the last month to be replayed? [18:06:42] godog: all looks good, thanks for the scap update! [18:07:02] thcipriani: np! 
[18:09:56] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 72.6 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6fullscreenorgId=1 [18:10:46] 10Operations, 10DBA, 10MediaWiki-Change-tagging, 10MW-1.33-notes (1.33.0-wmf.6; 2018-11-27), and 3 others: Migrate tag_summary usage to change_tag and drop the table - https://phabricator.wikimedia.org/T209525 (10Ladsgroup) a:03Ladsgroup [18:14:14] 10Operations, 10Analytics, 10EventBus, 10WMF-JobQueue, and 6 others: Kafka eqiad.mediawiki.page-delete topic is empty - https://phabricator.wikimedia.org/T210451 (10Smalyshev) @Pchelolo No I already updated the affected items manually. [18:15:21] greg-g: is there a task tracking the beta cluster instability? Also is it unbreak now (or at least high)? We've not got any integration test coverage on mobile and that's scaring us due to "read only" errors. [18:15:50] jdlrobson: where did you and Krenair get yesterday? [18:15:55] how far in the debugging? [18:16:04] last I saw he had some open questions still [18:16:14] over in -releng, that is (where beta cluster is on topic) [18:16:31] (and slightly less bot-filled ;) ) [18:17:40] nowhere :/ [18:17:48] ok heading over there [18:18:12] (03CR) 10Gehel: [C: 031] Absent Redis Diamond collector on Redis masters [puppet] - 10https://gerrit.wikimedia.org/r/475969 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [18:28:10] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: access request for Jeena Huneidi (deployment, conint-admins, contint-docker) - https://phabricator.wikimedia.org/T210027 (10jcrespo) [18:28:54] 10Operations, 10monitoring, 10Discovery-Search (Current work), 10Patch-For-Review: Create an Icinga check to alert on packet dropped - https://phabricator.wikimedia.org/T206114 (10Gehel) 05Open>03declined The numbers above seem to indicate that we don't have a good signal / noise ratio, so an icinga ch... 
[18:29:01] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: access request for Jeena Huneidi (deployment, conint-admins, contint-docker) - https://phabricator.wikimedia.org/T210027 (10jcrespo) 05Open>03Resolved Access request was tested, it was possible to loging to a bastion host, deployment host and to gr... [18:32:25] hi akosiaris [18:32:56] I'm a bit lost, which task are you talkig about? [18:33:54] (03PS1) 10Bstorm: wiki replicas: depool labsdb1009 for upgrades [puppet] - 10https://gerrit.wikimedia.org/r/476079 (https://phabricator.wikimedia.org/T209517) [18:43:30] (03CR) 10Gehel: [C: 032] elasticsearch_cluster: Added multi-cluster/multi-instance support [software/spicerack] - 10https://gerrit.wikimedia.org/r/468558 (https://phabricator.wikimedia.org/T207918) (owner: 10Mathew.onipe) [18:43:58] (03PS1) 10Mforns: Add analytics reportupdater job for language cx [puppet] - 10https://gerrit.wikimedia.org/r/476081 (https://phabricator.wikimedia.org/T189475) [18:46:12] (03CR) 10Ottomata: [C: 032] Add analytics reportupdater job for language cx [puppet] - 10https://gerrit.wikimedia.org/r/476081 (https://phabricator.wikimedia.org/T189475) (owner: 10Mforns) [18:46:21] 10Operations, 10DBA, 10JADE, 10TechCom-RFC, and 2 others: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10daniel) >>! In T200297#4778378, @awight wrote: >>>! In T200297#4777873, @Halfak wrote: > Harej and I chatted about this yesterda... 
[18:46:57] 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown, 10Patch-For-Review, 10Wikimedia-production-error: wmf.14 failing to execute ThumbnailRender jobs "error: ThumbnailRenderJob::run: HTTP request failure" - https://phabricator.wikimedia.org/T200346 (10Imarlier) [18:47:17] (03PS1) 10Thcipriani: Scap: upgrade to 3.8.10-1 [puppet] - 10https://gerrit.wikimedia.org/r/476082 (https://phabricator.wikimedia.org/T210469) [18:48:38] (03PS1) 10BryanDavis: cloud vps: google-api-proxy: Update allowed CIDR ranges [puppet] - 10https://gerrit.wikimedia.org/r/476083 [18:49:31] (03CR) 10jerkins-bot: [V: 04-1] cloud vps: google-api-proxy: Update allowed CIDR ranges [puppet] - 10https://gerrit.wikimedia.org/r/476083 (owner: 10BryanDavis) [18:50:30] 10Operations, 10Release-Engineering-Team, 10Scap, 10Patch-For-Review: Update Debian Package for Scap to 3.8.9-1 - https://phabricator.wikimedia.org/T210469 (10thcipriani) I checked the command run by sync-wikiversions after the package was updated, and that looked correct; however, I now realize that the s... [18:50:47] (03PS2) 10BryanDavis: cloud vps: google-api-proxy: Update allowed CIDR ranges [puppet] - 10https://gerrit.wikimedia.org/r/476083 [18:51:51] 10Operations, 10Release-Engineering-Team, 10Scap, 10Patch-For-Review: Update Debian Package for Scap to 3.8.10-1 - https://phabricator.wikimedia.org/T210469 (10thcipriani) [18:52:00] (03PS3) 10Andrew Bogott: cloud vps: google-api-proxy: Update allowed CIDR ranges [puppet] - 10https://gerrit.wikimedia.org/r/476083 (owner: 10BryanDavis) [18:52:38] (03CR) 10Andrew Bogott: [C: 032] cloud vps: google-api-proxy: Update allowed CIDR ranges [puppet] - 10https://gerrit.wikimedia.org/r/476083 (owner: 10BryanDavis) [18:54:36] (03CR) 10Ottomata: [C: 032] Add analytics to EventBus grafana alerts contact group. 
[puppet] - 10https://gerrit.wikimedia.org/r/476070 (https://phabricator.wikimedia.org/T210031) (owner: 10Ppchelko) [18:54:43] (03PS2) 10Ottomata: Add analytics to EventBus grafana alerts contact group. [puppet] - 10https://gerrit.wikimedia.org/r/476070 (https://phabricator.wikimedia.org/T210031) (owner: 10Ppchelko) [18:54:49] (03CR) 10Ottomata: [V: 032 C: 032] Add analytics to EventBus grafana alerts contact group. [puppet] - 10https://gerrit.wikimedia.org/r/476070 (https://phabricator.wikimedia.org/T210031) (owner: 10Ppchelko) [18:58:37] (03PS1) 10Elukey: matomo: disable unused features that cause load timeouts [puppet] - 10https://gerrit.wikimedia.org/r/476086 [18:59:14] (03CR) 10Elukey: [C: 032] matomo: disable unused features that cause load timeouts [puppet] - 10https://gerrit.wikimedia.org/r/476086 (owner: 10Elukey) [19:00:26] * legoktm hugs no_justification [19:05:49] (03PS1) 10Dzahn: icinga: if on passive host, replace icinga-downtime script with warning [puppet] - 10https://gerrit.wikimedia.org/r/476089 (https://phabricator.wikimedia.org/T210380) [19:07:26] (03PS2) 10Dzahn: icinga: if on passive host, replace icinga-downtime script with warning [puppet] - 10https://gerrit.wikimedia.org/r/476089 (https://phabricator.wikimedia.org/T210380) [19:10:14] (03CR) 10Brian Wolff: "Hi," [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475947 (owner: 10Chad) [19:10:15] 10Operations, 10Release-Engineering-Team, 10Scap: mwdebug1001 and mwdebug1002 are reliably the last two hosts to finish scap-cdb-rebuild - https://phabricator.wikimedia.org/T203625 (10greg) random question from from this task and T203664: should the debug hosts also be the same hardware/setup as the main clu... 
[19:14:29] 10Operations, 10Scap, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): Update Debian Package for Scap to 3.8.10-1 - https://phabricator.wikimedia.org/T210469 (10greg) [19:15:38] 10Operations, 10Release Pipeline, 10Release-Engineering-Team (Backlog): Design pipeline image versioning scheme - https://phabricator.wikimedia.org/T209088 (10greg) [19:21:19] 10Operations, 10Wikimedia-Mailing-lists: Wikimedia-GIN - https://phabricator.wikimedia.org/T210299 (10Aboubacarkhoraa) D'accord @jcrespo L'essentiel pour moi qu'ont est une liste de diffusion pour notre groupe d'utilisateur wikimedia Guinée Conakry. [19:23:12] 10Operations, 10Release-Engineering-Team, 10Scap: mwdebug1001 and mwdebug1002 are reliably the last two hosts to finish scap-cdb-rebuild - https://phabricator.wikimedia.org/T203625 (10Legoktm) Yeah, that seems sensible unless there's some significant reason (e.g. hardware cost) not to. [19:23:17] 10Operations, 10Traffic, 10Wikimedia-Incident: Power incident in eqsin - https://phabricator.wikimedia.org/T206861 (10greg) Just checking: this task is in the "active situation" column of the #wikimedia-incident project and has been open for a while. I see there are sub-tasks that look like follow-ups. Shoul... [19:25:28] 10Operations, 10Goal, 10Patch-For-Review: Perform a datacenter switchover (2018-19 Q1) - https://phabricator.wikimedia.org/T199073 (10greg) [19:27:39] 10Operations, 10Traffic, 10Wikimedia-Incident: Power incident in eqsin - https://phabricator.wikimedia.org/T206861 (10BBlack) Seems reasonable to close this; the event itself is long over. There are still risks present for a followup event, but if we close up all the actionables that goes away eventually.... 
[19:28:26] 10Operations, 10Traffic, 10Wikimedia-Incident: Puppet doesn't restart ferm on failure - https://phabricator.wikimedia.org/T206951 (10greg) [19:29:16] 10Operations, 10Traffic, 10Wikimedia-Incident: Power incident in eqsin - https://phabricator.wikimedia.org/T206861 (10greg) 05Open>03Resolved a:03ayounsi Done, thanks! [19:31:27] 10Operations, 10DBA, 10JADE, 10TechCom-RFC, and 2 others: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10awight) >>! In T200297#4778731, @daniel wrote: >>>! In T200297#4778378, @awight wrote: >>>>! In T200297#4777873, @Halfak wrote:... [19:33:35] (03CR) 10Cwhite: [C: 032] Scap: upgrade to 3.8.10-1 [puppet] - 10https://gerrit.wikimedia.org/r/476082 (https://phabricator.wikimedia.org/T210469) (owner: 10Thcipriani) [19:33:44] (03PS2) 10Cwhite: Scap: upgrade to 3.8.10-1 [puppet] - 10https://gerrit.wikimedia.org/r/476082 (https://phabricator.wikimedia.org/T210469) (owner: 10Thcipriani) [19:35:59] 10Operations, 10DBA, 10JADE, 10TechCom-RFC, and 2 others: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10daniel) > I think we can support filtering by adding an index on the summary data? I was planning to do this unless there's a te... [19:36:48] 10Operations, 10ops-codfw, 10Patch-For-Review, 10Services (watching): rack/setup/install restbase201[3-8].codfw.wmnet - https://phabricator.wikimedia.org/T209615 (10Papaul) papaul@asw-b-codfw> show interfaces ge-5/0/5 descriptions Interface Admin Link Description ge-5/0/5 up up restbase2... 
[19:40:13] !log thcipriani@deploy1001 rebuilt and synchronized wikiversions files: no op sync wikiversions to test scap 3.8.10-1 [19:40:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:46] (03CR) 10Dzahn: [C: 04-1] "78 and 79 seem to be used twice in:" [dns] - 10https://gerrit.wikimedia.org/r/475939 (https://phabricator.wikimedia.org/T209615) (owner: 10Papaul) [19:44:53] 10Operations, 10DBA, 10JADE, 10TechCom-RFC, and 2 others: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10awight) >>! In T200297#4778966, @daniel wrote: >> I think we can support filtering by adding an index on the summary data? I was... [19:47:16] (03CR) 10Dzahn: "i noticed the first machine has a different MAC vendor prefix from all others. the common one is known as Dell, the other one is unknown, " [puppet] - 10https://gerrit.wikimedia.org/r/476038 (https://phabricator.wikimedia.org/T209615) (owner: 10Papaul) [19:47:35] 10Operations, 10Scap, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): Update Debian Package for Scap to 3.8.10-1 - https://phabricator.wikimedia.org/T210469 (10thcipriani) 05Open>03Resolved a:05thcipriani>03colewhite thanks @fgiunchedi for the initial 3.8.9-1 release (sorry for missing ini... [19:48:58] (03CR) 10Dzahn: [C: 032] PARTMAN: Add restbase201[3-8] [puppet] - 10https://gerrit.wikimedia.org/r/476040 (https://phabricator.wikimedia.org/T209615) (owner: 10Papaul) [19:49:08] (03PS2) 10Dzahn: PARTMAN: Add restbase201[3-8] [puppet] - 10https://gerrit.wikimedia.org/r/476040 (https://phabricator.wikimedia.org/T209615) (owner: 10Papaul) [19:51:57] 10Operations, 10Analytics, 10Performance-Team, 10Traffic: Only serve debug HTTP headers when x-wikimedia-debug is present - https://phabricator.wikimedia.org/T210484 (10Krinkle) See also T194814, which this task could resolve. > x-analytics As I understand this, this field mainly exists to transmit data... 
[19:52:17] 10Operations, 10Traffic, 10media-storage, 10Performance-Team (Radar): Reduce amount of headers sent from web responses - https://phabricator.wikimedia.org/T194814 (10Krinkle) [19:56:56] (03PS2) 10Papaul: DNS: Add mgmt and production DNS entries for restbase201[3-8] [dns] - 10https://gerrit.wikimedia.org/r/475939 (https://phabricator.wikimedia.org/T209615) [20:00:05] Deploy window MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181127T2000) [20:06:43] 10Operations, 10ops-codfw, 10Core Platform Team (Session Management Service (CDP2)), 10Core Platform Team Backlog (Watching / External), and 2 others: rack/setup/install sessionstore200[123].codfw.wmnet - https://phabricator.wikimedia.org/T209389 (10Papaul) [20:12:15] (03PS1) 10Dzahn: create profile::article_recommender [puppet] - 10https://gerrit.wikimedia.org/r/476098 (https://phabricator.wikimedia.org/T208622) [20:12:56] (03PS3) 10Dzahn: DNS: Add mgmt and production DNS entries for restbase201[3-8] [dns] - 10https://gerrit.wikimedia.org/r/475939 (https://phabricator.wikimedia.org/T209615) (owner: 10Papaul) [20:13:06] (03CR) 10jerkins-bot: [V: 04-1] create profile::article_recommender [puppet] - 10https://gerrit.wikimedia.org/r/476098 (https://phabricator.wikimedia.org/T208622) (owner: 10Dzahn) [20:14:05] (03PS2) 10Dzahn: create profile::article_recommender [puppet] - 10https://gerrit.wikimedia.org/r/476098 (https://phabricator.wikimedia.org/T208622) [20:16:32] (03CR) 10Dzahn: [C: 032] DNS: Add mgmt and production DNS entries for restbase201[3-8] [dns] - 10https://gerrit.wikimedia.org/r/475939 (https://phabricator.wikimedia.org/T209615) (owner: 10Papaul) [20:21:22] (03PS3) 10Dzahn: create profile::article_recommender [puppet] - 10https://gerrit.wikimedia.org/r/476098 (https://phabricator.wikimedia.org/T208622) [20:22:11] (03CR) 10jerkins-bot: [V: 04-1] create profile::article_recommender [puppet] - 10https://gerrit.wikimedia.org/r/476098 
(https://phabricator.wikimedia.org/T208622) (owner: 10Dzahn) [20:29:50] (03PS6) 10Cwhite: initial commit [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/471298 (https://phabricator.wikimedia.org/T208066) [20:30:49] (03PS2) 10Dzahn: DHCP: Add MAC address entries for restbase201[3-8] [puppet] - 10https://gerrit.wikimedia.org/r/476038 (https://phabricator.wikimedia.org/T209615) (owner: 10Papaul) [20:31:06] (03CR) 10Dzahn: [C: 032] DHCP: Add MAC address entries for restbase201[3-8] [puppet] - 10https://gerrit.wikimedia.org/r/476038 (https://phabricator.wikimedia.org/T209615) (owner: 10Papaul) [20:32:06] (03CR) 10Cwhite: [C: 031] Remove Diamond on additional DB roles [puppet] - 10https://gerrit.wikimedia.org/r/476001 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [20:35:31] (03PS4) 10Dzahn: create profile::article_recommender [puppet] - 10https://gerrit.wikimedia.org/r/476098 (https://phabricator.wikimedia.org/T208622) [20:56:47] 10Operations, 10Analytics, 10Performance-Team, 10Traffic: Only serve debug HTTP headers when x-wikimedia-debug is present - https://phabricator.wikimedia.org/T210484 (10TheDJ) what about ?debug=true ? We already vary on that right ? might as well vary which set of headers is let true... [20:56:55] 10Operations, 10ops-codfw, 10Patch-For-Review, 10Services (watching): rack/setup/install restbase201[3-8].codfw.wmnet - https://phabricator.wikimedia.org/T209615 (10Papaul) [21:02:16] 10Operations, 10Release-Engineering-Team, 10Scap: mwdebug1001 and mwdebug1002 are reliably the last two hosts to finish scap-cdb-rebuild - https://phabricator.wikimedia.org/T203625 (10greg) >>! In T203625#4778902, @Legoktm wrote: >>>! In T203625#4778825, @greg wrote: >> random question from from this task an... 
[21:04:58] (03PS1) 10GTirloni: openstack: Move Keystone DB credentials to my.cnf file [puppet] - 10https://gerrit.wikimedia.org/r/476109 (https://phabricator.wikimedia.org/T210404) [21:05:14] (03CR) 10jerkins-bot: [V: 04-1] openstack: Move Keystone DB credentials to my.cnf file [puppet] - 10https://gerrit.wikimedia.org/r/476109 (https://phabricator.wikimedia.org/T210404) (owner: 10GTirloni) [21:07:46] (03PS2) 10GTirloni: openstack: Move Keystone DB credentials to my.cnf file [puppet] - 10https://gerrit.wikimedia.org/r/476109 (https://phabricator.wikimedia.org/T210404) [21:11:16] (03CR) 10SBassett: [C: 031] "Looks sane and would be helpful in analyzing recent incidents. Haven't tested locally, but appears to function nearly identically to the " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467110 (owner: 10Gergő Tisza) [21:14:00] 10Operations, 10Continuous-Integration-Infrastructure (shipyard), 10Patch-For-Review, 10Release-Engineering-Team (Kanban), 10cloud-services-team (Kanban): Phase out Nodepool from production - https://phabricator.wikimedia.org/T209361 (10greg) p:05Triage>03Normal [21:14:47] (03PS1) 10Thcipriani: Gerrit: Fix placement footer in edit view [puppet] - 10https://gerrit.wikimedia.org/r/476128 [21:16:42] (03CR) 10Dzahn: [C: 04-2] "these packages are already installed from analytics/cluster/packages/common.pp or other. 
so this is not needed, but i will recycle this ch" [puppet] - 10https://gerrit.wikimedia.org/r/476098 (https://phabricator.wikimedia.org/T208622) (owner: 10Dzahn) [21:18:21] (03CR) 10Dzahn: [C: 031] "i haven't actually reviewed the CSS but thank you for fixing this :)" [puppet] - 10https://gerrit.wikimedia.org/r/476128 (owner: 10Thcipriani) [21:19:28] (03CR) 10Paladox: [C: 031] Gerrit: Fix placement footer in edit view [puppet] - 10https://gerrit.wikimedia.org/r/476128 (owner: 10Thcipriani) [21:20:29] (03CR) 10Dzahn: [C: 032] Gerrit: Fix placement footer in edit view [puppet] - 10https://gerrit.wikimedia.org/r/476128 (owner: 10Thcipriani) [21:26:39] (03PS1) 10EBernhardson: Start wbsearchentities AB test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476141 (https://phabricator.wikimedia.org/T209402) [21:27:26] (03CR) 10jerkins-bot: [V: 04-1] Start wbsearchentities AB test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476141 (https://phabricator.wikimedia.org/T209402) (owner: 10EBernhardson) [21:28:21] (03PS1) 10Paladox: Gerrit: Fix footer conflicting with inline editor [puppet] - 10https://gerrit.wikimedia.org/r/476142 [21:28:46] (03PS2) 10Paladox: Gerrit: Fix footer conflicting with inline editor [puppet] - 10https://gerrit.wikimedia.org/r/476142 [21:32:09] 10Operations, 10Research, 10Patch-For-Review: Import recommendations into production database - https://phabricator.wikimedia.org/T208622 (10Dzahn) summary from IRC: - 2 new Gerrit repos have been requested, one regular one as deploy repo, code will move from github over there - instead of using pip Debian... 
[21:32:21] (03PS1) 10Thcipriani: Gerrit: use stronger selector for header height [puppet] - 10https://gerrit.wikimedia.org/r/476144 [21:33:38] 10Operations, 10DBA, 10JADE, 10TechCom-RFC, and 2 others: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10daniel) > There's no prefix matching though, these are two tinyint fields holding a boolean each. Ah right, summaries are to b... [21:33:52] (03CR) 10EBernhardson: [C: 031] "now that i think about it, what happened here is labswiki was running zend php, and everything else was running hhvm. The result was labsw" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475745 (https://phabricator.wikimedia.org/T198352) (owner: 10DCausse) [21:37:20] (03CR) 10Paladox: [C: 031] "Tested locally and works!" [puppet] - 10https://gerrit.wikimedia.org/r/476144 (owner: 10Thcipriani) [21:37:33] mutante thcipriani ^^ [21:37:37] tested locally and works! [21:38:01] (03Abandoned) 10Paladox: Gerrit: Fix footer conflicting with inline editor [puppet] - 10https://gerrit.wikimedia.org/r/476142 (owner: 10Paladox) [21:38:03] paladox: cant confirm though [21:38:14] mutante i can on https://gerrit.git.wmflabs.org/r/#/c/gci/+/2091/ [21:38:14] (03PS2) 10EBernhardson: Start wbsearchentities AB test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476141 (https://phabricator.wikimedia.org/T209402) [21:38:19] where i deployed the change! [21:38:56] (03CR) 10Dzahn: [C: 032] Gerrit: use stronger selector for header height [puppet] - 10https://gerrit.wikimedia.org/r/476144 (owner: 10Thcipriani) [21:39:07] i did as well.. but here we go [21:40:27] paladox: puppet was noop because i had already manually done it [21:40:34] heh [21:40:49] (03CR) 10EBernhardson: [C: 031] "LGTM. 
I've test-imported the remaining wikis into the new clusters and this set of "large" wikis results in expected data sizes on the new" (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/475746 (https://phabricator.wikimedia.org/T210381) (owner: DCausse) [21:41:37] paladox: so.. is it fixed for you in cobalt? [21:41:52] mutante yup [21:41:57] you need to clear your cache though [21:41:58] eh, ok :) [21:42:13] i used another change than before to avoid that. ok [21:44:03] yea, confirmed :) [21:45:57] (CR) EBernhardson: [C: 1] "looks reasonable, minor snake_case comment" (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/475747 (https://phabricator.wikimedia.org/T210381) (owner: DCausse) [21:46:45] (CR) EBernhardson: [C: 1] "we probably should have done this some time ago." [mediawiki-config] - https://gerrit.wikimedia.org/r/475748 (https://phabricator.wikimedia.org/T210381) (owner: DCausse) [21:55:43] (PS3) Dzahn: icinga: if on passive host, replace icinga-downtime script with warning [puppet] - https://gerrit.wikimedia.org/r/476089 (https://phabricator.wikimedia.org/T210380) [21:56:39] (CR) jerkins-bot: [V: -1] icinga: if on passive host, replace icinga-downtime script with warning [puppet] - https://gerrit.wikimedia.org/r/476089 (https://phabricator.wikimedia.org/T210380) (owner: Dzahn) [22:01:26] (PS4) Dzahn: icinga: if on passive host, replace icinga-downtime script with warning [puppet] - https://gerrit.wikimedia.org/r/476089 (https://phabricator.wikimedia.org/T210380) [22:02:43] Operations, DBA, JADE, TechCom-RFC, and 2 others: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (awight) Thanks, this has been a helpful tangent! >>! In T200297#4779339, @daniel wrote: >> If you feel that it's very likely pe...
[22:03:11] (CR) Dzahn: [C: 2] icinga: if on passive host, replace icinga-downtime script with warning [puppet] - https://gerrit.wikimedia.org/r/476089 (https://phabricator.wikimedia.org/T210380) (owner: Dzahn) [22:03:22] (PS5) Dzahn: icinga: if on passive host, replace icinga-downtime script with warning [puppet] - https://gerrit.wikimedia.org/r/476089 (https://phabricator.wikimedia.org/T210380) [22:11:12] Operations, DBA, JADE, TechCom-RFC, and 2 others: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (daniel) > There should be no impact on revision pager queries until we have UI to filter on the new index. The danger with this... [22:12:38] (PS1) Dzahn: icinga: fix syntax error in erb template for downtime script [puppet] - https://gerrit.wikimedia.org/r/476156 (https://phabricator.wikimedia.org/T210380) [22:13:28] (PS2) Dzahn: icinga: fix syntax error in erb template for downtime script [puppet] - https://gerrit.wikimedia.org/r/476156 (https://phabricator.wikimedia.org/T210380) [22:16:32] Operations, ops-codfw, Patch-For-Review, Services (watching): rack/setup/install restbase201[3-8].codfw.wmnet - https://phabricator.wikimedia.org/T209615 (Papaul) [22:16:46] (CR) Dzahn: [C: 2] icinga: fix syntax error in erb template for downtime script [puppet] - https://gerrit.wikimedia.org/r/476156 (https://phabricator.wikimedia.org/T210380) (owner: Dzahn) [22:22:30] Operations, Electron-PDFs, Proton, Epic, and 4 others: [EPIC] New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748 (Jdlrobson) @pmiazga anything left? Assuming we're no longer blocked?
[22:23:32] Operations, monitoring, Patch-For-Review: Icinga downtime script should fail on the passive hosts - https://phabricator.wikimedia.org/T210380 (Dzahn) @Volans After the changes above, now: [icinga1001:~] $ sudo icinga-downtime Usage: /usr/local/bin/icinga-downtime -h -d ... [22:27:51] Operations, Core Platform Team Backlog (Watching / External), HHVM, Patch-For-Review, and 2 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (BPirkle) [22:30:43] (CR) EBernhardson: [cirrus] prepare multi-instance services (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/475749 (https://phabricator.wikimedia.org/T210381) (owner: DCausse) [22:48:04] (CR) EBernhardson: [cirrus] prepare multi-instance services (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/475749 (https://phabricator.wikimedia.org/T210381) (owner: DCausse) [22:51:35] (PS1) Dzahn: icinga: on passive host, display a warning in the motd [puppet] - https://gerrit.wikimedia.org/r/476171 (https://phabricator.wikimedia.org/T210380) [22:52:10] (CR) jerkins-bot: [V: -1] icinga: on passive host, display a warning in the motd [puppet] - https://gerrit.wikimedia.org/r/476171 (https://phabricator.wikimedia.org/T210380) (owner: Dzahn) [22:52:38] (PS2) Dzahn: icinga: on passive host, display a warning in the motd [puppet] - https://gerrit.wikimedia.org/r/476171 (https://phabricator.wikimedia.org/T210380) [22:56:27] (CR) Dzahn: [C: 2] "https://puppet-compiler.wmflabs.org/compiler1002/13742/icinga1001.wikimedia.org/change.icinga1001.wikimedia.org.pson" [puppet] - https://gerrit.wikimedia.org/r/476171 (https://phabricator.wikimedia.org/T210380) (owner: Dzahn) [23:00:27] Operations, monitoring, Patch-For-Review: Icinga downtime script should fail on the passive hosts - https://phabricator.wikimedia.org/T210380 (Dzahn) Open>Resolved @Volans and in addition to the script itself, i
also added the big warning MOTD like on deployment or maintenance servers Notice... [23:15:23] Operations, ops-codfw, netops: codfw row D recable and add QFX - https://phabricator.wikimedia.org/T210467 (Gehel) > @Gehel for wdqs2006 Depooling and downtime in Icinga should be good enough. There should be no user traffic on this server and updater will catch up on lag once connectivity is restored. [23:32:20] PROBLEM - puppet last run on icinga1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:32:48] (PS1) Dzahn: icinga: on passive host, icinga-downtime exit with code 127 [puppet] - https://gerrit.wikimedia.org/r/476178 (https://phabricator.wikimedia.org/T210380) [23:33:24] (CR) jerkins-bot: [V: -1] icinga: on passive host, icinga-downtime exit with code 127 [puppet] - https://gerrit.wikimedia.org/r/476178 (https://phabricator.wikimedia.org/T210380) (owner: Dzahn) [23:33:44] (PS2) Dzahn: icinga: on passive host, icinga-downtime exit with code 127 [puppet] - https://gerrit.wikimedia.org/r/476178 (https://phabricator.wikimedia.org/T210380) [23:34:42] (PS3) Dzahn: icinga: on passive host, icinga-downtime exit with code 127 [puppet] - https://gerrit.wikimedia.org/r/476178 (https://phabricator.wikimedia.org/T210380) [23:36:34] (CR) Dzahn: [C: 2] icinga: on passive host, icinga-downtime exit with code 127 [puppet] - https://gerrit.wikimedia.org/r/476178 (https://phabricator.wikimedia.org/T210380) (owner: Dzahn) [23:37:30] RECOVERY - puppet last run on icinga1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [23:38:37] Operations, monitoring, Patch-For-Review: Icinga downtime script should fail on the passive hosts - https://phabricator.wikimedia.org/T210380 (Dzahn) Made it exit with non-zero exit code as well to take the ticket literal "script should fail". It does now.
[23:42:06] PROBLEM - HHVM rendering on mw1230 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:43:18] RECOVERY - HHVM rendering on mw1230 is OK: HTTP OK: HTTP/1.1 200 OK - 74114 bytes in 8.242 second response time