[00:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I, the Bot under the Fountain, allow thee, The Deployer, to do Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181108T0000).
[00:00:04] MaxSem and James_F: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:00:09] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T208081 Disable the Petition extension in production (duration: 00m 52s)
[00:00:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:00:11] T208081: Undeploy Petition extension - https://phabricator.wikimedia.org/T208081
[00:01:05] MaxSem: Ready?
[00:01:08] uhu
[00:01:21] (CR) Jforrester: [C: 2] Enable wgMediaInTargetLanguage on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/472091 (https://phabricator.wikimedia.org/T208899) (owner: MaxSem)
[00:02:32] (PS1) Niedzielski: Enable Wikibase PageRandomLookup unexpected page_random value logging [mediawiki-config] - https://gerrit.wikimedia.org/r/472356 (https://phabricator.wikimedia.org/T208796)
[00:02:47] (PS2) Jforrester: Enable wgMediaInTargetLanguage on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/472091 (https://phabricator.wikimedia.org/T208899) (owner: MaxSem)
[00:02:52] (CR) Jforrester: [C: 2] "…" [mediawiki-config] - https://gerrit.wikimedia.org/r/472091 (https://phabricator.wikimedia.org/T208899) (owner: MaxSem)
[00:03:17] (PS2) Jforrester: Drop the Petition extension: Part IV - Drop from CommonSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/472216 (https://phabricator.wikimedia.org/T208081)
[00:03:23] (CR) Jforrester: [C: 2] Drop the Petition extension: Part IV - Drop from CommonSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/472216 (https://phabricator.wikimedia.org/T208081) (owner: Jforrester)
[00:04:16] (Merged) jenkins-bot: Enable wgMediaInTargetLanguage on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/472091 (https://phabricator.wikimedia.org/T208899) (owner: MaxSem)
[00:04:21] Yay.
[00:04:22] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 8.398 second response time
[00:04:33] (Merged) jenkins-bot: Drop the Petition extension: Part IV - Drop from CommonSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/472216 (https://phabricator.wikimedia.org/T208081) (owner: Jforrester)
[00:04:43] MaxSem: Live on mwdebug1002, please test.
[00:05:02] (PS2) Jforrester: Drop the Petition extension: Part V - Drop from InitSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/472217 (https://phabricator.wikimedia.org/T208081)
[00:05:11] (CR) Jforrester: [C: 2] Drop the Petition extension: Part V - Drop from InitSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/472217 (https://phabricator.wikimedia.org/T208081) (owner: Jforrester)
[00:05:20] (PS2) Jforrester: Drop the Petition extension: Part VI - Drop i18n load [mediawiki-config] - https://gerrit.wikimedia.org/r/472218 (https://phabricator.wikimedia.org/T208081)
[00:05:25] (CR) Jforrester: [C: 2] Drop the Petition extension: Part VI - Drop i18n load [mediawiki-config] - https://gerrit.wikimedia.org/r/472218 (https://phabricator.wikimedia.org/T208081) (owner: Jforrester)
[00:06:39] (Merged) jenkins-bot: Drop the Petition extension: Part V - Drop from InitSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/472217 (https://phabricator.wikimedia.org/T208081) (owner: Jforrester)
[00:06:42] (Merged) jenkins-bot: Drop the Petition extension: Part VI - Drop i18n load [mediawiki-config] - https://gerrit.wikimedia.org/r/472218 (https://phabricator.wikimedia.org/T208081) (owner: Jforrester)
[00:07:42] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:07:53] (PS1) Dzahn: monitoring: add systemd path in .fixtures.yml for failing tests [puppet] - https://gerrit.wikimedia.org/r/472357
[00:08:23] RoanKattouw: Sorry the merge is taking so long.
[00:08:42] (CR) jerkins-bot: [V: -1] monitoring: add systemd path in .fixtures.yml for failing tests [puppet] - https://gerrit.wikimedia.org/r/472357 (owner: Dzahn)
[00:09:51] (PS1) Jforrester: [GovernanceWiki] Enable BotPasswords, take two. [mediawiki-config] - https://gerrit.wikimedia.org/r/472358 (https://phabricator.wikimedia.org/T205368)
[00:09:54] (CR) Dzahn: [C: 2] "this should hopefully fix tests on unrelated changes like Change-Id" [puppet] - https://gerrit.wikimedia.org/r/472357 (owner: Dzahn)
[00:10:43] (CR) Dzahn: [C: 2] "like https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/472033/ but for the monitoring module vs. nagios_common" [puppet] - https://gerrit.wikimedia.org/r/472357 (owner: Dzahn)
[00:10:44] James_F: WFM
[00:11:03] MaxSem: Cool, syncing.
[00:11:05] (CR) Cwhite: icinga: fix path to retention.dat file on stretch (1 comment) [puppet] - https://gerrit.wikimedia.org/r/472352 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn)
[00:12:08] (CR) Dzahn: [V: 2 C: 2] "removing -1 due to " Failure/Error: require ::systemd" because that's also what this should fix :)" [puppet] - https://gerrit.wikimedia.org/r/472357 (owner: Dzahn)
[00:12:17] (CR) CDanis: [V: 1 C: 1] monitoring: add systemd path in .fixtures.yml for failing tests [puppet] - https://gerrit.wikimedia.org/r/472357 (owner: Dzahn)
[00:12:27] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T208899 Enabling wgMediaInTargetLanguage for testwiki (duration: 00m 54s)
[00:12:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:12:29] T208899: Rollout SVGs in page language - https://phabricator.wikimedia.org/T208899
[00:12:31] (CR) jenkins-bot: Enable wgMediaInTargetLanguage on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/472091 (https://phabricator.wikimedia.org/T208899) (owner: MaxSem)
[00:12:33] (CR) jenkins-bot: Drop the Petition extension: Part IV - Drop from CommonSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/472216 (https://phabricator.wikimedia.org/T208081) (owner: Jforrester)
[00:12:35] (CR) jenkins-bot: Drop the Petition extension: Part V - Drop from InitSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/472217 (https://phabricator.wikimedia.org/T208081) (owner: Jforrester)
[00:12:37] (CR) jenkins-bot: Drop the Petition extension: Part VI - Drop i18n load [mediawiki-config] - https://gerrit.wikimedia.org/r/472218 (https://phabricator.wikimedia.org/T208081) (owner: Jforrester)
[00:14:26] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: T208081 Drop the Petition extension from CommonSettings (duration: 00m 53s)
[00:14:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:14:30] T208081: Undeploy Petition extension - https://phabricator.wikimedia.org/T208081
[00:16:03] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T208081 Drop the Petition extension from InitialiseSettings (duration: 00m 52s)
[00:16:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:16:24] (PS1) Catrope: GrowthExperiments: Enable WelcomeSurvey on English and Korean beta wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/472359 (https://phabricator.wikimedia.org/T208449)
[00:18:21] (CR) Jforrester: [C: 2] GrowthExperiments: Enable WelcomeSurvey on English and Korean beta wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/472359 (https://phabricator.wikimedia.org/T208449) (owner: Catrope)
[00:18:35] !log jforrester@deploy1001 Synchronized wmf-config/extension-list: T208081 Drop the Petition extension from extension-list (duration: 00m 53s)
[00:18:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:19:34] (Merged) jenkins-bot: GrowthExperiments: Enable WelcomeSurvey on English and Korean beta wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/472359 (https://phabricator.wikimedia.org/T208449) (owner: Catrope)
[00:19:52] (CR) Dzahn: "recheck" [puppet] - https://gerrit.wikimedia.org/r/472352 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn)
[00:20:24] (CR) jerkins-bot: [V: -1] icinga: fix path to retention.dat file on stretch [puppet] - https://gerrit.wikimedia.org/r/472352 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn)
[00:21:07] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T208449 Disable wgWelcomeSurveyEnabled everywhere in production (duration: 00m 54s)
[00:21:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:21:21] T208449: Deploy GrowthExperiments extension to beta - https://phabricator.wikimedia.org/T208449
[00:21:50] \o/
[00:25:28] (CR) Dzahn: "cdanis pointed me to this after i did https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/472357/ .. rebasing :)" [puppet] - https://gerrit.wikimedia.org/r/472203 (owner: Giuseppe Lavagetto)
[00:25:48] (PS2) Dzahn: monitoring: fix spec tests [puppet] - https://gerrit.wikimedia.org/r/472203 (owner: Giuseppe Lavagetto)
[00:26:13] (CR) Dzahn: [C: 2] monitoring: fix spec tests [puppet] - https://gerrit.wikimedia.org/r/472203 (owner: Giuseppe Lavagetto)
[00:26:19] !log Created the bot_passwords table for Governance wiki T205368
[00:26:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:26:22] T205368: Enable bot passwords on Governance wiki - https://phabricator.wikimedia.org/T205368
[00:26:48] (PS2) Jforrester: [GovernanceWiki] Enable BotPasswords, take two. [mediawiki-config] - https://gerrit.wikimedia.org/r/472358 (https://phabricator.wikimedia.org/T205368)
[00:27:05] (CR) Dzahn: [V: 2 C: 2] "also see: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/472203/" [puppet] - https://gerrit.wikimedia.org/r/472357 (owner: Dzahn)
[00:27:58] (CR) Dzahn: "recheck" [puppet] - https://gerrit.wikimedia.org/r/472352 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn)
[00:28:00] RoanKattouw: I673f59d936 is on mwdebug1002 (for wmf.3).
[00:28:29] (CR) Jforrester: [C: 2] [GovernanceWiki] Enable BotPasswords, take two. [mediawiki-config] - https://gerrit.wikimedia.org/r/472358 (https://phabricator.wikimedia.org/T205368) (owner: Jforrester)
[00:29:56] (Merged) jenkins-bot: [GovernanceWiki] Enable BotPasswords, take two. [mediawiki-config] - https://gerrit.wikimedia.org/r/472358 (https://phabricator.wikimedia.org/T205368) (owner: Jforrester)
[00:30:45] (CR) Dzahn: [C: 2] "thank you! i had not realized that's how i broke it myself. rebased, merged , ran "recheck" here and this works now: for example V+2 on " [puppet] - https://gerrit.wikimedia.org/r/472203 (owner: Giuseppe Lavagetto)
[00:32:58] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T205368 Enable BotPasswords on Governance wiki (duration: 00m 55s)
[00:33:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:33:01] T205368: Enable bot passwords on Governance wiki - https://phabricator.wikimedia.org/T205368
[00:35:07] (CR) jenkins-bot: GrowthExperiments: Enable WelcomeSurvey on English and Korean beta wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/472359 (https://phabricator.wikimedia.org/T208449) (owner: Catrope)
[00:35:09] (CR) jenkins-bot: [GovernanceWiki] Enable BotPasswords, take two. [mediawiki-config] - https://gerrit.wikimedia.org/r/472358 (https://phabricator.wikimedia.org/T205368) (owner: Jforrester)
[00:36:42] RoanKattouw: Any luck?
[00:39:31] James_F: Sorry, my Lua is rusty
[00:39:58] RoanKattouw: Better than your Rust being leery.
[00:40:06] Got it : https://test.wikipedia.org/wiki/UTF8-test
[00:40:19] (CR) Dzahn: [C: -2] "there is already Stdlib::Filemode. reinventing the wheel" [puppet] - https://gerrit.wikimedia.org/r/471326 (owner: Dzahn)
[00:40:31] That should stop throwing an exception (and instead just have broken JS) with my patch
[00:40:44] Indeed.
[00:40:51] James_F: Yay it works
[00:41:00] It also puts the exception message in the console as intended
[00:42:29] RoanKattouw: Indeed. Syncing now.
[00:43:13] !log jforrester@deploy1001 Synchronized php-1.33.0-wmf.3/includes/resourceloader/ResourceLoader.php: ResourceLoader: Fail less hard when JSON serialization of config fails I673f59d93 (duration: 00m 53s)
[00:43:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:45:36] OK, SWAT over.
[00:46:12] 10 patches and two DB maintenance scripts in 50 minutes. Fun.
[00:51:05] (PS2) Dzahn: icinga: fix path to retention.dat file on stretch [puppet] - https://gerrit.wikimedia.org/r/472352 (https://phabricator.wikimedia.org/T202782)
[00:51:25] (CR) Dzahn: icinga: fix path to retention.dat file on stretch (1 comment) [puppet] - https://gerrit.wikimedia.org/r/472352 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn)
[00:52:01] (CR) jerkins-bot: [V: -1] icinga: fix path to retention.dat file on stretch [puppet] - https://gerrit.wikimedia.org/r/472352 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn)
[00:56:57] (PS1) Dzahn: stdlib: import useful data types (filemode,filesource,fqdn,host,port) [puppet] - https://gerrit.wikimedia.org/r/472363
[00:58:12] (CR) Dzahn: "i recently attempted to create "Filemode" myself from scratch at https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/471326/ but the i " [puppet] - https://gerrit.wikimedia.org/r/472363 (owner: Dzahn)
[01:00:04] twentyafterfour: It is that lovely time of the day again! You are hereby commanded to deploy Phabricator update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181108T0100).
[01:02:18] (Abandoned) Dzahn: stdlib: add a data type for Unix octal file mode [puppet] - https://gerrit.wikimedia.org/r/471326 (owner: Dzahn)
[01:06:22] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 3.464 second response time
[01:06:50] (PS3) Dzahn: icinga: fix path to retention.dat file on stretch [puppet] - https://gerrit.wikimedia.org/r/472352 (https://phabricator.wikimedia.org/T202782)
[01:07:53] (CR) jerkins-bot: [V: -1] icinga: fix path to retention.dat file on stretch [puppet] - https://gerrit.wikimedia.org/r/472352 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn)
[01:10:01] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:12:01] (PS4) Dzahn: icinga: fix path to retention.dat file on stretch [puppet] - https://gerrit.wikimedia.org/r/472352 (https://phabricator.wikimedia.org/T202782)
[01:12:54] (CR) jerkins-bot: [V: -1] icinga: fix path to retention.dat file on stretch [puppet] - https://gerrit.wikimedia.org/r/472352 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn)
[01:13:21] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 9.949 second response time
[01:14:07] (PS7) Mathew.onipe: wdqs: separation of concerns [puppet] - https://gerrit.wikimedia.org/r/471665 (https://phabricator.wikimedia.org/T208394)
[01:14:28] (PS1) Bstorm: sonofgridengine: adapt more code for bastions on stretch [puppet] - https://gerrit.wikimedia.org/r/472364 (https://phabricator.wikimedia.org/T200557)
[01:14:38] (CR) Mathew.onipe: wdqs: separation of concerns (1 comment) [puppet] - https://gerrit.wikimedia.org/r/471665 (https://phabricator.wikimedia.org/T208394) (owner: Mathew.onipe)
[01:16:18] (PS2) Bstorm: sonofgridengine: adapt more code for bastions on stretch [puppet] - https://gerrit.wikimedia.org/r/472364 (https://phabricator.wikimedia.org/T200557)
[01:16:41] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:16:47] (PS5) Dzahn: icinga: fix path to retention.dat file on stretch [puppet] - https://gerrit.wikimedia.org/r/472352 (https://phabricator.wikimedia.org/T202782)
[01:17:37] (CR) jerkins-bot: [V: -1] icinga: fix path to retention.dat file on stretch [puppet] - https://gerrit.wikimedia.org/r/472352 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn)
[01:18:42] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.006 second response time
[01:18:44] !log scb1004 - systemctl restart pdfrender (T174916)
[01:18:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:18:47] T174916: electron/pdfrender hangs - https://phabricator.wikimedia.org/T174916
[01:21:05] (PS6) Dzahn: icinga: fix path to retention.dat file on stretch [puppet] - https://gerrit.wikimedia.org/r/472352 (https://phabricator.wikimedia.org/T202782)
[01:22:46] (CR) Bstorm: [C: 2] sonofgridengine: adapt more code for bastions on stretch [puppet] - https://gerrit.wikimedia.org/r/472364 (https://phabricator.wikimedia.org/T200557) (owner: Bstorm)
[01:27:10] (PS7) Dzahn: icinga: fix path to retention.dat file on stretch [puppet] - https://gerrit.wikimedia.org/r/472352 (https://phabricator.wikimedia.org/T202782)
[01:28:00] (CR) jerkins-bot: [V: -1] icinga: fix path to retention.dat file on stretch [puppet] - https://gerrit.wikimedia.org/r/472352 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn)
[01:30:10] (PS8) Dzahn: icinga: fix path to retention.dat file on stretch [puppet] - https://gerrit.wikimedia.org/r/472352 (https://phabricator.wikimedia.org/T202782)
[01:30:59] (CR) jerkins-bot: [V: -1] icinga: fix path to retention.dat file on stretch [puppet] - https://gerrit.wikimedia.org/r/472352 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn)
[01:33:42] (PS9) Dzahn: icinga: fix path to retention.dat file on stretch [puppet] - https://gerrit.wikimedia.org/r/472352 (https://phabricator.wikimedia.org/T202782)
[01:34:32] (CR) jerkins-bot: [V: -1] icinga: fix path to retention.dat file on stretch [puppet] - https://gerrit.wikimedia.org/r/472352 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn)
[01:35:56] (PS1) Bstorm: sonofgridengine: Make comment more explicit in dev_environ [puppet] - https://gerrit.wikimedia.org/r/472367
[01:37:54] (PS10) Dzahn: icinga: fix path to retention.dat file on stretch [puppet] - https://gerrit.wikimedia.org/r/472352 (https://phabricator.wikimedia.org/T202782)
[01:38:59] (PS1) Bstorm: sonofgridengine: remove cruft comment [puppet] - https://gerrit.wikimedia.org/r/472369
[01:39:52] (CR) Bstorm: [C: 2] sonofgridengine: remove cruft comment [puppet] - https://gerrit.wikimedia.org/r/472369 (owner: Bstorm)
[01:41:09] (CR) Dzahn: [C: 1] "https://puppet-compiler.wmflabs.org/compiler1002/13394/" [puppet] - https://gerrit.wikimedia.org/r/472352 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn)
[01:41:18] (PS2) Bstorm: sonofgridengine: Make comment more explicit in dev_environ [puppet] - https://gerrit.wikimedia.org/r/472367
[01:42:17] (CR) Bstorm: [C: 2] sonofgridengine: Make comment more explicit in dev_environ [puppet] - https://gerrit.wikimedia.org/r/472367 (owner: Bstorm)
[01:42:51] (CR) Dzahn: [C: 1] "note how currently on icinga1001 we have a retention.dat existing in both of these locations, one as the local one and one as the synced o" [puppet] - https://gerrit.wikimedia.org/r/472352 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn)
[02:20:41] PROBLEM - Host rdb1010 is DOWN: PING CRITICAL - Packet loss = 100%
[02:22:11] RECOVERY - Host rdb1010 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms
[02:26:53] Operations, ORES, Scoring-platform-team, Performance: Diagnose and fix 4.5k req/min ceiling for ores* requests - https://phabricator.wikimedia.org/T182249 (Halfak) I like the proposal of depooling one datacenter. What do you think @akosiaris? Is this crazy?
[02:32:38] Operations, ORES, Scoring-platform-team, Performance: Diagnose and fix 4.5k req/min ceiling for ores* requests - https://phabricator.wikimedia.org/T182249 (awight) Another thing to consider is that, although we're all very curious about our ceiling, it doesn't really matter until we see real traf...
[03:29:01] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 803.86 seconds
[03:35:40] Hi, I’m getting “Phabricator is in read-only mode (unreachable master).”
[03:35:43] On phab
[03:37:11] twentyafterfour: ^^
[03:37:25] Works now
[03:43:48] (CR) Legoktm: [C: -1] Enable Wikibase PageRandomLookup unexpected page_random value logging (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/472356 (https://phabricator.wikimedia.org/T208796) (owner: Niedzielski)
[04:19:52] (PS2) Niedzielski: Enable Wikibase PageRandomLookup unexpected page_random value logging [mediawiki-config] - https://gerrit.wikimedia.org/r/472356 (https://phabricator.wikimedia.org/T208796)
[04:20:03] (CR) Niedzielski: Enable Wikibase PageRandomLookup unexpected page_random value logging (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/472356 (https://phabricator.wikimedia.org/T208796) (owner: Niedzielski)
[05:12:30] * bawolff is going to go deploy a security patch
[05:30:26] !log deployed patch T208881
[05:30:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:28:32] PROBLEM - puppet last run on sodium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/mysql-ps1.sh]
[06:29:01] PROBLEM - puppet last run on mw1319 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/apache2/conf-available/00-fcgi-headers.conf]
[06:29:52] PROBLEM - puppet last run on stat1005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/R/update-library.R]
[06:34:33] !log bawolff@deploy1001 Synchronized php-1.33.0-wmf.3/extensions/OpenStackManager/special/SpecialNovaSudoer.php: T203885 (duration: 00m 54s)
[06:34:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:59:11] Operations, Cloud-VPS, Wikidata, Wikidata-Query-Service, and 2 others: WDQS tests can no longer edit test.wikidata.org - https://phabricator.wikimedia.org/T208986 (Smalyshev) p:Triage>High
[06:59:12] RECOVERY - puppet last run on sodium is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[06:59:32] RECOVERY - puppet last run on mw1319 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:00:22] PROBLEM - toolschecker: check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/toolscron - 185 bytes in 0.022 second response time
[07:00:32] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:01:51] (PS3) Giuseppe Lavagetto: package_builder: add ability to add the php72 component on stretch [puppet] - https://gerrit.wikimedia.org/r/472136 (https://phabricator.wikimedia.org/T208433)
[07:03:51] RECOVERY - toolschecker: check mtime mod from tools cron job on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.006 second response time
[07:05:02] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 272.44 seconds
[07:05:06] (CR) Giuseppe Lavagetto: package_builder: add ability to add the php72 component on stretch (2 comments) [puppet] - https://gerrit.wikimedia.org/r/472136 (https://phabricator.wikimedia.org/T208433) (owner: Giuseppe Lavagetto)
[07:05:18] (CR) Giuseppe Lavagetto: [C: 2] package_builder: add ability to add the php72 component on stretch [puppet] - https://gerrit.wikimedia.org/r/472136 (https://phabricator.wikimedia.org/T208433) (owner: Giuseppe Lavagetto)
[07:10:27] (PS1) Giuseppe Lavagetto: package_builder: fixup for Ia71dc8c9 [puppet] - https://gerrit.wikimedia.org/r/472382
[07:11:16] (CR) Giuseppe Lavagetto: [C: 2] package_builder: fixup for Ia71dc8c9 [puppet] - https://gerrit.wikimedia.org/r/472382 (owner: Giuseppe Lavagetto)
[07:26:58] (PS1) Giuseppe Lavagetto: package_builder: actually update apt in the php72 hook [puppet] - https://gerrit.wikimedia.org/r/472383
[07:27:21] (CR) Giuseppe Lavagetto: [C: 2] package_builder: actually update apt in the php72 hook [puppet] - https://gerrit.wikimedia.org/r/472383 (owner: Giuseppe Lavagetto)
[07:45:16] (PS3) Ema: cache: add cp2006 and cp2012 to cache::text [puppet] - https://gerrit.wikimedia.org/r/472130 (https://phabricator.wikimedia.org/T208588)
[07:46:34] (CR) Ema: [C: 2] cache: add cp2006 and cp2012 to cache::text [puppet] - https://gerrit.wikimedia.org/r/472130 (https://phabricator.wikimedia.org/T208588) (owner: Ema)
[07:57:51] (CR) Gehel: "Nice! Would it make more sense to upgrade to a more recent version of stdlib instead of picking just pieces of it?" [puppet] - https://gerrit.wikimedia.org/r/472363 (owner: Dzahn)
[07:58:58] Operations, Traffic, Patch-For-Review: Add ex cp-misc_codfw to text and upload - https://phabricator.wikimedia.org/T208588 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on sarin.codfw.wmnet for hosts: ``` ['cp2006.codfw.wmnet', 'cp2012.codfw.wmnet'] ``` The log can be found in `/va...
[08:10:13] (CR) Muehlenhoff: [C: 2] Fix CVE-2018-16843 CVE-2018-16844 CVE-2018-16845 [software/nginx] (wmf-1.13) - https://gerrit.wikimedia.org/r/472143 (owner: Muehlenhoff)
[08:13:33] Operations, Gadgets, MediaWiki-Cache, MW-1.33-notes (1.33.0-wmf.3; 2018-11-06), and 4 others: Mcrouter periodically reports soft TKOs for mc[1,2]035 leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (elukey) >>! In T203786#4729548, @aaron wrote: > Keys are set by add/...
[08:13:39] Operations, Traffic, Patch-For-Review: Add ex cp-misc_codfw to text and upload - https://phabricator.wikimedia.org/T208588 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp2006.codfw.wmnet', 'cp2012.codfw.wmnet'] ``` Of which those **FAILED**: ``` ['cp2006.codfw.wmnet', 'cp2012.codfw.wm...
[08:15:58] Operations, Traffic: ATS backend-side request-mangling - https://phabricator.wikimedia.org/T209021 (ema)
[08:16:05] Operations, Traffic: ATS backend-side request-mangling - https://phabricator.wikimedia.org/T209021 (ema) p:Triage>Normal
[08:16:18] Operations, Traffic: ATS production-ready as a backend cache layer - https://phabricator.wikimedia.org/T207048 (ema)
[08:16:50] Operations, Traffic: ATS production-ready as a backend cache layer - https://phabricator.wikimedia.org/T207048 (ema)
[08:16:53] Operations, Traffic: ATS backend-side request-mangling - https://phabricator.wikimedia.org/T209021 (ema)
[08:17:02] Operations, Traffic: ATS backend-side request-mangling - https://phabricator.wikimedia.org/T209021 (ema)
[08:19:45] (PS2) Muehlenhoff: Remove Diamond from Matomo [puppet] - https://gerrit.wikimedia.org/r/472109 (https://phabricator.wikimedia.org/T183454)
[08:21:35] (CR) Muehlenhoff: [C: 2] Remove Diamond from Matomo [puppet] - https://gerrit.wikimedia.org/r/472109 (https://phabricator.wikimedia.org/T183454) (owner: Muehlenhoff)
[08:35:32] (PS2) Muehlenhoff: mediawiki: Remove obsolete Firejail profile [puppet] - https://gerrit.wikimedia.org/r/472128
[08:40:32] (PS1) Ema: cacheproxy: only call cron_splay() for hosts in $all_nodes [puppet] - https://gerrit.wikimedia.org/r/472384 (https://phabricator.wikimedia.org/T208588)
[08:47:11] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:47:26] Operations, Traffic: ATS production-ready as a backend cache layer - https://phabricator.wikimedia.org/T207048 (ema)
[08:48:11] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time
[08:48:41] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:51:41] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:55:19] (CR) Banyek: [C: 1] "I am using cumin hosts now, so I can +1 it as they are working replcatement, but I don't know jcrespo or marostegui" [puppet] - https://gerrit.wikimedia.org/r/466833 (owner: Muehlenhoff)
[08:56:03] (CR) Banyek: [C: 1] Remove Diamond from DB roles [puppet] - https://gerrit.wikimedia.org/r/467264 (https://phabricator.wikimedia.org/T183454) (owner: Muehlenhoff)
[08:56:59] Operations, Operations-Software-Development: Systemd session creation fails under I/O load - https://phabricator.wikimedia.org/T199911 (fgiunchedi) Current bandaid I'm using: `systemctl reset-failed $(systemctl show --failed -p Id --value *.scope)`
[08:57:11] RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational
[09:06:41] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 277 bytes in 5.973 second response time
[09:10:11] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:10:45] (PS3) Muehlenhoff: mediawiki: Remove obsolete Firejail profile [puppet] - https://gerrit.wikimedia.org/r/472128
[09:11:52] (CR) Muehlenhoff: [C: 2] mediawiki: Remove obsolete Firejail profile [puppet] - https://gerrit.wikimedia.org/r/472128 (owner: Muehlenhoff)
[09:15:21] PROBLEM - puppet last run on mw2228 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/firejail/mediawiki-imagemagick.profile]
[09:15:32] PROBLEM - puppet last run on mw1299 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/firejail/mediawiki-imagemagick.profile]
[09:15:42] <_joe_> moritzm: ^^
[09:15:47] <_joe_> seems like it's still used?
[09:16:02] PROBLEM - puppet last run on mw2143 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/firejail/mediawiki-imagemagick.profile]
[09:16:37] looking
[09:16:42] PROBLEM - puppet last run on mw2254 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/firejail/mediawiki-imagemagick.profile]
[09:18:33] it's just puppet being puppet
[09:18:42] PROBLEM - puppet last run on mw1271 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/firejail/mediawiki-imagemagick.profile]
[09:19:32] PROBLEM - puppet last run on mw2273 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/firejail/mediawiki-imagemagick.profile]
[09:19:50] Operations, Citoid, Services (watching): Support meta tag refresh redirects in citoid to support elsevier's linking hub - https://phabricator.wikimedia.org/T204032 (Mvolz)
[09:20:00] Operations, Citoid, Patch-For-Review, Services (watching), VisualEditor (Current work): Transition citoid to use Zotero's translation-server-v2 - https://phabricator.wikimedia.org/T197242 (Mvolz)
[09:20:22] RECOVERY - puppet last run on mw2228 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:20:30] (PS1) Filippo Giunchedi: hieradata: rollout rsyslog_exporter in ulsfo [puppet] - https://gerrit.wikimedia.org/r/472395 (https://phabricator.wikimedia.org/T205849)
[09:20:32] (PS1) Filippo Giunchedi: hieradata: rollout rsyslog_exporter in codfw [puppet] - https://gerrit.wikimedia.org/r/472396 (https://phabricator.wikimedia.org/T205849)
[09:20:42] RECOVERY - puppet last run on mw1299 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[09:21:51] RECOVERY - puppet last run on mw2254 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[09:23:42] RECOVERY - puppet last run on mw1271 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[09:25:52] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 4.437 second response time
[09:26:21] RECOVERY - puppet last run on mw2143 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[09:28:20] (CR) Ema: [C: 1] hieradata: rollout rsyslog_exporter in ulsfo (1 comment) [puppet] - https://gerrit.wikimedia.org/r/472395 (https://phabricator.wikimedia.org/T205849) (owner: Filippo Giunchedi)
[09:29:05] !log temporarily set elasticsearch logstash watermark to low:0.85 and high:0.9
[09:29:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:29:22] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:29:30] !log installing curl security updates [09:29:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:33] (03PS6) 10Filippo Giunchedi: WIP: temporary workaround co-installability of two roles [puppet] - 10https://gerrit.wikimedia.org/r/470346 [09:30:35] (03PS2) 10Filippo Giunchedi: hieradata: rollout rsyslog_exporter in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/472395 (https://phabricator.wikimedia.org/T205849) [09:30:37] (03PS2) 10Filippo Giunchedi: hieradata: rollout rsyslog_exporter in codfw [puppet] - 10https://gerrit.wikimedia.org/r/472396 (https://phabricator.wikimedia.org/T205849) [09:37:30] !log keep 2x not 3x copies of older (>15d) logstash elasticsearch indices [09:37:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:39] godog: it's sales season? [09:39:43] (03PS1) 10Elukey: Revert "mcrouter::shards: depool mc2029" [puppet] - 10https://gerrit.wikimedia.org/r/472397 [09:39:46] :p [09:40:11] jiji: I wish! 
[09:40:30] (03CR) 10Filippo Giunchedi: hieradata: rollout rsyslog_exporter in ulsfo (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/472395 (https://phabricator.wikimedia.org/T205849) (owner: 10Filippo Giunchedi) [09:45:11] RECOVERY - puppet last run on mw2273 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:45:51] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 276 bytes in 8.023 second response time [09:46:15] (03CR) 10Filippo Giunchedi: [C: 032] hieradata: rollout rsyslog_exporter in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/472395 (https://phabricator.wikimedia.org/T205849) (owner: 10Filippo Giunchedi) [09:46:24] (03PS3) 10Filippo Giunchedi: hieradata: rollout rsyslog_exporter in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/472395 (https://phabricator.wikimedia.org/T205849) [09:49:12] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:02:22] !log installing ppp security updates on trusty [10:02:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:05] (03CR) 10Elukey: [C: 032] Update hive parquet log destination [puppet/cdh] - 10https://gerrit.wikimedia.org/r/471928 (https://phabricator.wikimedia.org/T208550) (owner: 10Joal) [10:04:32] (03PS1) 10Elukey: Update the cdh module to latest SHA [puppet] - 10https://gerrit.wikimedia.org/r/472398 [10:04:54] (03CR) 10Elukey: [V: 032 C: 032] Update the cdh module to latest SHA [puppet] - 10https://gerrit.wikimedia.org/r/472398 (owner: 10Elukey) [10:05:11] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 276 bytes in 4.767 second response time [10:07:51] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:08:41] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:10:52] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - CRITICAL - pdfrender_5252: Servers 
scb1004.eqiad.wmnet are marked down but pooled [10:11:52] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - pdfrender_5252: Servers scb1003.eqiad.wmnet are marked down but pooled [10:14:06] PROBLEM - LVS HTTP IPv4 on pdfrender.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:14:23] ouch [10:14:25] looking [10:14:31] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.351 second response time [10:14:32] <_joe_> it finally paged [10:14:45] <_joe_> volans: jiji should be looking [10:14:55] <_joe_> please coordinate [10:14:57] _joe_: ack, then I'll not restart it [10:17:33] <_joe_> volans: I'm not sure she is doing it right now, I might have misunderstood [10:17:41] PROBLEM - pdfrender on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:17:55] _joe_: let me restart a couple of them first [10:18:01] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:18:03] <_joe_> I'm restarting 1001 [10:18:09] I was doing it :D [10:18:23] <_joe_> !log restarting pdfrender on scb1001 [10:18:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:41] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.005 second response time [10:18:45] !log restarting pdfrender on scb1002 [10:18:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:02] !log restarting pdfrender on scb1004 [10:19:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:30] <_joe_> sigh [10:19:43] <_joe_> volans: still opposed to daily restarts of this junk? 
[10:19:45] RECOVERY - LVS HTTP IPv4 on pdfrender.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time [10:19:51] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time [10:19:51] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy [10:20:02] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [10:20:11] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time [10:20:29] _joe_: I wasn't opposed... it's just super meh :( but yeah better than this [10:22:07] <_joe_> volans: yeah it's horrible [10:22:28] <_joe_> but given the constraints, it feels like the best option to me [10:22:33] +1 [10:22:38] I'll restart on scb1003 [10:23:22] !log restarting pdfrender on scb1003 [10:23:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:19] volans: we could do every 2 days [10:25:01] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.003 second response time [10:25:08] _joe_ should I postpone the mc2029 repool (I was about to do it) [10:25:09] ? [10:25:14] <_joe_> elukey: no, why? 
[10:25:25] (03PS2) 10Elukey: Revert "mcrouter::shards: depool mc2029" [puppet] - 10https://gerrit.wikimedia.org/r/472397 [10:25:42] <_joe_> jiji: 1 day is more than enough, if we carefully splay across the cluster [10:25:53] not sure about if you guys wanted maintenance with ongoing issues with pdfrender, sounds not :) [10:25:54] <_joe_> we do it for other services too [10:26:03] <_joe_> elukey: it's pdfrender, we' [10:26:10] !log restart memcached on mc2029 (was depooled yesterday for network maintenance) [10:26:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:16] <_joe_> re the only ones caring about it [10:26:34] <_joe_> I don't know why "OCG" comes back to my mind [10:26:58] (03CR) 10Elukey: [C: 032] Revert "mcrouter::shards: depool mc2029" [puppet] - 10https://gerrit.wikimedia.org/r/472397 (owner: 10Elukey) [10:27:43] (03PS4) 10Muehlenhoff: Use auto_ferm for nfs::misc rsyncd modules [puppet] - 10https://gerrit.wikimedia.org/r/467985 [10:27:44] _joe_ your subconscious clearly misses it [10:28:27] (03CR) 10Muehlenhoff: [C: 032] Use auto_ferm for nfs::misc rsyncd modules [puppet] - 10https://gerrit.wikimedia.org/r/467985 (owner: 10Muehlenhoff) [10:28:30] <_joe_> elukey: or I'm stuck in the same situation again? 
[10:28:37] <_joe_> pdf groundhog day [10:29:14] <_joe_> a new movie where a grumpy SRE wakes up every morning to an unreliable pdf renderer no one cares about but him, and the users [10:29:19] (03PS1) 10Gilles: Expose Varnish X-Cache-Status via Server-Timing [puppet] - 10https://gerrit.wikimedia.org/r/472401 (https://phabricator.wikimedia.org/T207862) [10:29:40] <_joe_> it's gonna be a fun movie, not sure about who can be the main character though [10:29:55] I don't know of any grumpy SRE off the top of my head [10:29:57] (03PS3) 10Filippo Giunchedi: hieradata: rollout rsyslog_exporter in codfw [puppet] - 10https://gerrit.wikimedia.org/r/472396 (https://phabricator.wikimedia.org/T205849) [10:29:59] (03PS1) 10Filippo Giunchedi: hieradata: rollout rsyslog_exporter in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/472403 (https://phabricator.wikimedia.org/T205849) [10:30:01] (03PS1) 10Filippo Giunchedi: hieradata: rollout rsyslog_exporter in esams [puppet] - 10https://gerrit.wikimedia.org/r/472404 (https://phabricator.wikimedia.org/T205849) [10:31:23] (running puppet on the mw2*) [10:32:23] (03CR) 10GTirloni: [C: 032] Nova: add cloudvirt1017 to the scheduler pool [puppet] - 10https://gerrit.wikimedia.org/r/472253 (https://phabricator.wikimedia.org/T208733) (owner: 10Andrew Bogott) [10:32:35] 10Operations, 10Performance-Team, 10Traffic, 10Patch-For-Review, 10Wikimedia-Incident: Collect Backend-Timing in Prometheus - https://phabricator.wikimedia.org/T131894 (10Gilles) 05Open>03Resolved The basic functionality is there. If we want to iterate on that, it should be the subject of a new task. 
[10:32:39] (03PS2) 10GTirloni: Nova: add cloudvirt1017 to the scheduler pool [puppet] - 10https://gerrit.wikimedia.org/r/472253 (https://phabricator.wikimedia.org/T208733) (owner: 10Andrew Bogott) [10:35:45] (03PS1) 10Banyek: mariadb: depool db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472405 (https://phabricator.wikimedia.org/T208954) [10:37:01] (03CR) 10jerkins-bot: [V: 04-1] mariadb: depool db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472405 (https://phabricator.wikimedia.org/T208954) (owner: 10Banyek) [10:38:34] (03PS2) 10Banyek: mariadb: depool db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472405 (https://phabricator.wikimedia.org/T208954) [10:39:55] jouncebot: now [10:39:55] No deployments scheduled for the next 1 hour(s) and 20 minute(s) [10:40:01] woo [10:41:24] (03PS1) 10Filippo Giunchedi: prometheus: add rsyslog_exporter jobs [puppet] - 10https://gerrit.wikimedia.org/r/472408 (https://phabricator.wikimedia.org/T205849) [10:47:42] \o all [10:47:45] (03PS1) 10Addshore: Disable wmgUseTwoColConflict everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472409 (https://phabricator.wikimedia.org/T209012) [10:47:46] * addshore is going to deploy this ^^ now [10:48:09] (03CR) 10Addshore: [C: 032] Disable wmgUseTwoColConflict everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472409 (https://phabricator.wikimedia.org/T209012) (owner: 10Addshore) [10:49:26] (03Merged) 10jenkins-bot: Disable wmgUseTwoColConflict everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472409 (https://phabricator.wikimedia.org/T209012) (owner: 10Addshore) [10:49:41] (03CR) 10jenkins-bot: Disable wmgUseTwoColConflict everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472409 (https://phabricator.wikimedia.org/T209012) (owner: 10Addshore) [10:49:52] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: add rsyslog_exporter jobs [puppet] - 10https://gerrit.wikimedia.org/r/472408 
(https://phabricator.wikimedia.org/T205849) (owner: 10Filippo Giunchedi) [10:50:02] (03PS2) 10Filippo Giunchedi: prometheus: add rsyslog_exporter jobs [puppet] - 10https://gerrit.wikimedia.org/r/472408 (https://phabricator.wikimedia.org/T205849) [10:50:24] (03PS1) 10Elukey: aptrepo: update cloudera cdh to 5.15 [puppet] - 10https://gerrit.wikimedia.org/r/472410 (https://phabricator.wikimedia.org/T204759) [10:52:29] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Disable wmgUseTwoColConflict everywhere T209012 T208840 T195724 (duration: 00m 58s) [10:52:32] !log Reimaging rdb1006 to stretch - T206450 [10:52:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:35] T208840: Monitor user reports about imaginary conflicts - https://phabricator.wikimedia.org/T208840 [10:52:35] T195724: Adapt Save-, Preview- and Cancel buttons to new version (No. 8) - https://phabricator.wikimedia.org/T195724 [10:52:36] T209012: Preview on a new site breaks and leads to loose of the site content - https://phabricator.wikimedia.org/T209012 [10:52:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:38] T206450: Reorganize our redis rdb1/rdb2 clusters - https://phabricator.wikimedia.org/T206450 [10:52:43] * addshore is all done [10:54:59] (03PS1) 10Thiemo Kreuz (WMDE): Revert "Disable wmgUseTwoColConflict everywhere" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472411 (https://phabricator.wikimedia.org/T205942) [10:55:54] (03CR) 10Volans: "Should I validate also the choice of db1089 as temporary vslow/dump or that was already agreed upon?" 
(031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472405 (https://phabricator.wikimedia.org/T208954) (owner: 10Banyek) [10:57:24] (03PS2) 10GTirloni: network::constants labs: add missing redundant project-proxy host [puppet] - 10https://gerrit.wikimedia.org/r/472021 (owner: 10Alex Monk) [10:57:54] (03CR) 10Thiemo Kreuz (WMDE): [C: 031] Simplified test name in MWMultiVersionTest (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472157 (owner: 10WMDE-leszek) [10:58:37] (03CR) 10GTirloni: [C: 032] "Good catch, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/472021 (owner: 10Alex Monk) [11:01:30] (03PS2) 10GTirloni: tools - Remove temporary /var/mail fix [puppet] - 10https://gerrit.wikimedia.org/r/471937 (https://phabricator.wikimedia.org/T208843) [11:05:33] !log draining ganeti2008 for reboot/kernel security update [11:05:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:11] (03PS1) 10Effie Mouzeli: Change rdb1005 and rdb1006 to redis::misc master/slave [puppet] - 10https://gerrit.wikimedia.org/r/472412 (https://phabricator.wikimedia.org/T206450) [11:06:50] (03CR) 10jerkins-bot: [V: 04-1] Change rdb1005 and rdb1006 to redis::misc master/slave [puppet] - 10https://gerrit.wikimedia.org/r/472412 (https://phabricator.wikimedia.org/T206450) (owner: 10Effie Mouzeli) [11:12:21] PROBLEM - Host kubestage1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:14:41] (03PS2) 10Effie Mouzeli: Change rdb1005 and rdb1006 to redis::misc master/slave [puppet] - 10https://gerrit.wikimedia.org/r/472412 (https://phabricator.wikimedia.org/T206450) [11:15:18] (03CR) 10jerkins-bot: [V: 04-1] Change rdb1005 and rdb1006 to redis::misc master/slave [puppet] - 10https://gerrit.wikimedia.org/r/472412 (https://phabricator.wikimedia.org/T206450) (owner: 10Effie Mouzeli) [11:16:07] (03PS1) 10Filippo Giunchedi: WIP swift-reformat-device [puppet] - 10https://gerrit.wikimedia.org/r/472414 [11:16:09] (03PS1) 10Filippo 
Giunchedi: swift: disable free inode btree at mkfs time [puppet] - 10https://gerrit.wikimedia.org/r/472415 (https://phabricator.wikimedia.org/T199198) [11:17:53] 10Operations, 10media-storage, 10Patch-For-Review, 10User-fgiunchedi: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198 (10fgiunchedi) I've finished rebuilding the filesystems 12x on 8x hosts affected, no reoccurrence has been observed since. [11:18:10] !log draining ganeti2007 for reboot/kernel security update [11:18:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:53] (03CR) 10Banyek: "please validate it, I choose it, but I didn't get confirmation" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472405 (https://phabricator.wikimedia.org/T208954) (owner: 10Banyek) [11:20:57] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): Relabel labvirt1017.eqiad.wmnet as cloudvirt1017.eqiad.wmnet - https://phabricator.wikimedia.org/T208945 (10aborrero) [11:21:12] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): Relabel labvirt1017.eqiad.wmnet as cloudvirt1017.eqiad.wmnet - https://phabricator.wikimedia.org/T208945 (10aborrero) [11:21:14] (03PS3) 10Effie Mouzeli: Change rdb1005 and rdb1006 to redis::misc master/slave [puppet] - 10https://gerrit.wikimedia.org/r/472412 (https://phabricator.wikimedia.org/T206450) [11:22:36] (03CR) 10Giuseppe Lavagetto: [C: 031] Change rdb1005 and rdb1006 to redis::misc master/slave [puppet] - 10https://gerrit.wikimedia.org/r/472412 (https://phabricator.wikimedia.org/T206450) (owner: 10Effie Mouzeli) [11:25:52] (03CR) 10Effie Mouzeli: [C: 031] "https://puppet-compiler.wmflabs.org/compiler1002/13405/rdb1005.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/472412 (https://phabricator.wikimedia.org/T206450) (owner: 10Effie Mouzeli) [11:30:51] (03CR) 10Dr0ptp4kt: "Done." 
[puppet] - 10https://gerrit.wikimedia.org/r/471257 (https://phabricator.wikimedia.org/T208795) (owner: 10Dr0ptp4kt) [11:31:33] (03CR) 10GTirloni: [C: 032] tools - Remove temporary /var/mail fix [puppet] - 10https://gerrit.wikimedia.org/r/471937 (https://phabricator.wikimedia.org/T208843) (owner: 10GTirloni) [11:31:53] (03PS3) 10Banyek: mariadb: depool db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472405 (https://phabricator.wikimedia.org/T208954) [11:36:17] 10Operations, 10Operations-Software-Development, 10User-Joe, 10User-jijiki: Create a spicerack cookbook to empty a ganeti node from VMs - https://phabricator.wikimedia.org/T203964 (10akosiaris) [11:41:36] !log draining ganeti2006 for reboot/kernel security update [11:41:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:59] (03CR) 10Effie Mouzeli: [C: 032] Change rdb1005 and rdb1006 to redis::misc master/slave [puppet] - 10https://gerrit.wikimedia.org/r/472412 (https://phabricator.wikimedia.org/T206450) (owner: 10Effie Mouzeli) [11:45:10] (03PS4) 10Effie Mouzeli: Change rdb1005 and rdb1006 to redis::misc master/slave [puppet] - 10https://gerrit.wikimedia.org/r/472412 (https://phabricator.wikimedia.org/T206450) [11:45:21] RECOVERY - Host kubestage1001.mgmt is UP: PING WARNING - Packet loss = 93%, RTA = 0.85 ms [11:49:30] 10Operations, 10ORES, 10Scoring-platform-team, 10Performance: Diagnose and fix 4.5k req/min ceiling for ores* requests - https://phabricator.wikimedia.org/T182249 (10akosiaris) >>! In T182249#4730735, @Halfak wrote: > I like the proposal of depooling one datacenter. What do you think @akosiaris? Is this... 
[11:50:55] !log akosiaris@puppetmaster1001 conftool action : set/weight=38; selector: dc=eqiad,service=apertium,cluster=scb,name=scb1001.* [11:50:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:17] !log increase weight of scb1001 for apertium to 50% [11:51:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:22] 10Operations, 10Patch-For-Review, 10User-Joe, 10User-jijiki: Reorganize our redis rdb1/rdb2 clusters - https://phabricator.wikimedia.org/T206450 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ``` rdb1006.eqiad.wmnet ``` The log can be found in `/var/... [11:54:31] PROBLEM - etcd request latencies on acrab is CRITICAL: instance=10.192.16.26:6443 operation=get https://grafana.wikimedia.org/dashboard/db/kubernetes-api [11:56:42] RECOVERY - etcd request latencies on acrab is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [11:57:42] !log draining ganeti2005 for reboot/kernel security update [11:57:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:21] PROBLEM - Host kubestage1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [12:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: (Dis)respected human, time to deploy European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181108T1200). Please do the needful. [12:00:04] Thiemo_WMDE: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:57] I can SWAT today [12:01:43] * Thiemo_WMDE is here and ready. :) [12:01:59] Thiemo_WMDE: you're not a deployer, right? [12:02:05] No. [12:02:16] Not a deployer. 
[12:06:01] Thiemo_WMDE: I'll ping you in a few minutes when the first patch is ready for testing [12:06:22] * Thiemo_WMDE *thumbs up* [12:08:39] (03PS2) 10Zfilipin: Set AdvancedSearch to default on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472134 (https://phabricator.wikimedia.org/T207641) (owner: 10Thiemo Kreuz (WMDE)) [12:09:21] (03PS3) 10Arturo Borrero Gonzalez: cloudvps: eqiad1: add cloudinstances2b virtual router FQDNs [dns] - 10https://gerrit.wikimedia.org/r/460320 (https://phabricator.wikimedia.org/T202886) [12:10:11] (03CR) 10Arturo Borrero Gonzalez: cloudvps: eqiad1: add cloudinstances2b virtual router FQDNs (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/460320 (https://phabricator.wikimedia.org/T202886) (owner: 10Arturo Borrero Gonzalez) [12:13:03] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472134 (https://phabricator.wikimedia.org/T207641) (owner: 10Thiemo Kreuz (WMDE)) [12:14:18] (03Merged) 10jenkins-bot: Set AdvancedSearch to default on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472134 (https://phabricator.wikimedia.org/T207641) (owner: 10Thiemo Kreuz (WMDE)) [12:17:03] Thiemo_WMDE: 472134 is at mwdebug1002, please test and let me know if I can deploy it [12:18:08] !log draining ganeti2004 for reboot/kernel security update [12:18:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:30] zeljkof: Yea, that was easy. Already confirmed! [12:18:41] Thiemo_WMDE: ok to deploy? [12:18:48] Ok to deploy. 
[12:18:56] ok, deploying [12:19:47] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:472134|Set AdvancedSearch to default on group0 wikis (T207641)]] (duration: 00m 55s) [12:19:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:50] T207641: Set AdvancedSearch to default on group0 wikis - https://phabricator.wikimedia.org/T207641 [12:19:58] Thiemo_WMDE: deployed, please test [12:20:21] zeljkof: Tested, works as it should. Done! \o/ [12:21:02] Thiemo_WMDE: the second one is still waiting for CI, but looks like only one job left, no problems so far [12:21:24] zeljkof: There is a browser test failing. This is unrelated. Please ignore it. [12:21:43] I did, but I'm not sure if I'll be able to make gerrit ignore it :) [12:22:00] I think I've done it before, we'll see if I still remember how it's done [12:22:00] Uh. :( [12:24:15] Thiemo_WMDE: 472175 is merged and all green, the failing job either does not run when merging, or it was removed [12:24:44] 10Operations, 10Patch-For-Review, 10User-Joe, 10User-jijiki: Reorganize our redis rdb1/rdb2 clusters - https://phabricator.wikimedia.org/T206450 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['rdb1006.eqiad.wmnet'] ``` and were **ALL** successful. [12:25:01] Note the TwoColConflict extension is disabled right now. So there is no way for me to test this right now. [12:25:24] Thiemo_WMDE: ah, the second commit can be deployed directly? no testing? [12:25:31] Exactly. [12:25:59] (Should have said this earlier. Sorry.) 
[12:27:01] (03CR) 10jenkins-bot: Set AdvancedSearch to default on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472134 (https://phabricator.wikimedia.org/T207641) (owner: 10Thiemo Kreuz (WMDE)) [12:28:27] Thiemo_WMDE: no problem, deploying [12:29:33] !log zfilipin@deploy1001 Synchronized php-1.33.0-wmf.3/extensions/TwoColConflict: SWAT: [[gerrit:472175|Fix harmless edits turning into conflicts (T205942 T208840 T209012 T209036)]] (duration: 00m 55s) [12:29:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:40] T205942: Some edit submits get fatal InvalidArgumentException "The title does not refer to an existing page" from TwoColConflict extension - https://phabricator.wikimedia.org/T205942 [12:29:41] T209012: Preview on a new site breaks and leads to loose of the site content - https://phabricator.wikimedia.org/T209012 [12:29:41] T209036: The title "Foo" does not refer to an existing page - https://phabricator.wikimedia.org/T209036 [12:29:41] T208840: Monitor user reports about imaginary conflicts - https://phabricator.wikimedia.org/T208840 [12:29:57] Thiemo_WMDE: deployed, since there's nothing to test, thanks for deploying with #releng ;) [12:30:27] !log EU SWAT finished [12:30:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:43] !log draining ganeti2003 for reboot/kernel security update [12:31:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:28] <_joe_> moritzm: I still see a strange alert on ganeti2004 FWIW [12:38:24] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 613.57 seconds [12:39:25] banyek: ^^ [12:41:23] !log Shutdown and reimage rdb200[56] - T206450 [12:41:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:26] T206450: Reorganize our redis rdb1/rdb2 clusters - https://phabricator.wikimedia.org/T206450 [12:41:50] vgutierez: tx [12:46:55] 
jouncebot: now [12:46:55] For the next 0 hour(s) and 13 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181108T1200) [12:48:01] (03CR) 10Thifranc: "About stderr and stdout to log + logrotate, I've been following the indications from Giuseppe Lavagetto on the phabricator related issue :" [puppet] - 10https://gerrit.wikimedia.org/r/470877 (https://phabricator.wikimedia.org/T150375) (owner: 10Thifranc) [12:48:44] (03PS1) 10Thiemo Kreuz (WMDE): Re-enable wmgUseTwoColConflict on group0 only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472431 (https://phabricator.wikimedia.org/T205942) [12:49:25] PROBLEM - IPsec on rdb1007 is CRITICAL: Strongswan CRITICAL - ok: 0 not-conn: rdb2005_v4 [12:49:42] is there any log of critical error occurences on wikis? [12:49:58] (03PS2) 10Thiemo Kreuz (WMDE): Re-enable wmgUseTwoColConflict on group0 only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472431 (https://phabricator.wikimedia.org/T205942) [12:50:08] (03PS4) 10Thifranc: puppet:Reduce cronspam from modules/mediawiki/ [puppet] - 10https://gerrit.wikimedia.org/r/470877 (https://phabricator.wikimedia.org/T150375) [12:50:29] (03CR) 10Addshore: [C: 032] Re-enable wmgUseTwoColConflict on group0 only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472431 (https://phabricator.wikimedia.org/T205942) (owner: 10Thiemo Kreuz (WMDE)) [12:50:34] * addshore steals a bit of swat [12:50:42] dbstore1002 replication catches up [12:51:44] (03Merged) 10jenkins-bot: Re-enable wmgUseTwoColConflict on group0 only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472431 (https://phabricator.wikimedia.org/T205942) (owner: 10Thiemo Kreuz (WMDE)) [12:52:24] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 611.22 seconds [12:52:39] !log draining ganeti2002 for reboot/kernel security update [12:52:40] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:43] (03PS2) 10Thiemo Kreuz (WMDE): Revert "Disable wmgUseTwoColConflict everywhere" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472411 (https://phabricator.wikimedia.org/T205942) [12:53:22] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Re-enable wmgUseTwoColConflict on group0 only T205942 T208840 T209012 T209036 (duration: 00m 54s) [12:53:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:29] T205942: Some edit submits get fatal InvalidArgumentException "The title does not refer to an existing page" from TwoColConflict extension - https://phabricator.wikimedia.org/T205942 [12:53:29] T209012: Preview on a new site breaks and leads to loose of the site content - https://phabricator.wikimedia.org/T209012 [12:53:30] T209036: The title "Foo" does not refer to an existing page - https://phabricator.wikimedia.org/T209036 [12:53:30] T208840: Monitor user reports about imaginary conflicts - https://phabricator.wikimedia.org/T208840 [12:54:16] * addshore is done [12:55:48] (03CR) 10jenkins-bot: Re-enable wmgUseTwoColConflict on group0 only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472431 (https://phabricator.wikimedia.org/T205942) (owner: 10Thiemo Kreuz (WMDE)) [12:56:44] !log akosiaris@puppetmaster1001 conftool action : set/weight=3800; selector: dc=eqiad,service=apertium,cluster=scb,name=scb1001.* [12:56:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:58] !log increase weight of scb1001 for apertium to 99+% [12:56:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:38] (03PS2) 10GTirloni: tools - Remove unused jmail exim4 config [puppet] - 10https://gerrit.wikimedia.org/r/471942 (https://phabricator.wikimedia.org/T208579) [12:58:44] RECOVERY - Host kubestage1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.85 ms [12:59:14] PROBLEM - MariaDB Slave Lag: s1 on 
dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 606.53 seconds [12:59:42] (03CR) 10GTirloni: [C: 032] tools - Remove unused jmail exim4 config [puppet] - 10https://gerrit.wikimedia.org/r/471942 (https://phabricator.wikimedia.org/T208579) (owner: 10GTirloni) [13:00:01] * addshore will have one more for mw-config [13:00:05] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181108T1300) [13:00:12] that I will squeeze in [13:02:49] (03PS1) 10Thiemo Kreuz (WMDE): Re-enable wmgUseTwoColConflict on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472434 (https://phabricator.wikimedia.org/T205942) [13:03:21] (03PS2) 10Filippo Giunchedi: hieradata: rollout rsyslog_exporter in esams [puppet] - 10https://gerrit.wikimedia.org/r/472404 (https://phabricator.wikimedia.org/T205849) [13:03:26] (03CR) 10Addshore: [C: 032] Re-enable wmgUseTwoColConflict on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472434 (https://phabricator.wikimedia.org/T205942) (owner: 10Thiemo Kreuz (WMDE)) [13:03:28] (03CR) 10Filippo Giunchedi: [C: 032] hieradata: rollout rsyslog_exporter in esams [puppet] - 10https://gerrit.wikimedia.org/r/472404 (https://phabricator.wikimedia.org/T205849) (owner: 10Filippo Giunchedi) [13:04:33] (03Merged) 10jenkins-bot: Re-enable wmgUseTwoColConflict on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472434 (https://phabricator.wikimedia.org/T205942) (owner: 10Thiemo Kreuz (WMDE)) [13:05:34] (03PS2) 10Ema: cache: turn on IPsec for cp2006 and cp2012 [puppet] - 10https://gerrit.wikimedia.org/r/472149 (https://phabricator.wikimedia.org/T208588) [13:06:03] (03PS3) 10Ema: cache: turn on IPsec for cp2006 and cp2012 [puppet] - 10https://gerrit.wikimedia.org/r/472149 (https://phabricator.wikimedia.org/T208588) [13:06:37] (03CR) 10Ema: [C: 032] cache: turn on IPsec for cp2006 and cp2012 [puppet] - 10https://gerrit.wikimedia.org/r/472149 
(https://phabricator.wikimedia.org/T208588) (owner: 10Ema) [13:07:46] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Re-enable wmgUseTwoColConflict on dewiki - T205942 T208840 T209012 T209036 (duration: 00m 53s) [13:07:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:53] T205942: Some edit submits get fatal InvalidArgumentException "The title does not refer to an existing page" from TwoColConflict extension - https://phabricator.wikimedia.org/T205942 [13:07:53] T209012: Preview on a new site breaks and leads to loose of the site content - https://phabricator.wikimedia.org/T209012 [13:07:54] T209036: The title "Foo" does not refer to an existing page - https://phabricator.wikimedia.org/T209036 [13:07:54] T208840: Monitor user reports about imaginary conflicts - https://phabricator.wikimedia.org/T208840 [13:08:02] right, that is me done [13:09:15] (03CR) 10Muehlenhoff: [C: 031] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/472410 (https://phabricator.wikimedia.org/T204759) (owner: 10Elukey) [13:09:45] (03CR) 10jenkins-bot: Re-enable wmgUseTwoColConflict on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472434 (https://phabricator.wikimedia.org/T205942) (owner: 10Thiemo Kreuz (WMDE)) [13:10:15] (03PS1) 10Gehel: elasticsearch: hotthread logger needs python3-yaml [puppet] - 10https://gerrit.wikimedia.org/r/472435 [13:10:33] (03PS1) 10Gilles: Lower WebP thumbnail hotness threshold further [puppet] - 10https://gerrit.wikimedia.org/r/472436 (https://phabricator.wikimedia.org/T27611) [13:11:59] (03PS1) 10Gehel: elasticsearch: hotthread logger sends output to log_file [puppet] - 10https://gerrit.wikimedia.org/r/472439 [13:12:07] (03PS3) 10Thiemo Kreuz (WMDE): Revert "Disable wmgUseTwoColConflict everywhere" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472411 (https://phabricator.wikimedia.org/T205942) [13:12:18] (03PS16) 10Mathew.onipe: elasticsearch_cluster: 
multi-cluster/multi-instance support [software/spicerack] - 10https://gerrit.wikimedia.org/r/468558 (https://phabricator.wikimedia.org/T207918) [13:13:25] !log upload rsyslog 8.38.0-1~bpo9+1wmf1 to stretch-wikimedia, version bump only [13:13:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:45] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 607.86 seconds [13:14:15] PROBLEM - SSH kubestage1001.mgmt on kubestage1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:16:06] 10Operations, 10ops-codfw, 10netops, 10Patch-For-Review: codfw row C recable and add QFX - https://phabricator.wikimedia.org/T208272 (10ayounsi) [13:16:52] (03CR) 10Mathew.onipe: elasticsearch_cluster: multi-cluster/multi-instance support (0311 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/468558 (https://phabricator.wikimedia.org/T207918) (owner: 10Mathew.onipe) [13:18:14] PROBLEM - Host kubestage1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:20:06] 10Operations, 10ops-codfw, 10netops, 10Patch-For-Review: codfw row C recable and add QFX - https://phabricator.wikimedia.org/T208272 (10ayounsi) 05Open>03Resolved This has been completed successfully. Everything went as expected, nothing other than C4 went offline. Maintenance took 45min longer than... [13:20:12] mhhhmmmm Get https://docker-registry.wikimedia.org/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers) [13:22:05] well, thats a pain [13:22:37] addshore: one time thing or keeps happening? 
[13:22:46] keeps happening [13:22:49] "docker pull docker-registry.wikimedia.org/nodejs-devel" [13:22:56] * addshore goes to try it one some other machine too [13:23:07] heh I was gonna say, from which host [13:23:15] thats from my local machine [13:23:44] hmm, it works from a random labs machine [13:23:47] so it could be a local issue [13:24:06] (03CR) 10DCausse: "should we remove the >> /var/log/elasticsearch/elasticsearch_hot_threads_errors.log ?" [puppet] - 10https://gerrit.wikimedia.org/r/472439 (owner: 10Gehel) [13:24:19] yeah works from here [13:24:28] (03CR) 10DCausse: [C: 031] elasticsearch: hotthread logger needs python3-yaml [puppet] - 10https://gerrit.wikimedia.org/r/472435 (owner: 10Gehel) [13:24:31] from my local machine that is [13:24:44] * addshore instructs docker to stop having issues... [13:24:54] i can curl it, so it is some weird docker thing i guess [13:25:50] working now, i just restarted my VM [13:25:54] (03CR) 10Volans: [C: 031] "> Patch Set 2:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472405 (https://phabricator.wikimedia.org/T208954) (owner: 10Banyek) [13:25:54] thanks for the check :) [13:26:37] (03CR) 10Gehel: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/472439 (owner: 10Gehel) [13:27:02] (03Abandoned) 10Alex Monk: [DNM] Test commit [puppet] - 10https://gerrit.wikimedia.org/r/471195 (owner: 10Alex Monk) [13:27:39] (03PS2) 10Gehel: elasticsearch: hotthread logger sends output to log_file [puppet] - 10https://gerrit.wikimedia.org/r/472439 [13:27:55] (03CR) 10Gehel: [C: 032] elasticsearch: hotthread logger needs python3-yaml [puppet] - 10https://gerrit.wikimedia.org/r/472435 (owner: 10Gehel) [13:28:06] (03PS2) 10Gehel: elasticsearch: hotthread logger needs python3-yaml [puppet] - 10https://gerrit.wikimedia.org/r/472435 [13:30:06] PROBLEM - configured eth on cp2012 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[13:30:15] PROBLEM - Check systemd state on cp2012 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [13:30:34] yw [13:31:14] RECOVERY - configured eth on cp2012 is OK: OK - interfaces up [13:31:27] cp2012 is me ^ [13:33:24] RECOVERY - Check systemd state on cp2012 is OK: OK - running: The system is fully operational [13:34:13] (03CR) 10DCausse: [C: 031] elasticsearch: hotthread logger sends output to log_file [puppet] - 10https://gerrit.wikimedia.org/r/472439 (owner: 10Gehel) [13:35:03] (03PS3) 10Gehel: elasticsearch: hotthread logger sends output to log_file [puppet] - 10https://gerrit.wikimedia.org/r/472439 [13:36:05] !log failing over ganeti master in codfw from ganeti2001 to ganeti2003 [13:36:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:12] (03CR) 10Gehel: [C: 032] elasticsearch: hotthread logger sends output to log_file [puppet] - 10https://gerrit.wikimedia.org/r/472439 (owner: 10Gehel) [13:37:12] !log draining ganeti2001 for reboot/kernel security update [13:37:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:40] !log Done reimaging rdb1006 - T206450 [13:38:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:43] T206450: Reorganize our redis rdb1/rdb2 clusters - https://phabricator.wikimedia.org/T206450 [13:39:14] PROBLEM - traffic-pool service on cp2012 is CRITICAL: CRITICAL - Expecting active but unit traffic-pool is inactive [13:41:04] PROBLEM - HTTPS Unified ECDSA on cp2012 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (ECDSA) valid until 2018-11-22 07:59:59 +0000 (expires in 13 days) [13:41:21] (03CR) 10Alexandros Kosiaris: "> Nice! Would it make more sense to upgrade to a more recent version of stdlib instead of picking just pieces of it?" 
[puppet] - 10https://gerrit.wikimedia.org/r/472363 (owner: 10Dzahn) [13:42:54] PROBLEM - HTTPS Unified RSA on cp2012 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2018-11-22 07:59:59 +0000 (expires in 13 days) [13:46:44] PROBLEM - HTTPS Unified ECDSA on cp2006 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (ECDSA) valid until 2018-11-22 07:59:59 +0000 (expires in 13 days) [13:46:54] (03CR) 10Filippo Giunchedi: [C: 032] hieradata: rollout rsyslog_exporter in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/472403 (https://phabricator.wikimedia.org/T205849) (owner: 10Filippo Giunchedi) [13:47:02] (03PS2) 10Filippo Giunchedi: hieradata: rollout rsyslog_exporter in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/472403 (https://phabricator.wikimedia.org/T205849) [13:48:34] PROBLEM - HTTPS Unified RSA on cp2006 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2018-11-22 07:59:59 +0000 (expires in 13 days) [13:48:40] (03PS2) 10GTirloni: labs puppetmaster: Remove old promethium baremetal stuff [puppet] - 10https://gerrit.wikimedia.org/r/470101 (owner: 10Alex Monk) [13:48:44] (03PS2) 10GTirloni: role::labs::instance: Remove $::virtual == 'kvm' check for promethium [puppet] - 10https://gerrit.wikimedia.org/r/470100 (owner: 10Alex Monk) [13:48:48] (03PS2) 10GTirloni: Remove all remaining wdq_mm references [puppet] - 10https://gerrit.wikimedia.org/r/463325 (owner: 10Alex Monk) [13:52:41] !log installing postgres updates on labsdb1006/1007 [13:52:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:05] (03PS1) 10Andrew Bogott: Horizon: move projects to eqiad1-r: recommendatation-api, sciencesource, sentry [puppet] - 10https://gerrit.wikimedia.org/r/472448 (https://phabricator.wikimedia.org/T204745) [13:56:32] (03PS1) 10Effie Mouzeli: Reimage rdb2005/rdb2006 [puppet] - 10https://gerrit.wikimedia.org/r/472449 (https://phabricator.wikimedia.org/T206450) [13:57:29] (03CR) 10Andrew 
Bogott: [C: 032] Horizon: move projects to eqiad1-r: recommendatation-api, sciencesource, sentry [puppet] - 10https://gerrit.wikimedia.org/r/472448 (https://phabricator.wikimedia.org/T204745) (owner: 10Andrew Bogott) [14:00:04] Deploy window MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181108T1400) [14:12:36] !log rebooting releases2001 for some tests with ssbd for KVM [14:12:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:52] (03CR) 10DCausse: "the logic looks fine to me, left one suggestion" (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/468558 (https://phabricator.wikimedia.org/T207918) (owner: 10Mathew.onipe) [14:16:59] (03PS1) 10Effie Mouzeli: role::eqiad::scb: switch rdb1003:6382 with rdb1005:6379 [puppet] - 10https://gerrit.wikimedia.org/r/472454 (https://phabricator.wikimedia.org/T206450) [14:54:44] PROBLEM - puppet last run on lvs4007 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:02:14] RECOVERY - Host kubestage1001.mgmt is UP: PING WARNING - Packet loss = 93%, RTA = 0.72 ms [15:07:55] !log zeroize asw-c8-codfw (decom) [15:07:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:14] PROBLEM - Host kubestage1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:59] (03PS1) 10Joal: Add timer importing page-history dumps to hadoop [puppet] - 10https://gerrit.wikimedia.org/r/472472 (https://phabricator.wikimedia.org/T202489) [15:12:23] elukey: --^ for when ou have time :) [15:13:00] (03CR) 10jerkins-bot: [V: 04-1] Add timer importing page-history dumps to hadoop [puppet] - 10https://gerrit.wikimedia.org/r/472472 (https://phabricator.wikimedia.org/T202489) (owner: 10Joal) [15:14:56] 10Operations, 10SCB: Changeprop: Error during deduplication - https://phabricator.wikimedia.org/T209064 (10jijiki) p:05Triage>03Normal [15:17:43] (03PS2) 10Elukey: Add timer importing page-history dumps to hadoop [puppet] - 10https://gerrit.wikimedia.org/r/472472 (https://phabricator.wikimedia.org/T202489) (owner: 10Joal) [15:20:13] !log 'cd /srv/deployment/ores/deploy/submodules/wheels && sudo -u deploy-service git lfs pull' on all ores1* and ores2* hosts T209060 [15:20:15] RECOVERY - puppet last run on lvs4007 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [15:20:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:16] T209060: ores returns error for draftquality models - https://phabricator.wikimedia.org/T209060 [15:20:44] (03CR) 10Elukey: [C: 031] "pcc looks good https://puppet-compiler.wmflabs.org/compiler1002/13410/an-coord1001.eqiad.wmnet/!" 
[puppet] - 10https://gerrit.wikimedia.org/r/472472 (https://phabricator.wikimedia.org/T202489) (owner: 10Joal) [15:20:59] 10Operations, 10ops-codfw: Decommission asw-c8-codfw - https://phabricator.wikimedia.org/T209066 (10ayounsi) p:05Triage>03Low [15:25:44] !log rolling restart of celery on ores nodes (T209060) [15:25:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:47] T209060: ores returns error for draftquality models - https://phabricator.wikimedia.org/T209060 [15:29:25] akosiaris: it recovered: https://grafana.wikimedia.org/dashboard/db/ores?refresh=1m&panelId=2&fullscreen&orgId=1&from=now-24h&to=now-1m [15:29:27] (03PS1) 10Addshore: beta: run git submodule sync on beta autoupdate [puppet] - 10https://gerrit.wikimedia.org/r/472476 (https://phabricator.wikimedia.org/T209065) [15:29:48] hio opsen, [15:30:06] it would be great if someone could review the above patch (and merge? :D) to unblock beta code updates [15:30:52] 10Operations, 10Cloud-VPS, 10Wikidata, 10Wikidata-Query-Service, and 2 others: WDQS tests can no longer edit test.wikidata.org - https://phabricator.wikimedia.org/T208986 (10Smalyshev) 05Open>03Resolved a:03Smalyshev Looks like it's fine now. [15:31:04] Amir1: \o/ [15:31:06] nice find [15:31:49] akosiaris: can I steal you to look at that beta code update puppet change? :D [15:31:55] addshore: I was looking at it [15:31:59] YAY! :D [15:32:05] have you tested it already ? 
[15:32:19] it does LGTM, just making sure [15:32:42] i havn't actually tested the script itself anywhere, but I don't see it breaking, i have tested what the script is doing though [15:32:47] but not using the script itself [15:33:10] (03CR) 10Lucas Werkmeister (WMDE): [C: 031] beta: run git submodule sync on beta autoupdate (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/472476 (https://phabricator.wikimedia.org/T209065) (owner: 10Addshore) [15:33:12] ok, let's merge and see if beta will continue working after that ;-) [15:33:15] (03CR) 10Alexandros Kosiaris: [C: 032] beta: run git submodule sync on beta autoupdate [puppet] - 10https://gerrit.wikimedia.org/r/472476 (https://phabricator.wikimedia.org/T209065) (owner: 10Addshore) [15:35:09] (03PS1) 10Addshore: beta: update autoupdate submodule_sync doc string [puppet] - 10https://gerrit.wikimedia.org/r/472481 [15:35:14] bah ^^ updated doc string [15:35:56] how can i force a puppet run there? :) [15:36:00] * addshore never remembers [15:36:33] puppet agent -tv addshore [15:37:08] great! [15:37:26] alternatively it'll run once every 30 minutes [15:37:39] that option sounds slow :) [15:37:47] wow [15:37:48] hang on [15:37:52] root@deployment-puppetmaster03:/var/lib/git/operations/puppet(production u+16-144)# [15:37:59] it's massively behind on puppet commits [15:38:03] * Krenair fixes [15:38:26] (03PS1) 10Effie Mouzeli: role::redis::misc: Added maxmemory-policy: volatile-lru option [puppet] - 10https://gerrit.wikimedia.org/r/472484 (https://phabricator.wikimedia.org/T209064) [15:39:28] (03PS1) 10Bstorm: Revert "wiki replicas: depool lasbdb1010 for view changes" [puppet] - 10https://gerrit.wikimedia.org/r/472485 [15:39:56] Krenair: yeh, when i just ran puppet i didnt get my updated file :( [15:39:58] yeah so we have an old version of https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/434055/ cherry-picked [15:40:10] Krenair: any idea why? 
[15:40:13] so puppet updates in beta have been broken for 3ish days [15:40:20] well it got merged and someone forgot to remove the old cherry-pick [15:40:23] at least it is only 3 days :) [15:40:32] should normally be done by whoever cherry-picked it [15:40:47] * addshore slaps whoever cherrypicked it with a trout [15:40:52] well Krinkle did back in July [15:41:31] what I'm going to do is drop the cherry-pick and rebase [15:41:38] coool [15:41:44] tell me when i can re run puppet again [15:41:44] pick a11a860e8e [LOCAL HACK] tls certs for deployment-elastic* [15:41:44] pick 4d48ecb01e Serve WebP variants for the hottest thumbnails [15:41:44] pick a7cd1b6ea0 Move declaration of diamond package out of diamond class [15:41:49] (03CR) 10Alexandros Kosiaris: [C: 032] beta: update autoupdate submodule_sync doc string [puppet] - 10https://gerrit.wikimedia.org/r/472481 (owner: 10Addshore) [15:41:55] thanks akosiaris :) [15:42:05] now it's rebased properly [15:42:10] root@deployment-puppetmaster03:/var/lib/git/operations/puppet(production * u+15)# [15:42:13] * addshore runs puppet again [15:42:14] RECOVERY - Host kubestage1001.mgmt is UP: PING WARNING - Packet loss = 93%, RTA = 0.81 ms [15:42:43] !log disable /etc/logrotate.d/udp2log-mw for a bit on mwlog1001 [15:42:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:49] addshore, so according to deployment-deploy01:/etc/cron.d/puppet that host normally runs puppet at 29 and 59 minutes past the hour. 
I think those numbers are generated based on a hash of the hostname [15:45:26] cool, beta code updates are fixed :) [15:45:29] speedy resolution [15:49:14] PROBLEM - Host kubestage1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:49:30] (03CR) 10Alexandros Kosiaris: [C: 04-1] "technically LGTM, minor comment about the commit message" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/472484 (https://phabricator.wikimedia.org/T209064) (owner: 10Effie Mouzeli) [15:49:48] (03CR) 10Alexandros Kosiaris: [C: 031] role::eqiad::scb: switch rdb1003:6382 with rdb1005:6379 [puppet] - 10https://gerrit.wikimedia.org/r/472454 (https://phabricator.wikimedia.org/T206450) (owner: 10Effie Mouzeli) [15:50:24] (03CR) 10Alexandros Kosiaris: [C: 031] Reimage rdb2005/rdb2006 [puppet] - 10https://gerrit.wikimedia.org/r/472449 (https://phabricator.wikimedia.org/T206450) (owner: 10Effie Mouzeli) [15:53:02] (03PS2) 10Vgutierrez: certcentral: Evaluate order status after creation [software/certcentral] - 10https://gerrit.wikimedia.org/r/472188 (https://phabricator.wikimedia.org/T208948) [15:56:29] (03PS2) 10Effie Mouzeli: role::redis::misc: Added maxmemory-policy: volatile-lru option [puppet] - 10https://gerrit.wikimedia.org/r/472484 (https://phabricator.wikimedia.org/T209064) [15:57:34] (03CR) 10jerkins-bot: [V: 04-1] role::redis::misc: Added maxmemory-policy: volatile-lru option [puppet] - 10https://gerrit.wikimedia.org/r/472484 (https://phabricator.wikimedia.org/T209064) (owner: 10Effie Mouzeli) [16:00:03] (03PS1) 10Vgutierrez: certcentral: Stop using acme.client.poll_and_finalize() [software/certcentral] - 10https://gerrit.wikimedia.org/r/472487 (https://phabricator.wikimedia.org/T208967) [16:01:37] (03PS3) 10Effie Mouzeli: role::redis::misc: Added maxmemory-policy: volatile-lru option [puppet] - 10https://gerrit.wikimedia.org/r/472484 (https://phabricator.wikimedia.org/T209064) [16:02:19] (03CR) 10jerkins-bot: [V: 04-1] certcentral: Stop using 
acme.client.poll_and_finalize() [software/certcentral] - 10https://gerrit.wikimedia.org/r/472487 (https://phabricator.wikimedia.org/T208967) (owner: 10Vgutierrez) [16:05:54] (03PS2) 10Vgutierrez: certcentral: Stop using acme.client.poll_and_finalize() [software/certcentral] - 10https://gerrit.wikimedia.org/r/472487 (https://phabricator.wikimedia.org/T208967) [16:05:59] pylint is never happy [16:06:50] <_joe_> vgutierrez: you know what you need to do [16:06:52] (03CR) 10Effie Mouzeli: [C: 032] role::redis::misc: Added maxmemory-policy: volatile-lru option [puppet] - 10https://gerrit.wikimedia.org/r/472484 (https://phabricator.wikimedia.org/T209064) (owner: 10Effie Mouzeli) [16:06:53] <_joe_> I think https://jynus.com/better-call-volans.jpg [16:07:09] we should look into black at some point [16:07:32] not b black, just https://black.readthedocs.io/en/stable/ [16:07:35] (03CR) 10Giuseppe Lavagetto: [C: 031] "good catch, sigh" [puppet] - 10https://gerrit.wikimedia.org/r/472484 (https://phabricator.wikimedia.org/T209064) (owner: 10Effie Mouzeli) [16:07:47] it's not 100% volans-compliant though ;) [16:08:16] <_joe_> paravoid: I was joking about it with volans today [16:08:33] <_joe_> I was proposing to let it run on pybal's codebase [16:08:38] <_joe_> :P [16:08:38] so, php7 is working on the debugservers now? 
:) [16:08:44] <_joe_> addshore: yes [16:08:52] (03CR) 10Alex Monk: [C: 032] acme_requests: log order URI on non-recoverable finalization errors [software/certcentral] - 10https://gerrit.wikimedia.org/r/471995 (https://phabricator.wikimedia.org/T208859) (owner: 10Vgutierrez) [16:09:00] amazing, I'm going to go and poke around them a bit then :) [16:09:01] <_joe_> if you select "php 7.x" you will browse using it [16:09:13] (03PS1) 10Cwhite: statsd_proxy: add socket_receive_bufsize parameter [puppet] - 10https://gerrit.wikimedia.org/r/472489 (https://phabricator.wikimedia.org/T196484) [16:09:17] <_joe_> thanks, still very early work [16:09:39] <_joe_> expect rough edges [16:10:02] * addshore is looking forward to ditching hhvm [16:10:17] <_joe_> I will miss some of the HHVM goodies tbh [16:10:35] paravoid: on a random file on spicerack black added 30% more lines [16:10:35] such as which? :) [16:10:48] and I don't consider myself a space-saver ;) [16:10:53] (03Merged) 10jenkins-bot: acme_requests: log order URI on non-recoverable finalization errors [software/certcentral] - 10https://gerrit.wikimedia.org/r/471995 (https://phabricator.wikimedia.org/T208859) (owner: 10Vgutierrez) [16:10:57] addshore: Brilliant English there [16:11:04] Reedy: shhh [16:11:07] volans: because of each import in a separate line? [16:11:18] also things like [16:11:19] logger = logging.getLogger( [16:11:19] __name__ [16:11:19] ) [16:11:26] why? 
[16:11:43] and all def [16:11:44] <_joe_> addshore: like the fact perf was able to inspect where in php a busy thread is spending time [16:12:00] (03PS4) 10Effie Mouzeli: role::redis::misc: Added maxmemory-policy: volatile-lru option [puppet] - 10https://gerrit.wikimedia.org/r/472484 (https://phabricator.wikimedia.org/T209064) [16:12:07] correction: not all defs, all defs with more than X params [16:12:12] <_joe_> addshore: or the exection model that is threaded and not using ugly prefork like php-fpm does [16:12:25] <_joe_> addshore: or the wealth of metrics and debug info you get from hhvm [16:12:43] volans: it was somewhere on the faq IIRC [16:12:55] <_joe_> from an operational/observability POV, I think hhvm is *much* better than php-fpm [16:13:03] (03CR) 10jenkins-bot: acme_requests: log order URI on non-recoverable finalization errors [software/certcentral] - 10https://gerrit.wikimedia.org/r/471995 (https://phabricator.wikimedia.org/T208859) (owner: 10Vgutierrez) [16:13:18] _joe_: gotcha :) [16:13:37] _joe_: is there any sign of any of that coolness heading into php7? 
[16:16:50] <_joe_> addshore: well PHP 8 should include a jit compiler, IIRC, so at least perf inspectability should come back [16:17:06] <_joe_> I won't miss the esoteric deadlocks though for sure [16:22:13] (03PS11) 10Dzahn: icinga: fix path to retention.dat file on stretch [puppet] - 10https://gerrit.wikimedia.org/r/472352 (https://phabricator.wikimedia.org/T202782) [16:23:50] (03PS1) 10Bstorm: wiki replicas: depool labsdb1009 for updates [puppet] - 10https://gerrit.wikimedia.org/r/472492 (https://phabricator.wikimedia.org/T189158) [16:24:10] (03CR) 10Vgutierrez: "Tested successfully in certcentral1001, log pasted in https://phabricator.wikimedia.org/T208948#4728733" [software/certcentral] - 10https://gerrit.wikimedia.org/r/472188 (https://phabricator.wikimedia.org/T208948) (owner: 10Vgutierrez) [16:26:30] (03PS2) 10Effie Mouzeli: Reimage rdb2005/rdb2006 [puppet] - 10https://gerrit.wikimedia.org/r/472449 (https://phabricator.wikimedia.org/T206450) [16:29:29] !log enable Zayo transit on cr3-ulsfo [16:29:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:13] (03PS2) 10Cwhite: statsd_proxy: add socket_receive_bufsize parameter [puppet] - 10https://gerrit.wikimedia.org/r/472489 (https://phabricator.wikimedia.org/T196484) [16:30:43] (03PS3) 10Cwhite: statsd_proxy: add socket_receive_bufsize parameter [puppet] - 10https://gerrit.wikimedia.org/r/472489 (https://phabricator.wikimedia.org/T196484) [16:35:54] (03CR) 10Effie Mouzeli: [C: 032] Reimage rdb2005/rdb2006 [puppet] - 10https://gerrit.wikimedia.org/r/472449 (https://phabricator.wikimedia.org/T206450) (owner: 10Effie Mouzeli) [16:38:47] 10Operations, 10DBA, 10Patch-For-Review, 10User-Banyek: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] - https://phabricator.wikimedia.org/T208383 (10jcrespo) While it looks good, please wait to have at least one positive review unless there is an emergency- as all t... 
[16:39:22] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): Relabel labvirt1017.eqiad.wmnet as cloudvirt1017.eqiad.wmnet - https://phabricator.wikimedia.org/T208945 (10aborrero) a:05aborrero>03None [16:45:39] (03CR) 10Jcrespo: [C: 031] "This is deployable as is, check if you can go the extra mile and do the stilistic changes I suggest on the comments given on the files, so" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/470851 (https://phabricator.wikimedia.org/T208383) (owner: 10Banyek) [16:45:49] 10Operations, 10ops-eqiad: Missing rack face/position for 2 eqiad devices - https://phabricator.wikimedia.org/T209073 (10ayounsi) p:05Triage>03Low [16:47:58] 10Operations, 10DBA, 10Patch-For-Review, 10User-Banyek: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] - https://phabricator.wikimedia.org/T208383 (10jcrespo) I voted +1 but check if you can fix some minor style things I suggest so all mariadb roles look similar. Oth... 
[16:48:15] (03CR) 10Filippo Giunchedi: [C: 031] statsd_proxy: add socket_receive_bufsize parameter [puppet] - 10https://gerrit.wikimedia.org/r/472489 (https://phabricator.wikimedia.org/T196484) (owner: 10Cwhite) [16:50:14] !log akosiaris@deploy1001 scap-helm zotero install --name alextest --set main_app.version=20181019165254-production --set monitoring.enable=true charts/zotero [namespace: zotero, clusters: staging] [16:50:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:23] !log akosiaris@deploy1001 scap-helm zotero install --name alextest --set main_app.version=20181019165254-production --set monitoring.enable=true charts/zotero [namespace: zotero, clusters: staging] [16:50:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:38] dammit [16:51:03] 10Operations, 10ops-codfw, 10ops-eqiad, 10ops-ulsfo: Devices with wmf* names and status active - https://phabricator.wikimedia.org/T209074 (10ayounsi) p:05Triage>03Low [16:59:11] !log akosiaris@deploy1001 scap-helm zotero [namespace: zotero, clusters: staging] [16:59:11] !log akosiaris@deploy1001 scap-helm zotero cluster staging completed [16:59:11] !log akosiaris@deploy1001 scap-helm zotero finished [16:59:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:33] grrr [16:59:35] need to fix this [17:00:05] godog and _joe_: Your horoscope predicts another unfortunate Puppet SWAT(Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181108T1700). [17:00:05] No GERRIT patches in the queue for this window AFAICS. 
[17:01:55] RECOVERY - traffic-pool service on cp2012 is OK: OK - traffic-pool is active [17:02:04] \o/ [17:06:02] 10Operations, 10Traffic, 10Patch-For-Review: Add ex cp-misc_codfw to text and upload - https://phabricator.wikimedia.org/T208588 (10ema) cp2006/cp2012 reimaged and added to cache_text. The nodes are currently depooled but ready to be put back into service. [17:07:13] !log upload libfastjson 0.99.8-1~bpo9+1wmf1 version bump only [17:07:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:02] (03PS1) 10Legoktm: Add PHP version information to log entries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472498 (https://phabricator.wikimedia.org/T209076) [17:11:37] <_joe_> legoktm: stop doing my job for me, or I'll have to pay you :P [17:11:43] <_joe_> seriously, <3 [17:11:58] hahaha [17:15:26] <_joe_> legoktm: [17:15:51] <_joe_> I was wondering like 2 hours ago that I should make it possible to distinguish the logs in the dashboard [17:16:28] 10Operations, 10User-herron: Improve visibility of incoming operations tasks - https://phabricator.wikimedia.org/T197624 (10herron) No Objections. Ready to go forward with this! 
[17:16:37] I beat you by a few hours then, I had the idea last night when I was talking with b.pirkle :) [17:17:41] the other thing I was thinking about was the PHP7 + XWMD [17:18:07] IIRC the browser extension has a timeout, so it auto disables after like 5 minutes [17:18:13] 10Operations, 10ops-codfw: Rename asw-c4-codfw and asw2-c4-codfw - https://phabricator.wikimedia.org/T209077 (10ayounsi) p:05Triage>03Low [17:19:06] https://github.com/wikimedia/WikimediaDebug/blob/master/background.js#L85 15 minutes [17:19:13] but I think we want the php7 part to be more sticky [17:19:53] so either we hack the extension to not auto timeout, or allow for some other opt-in mechanism like the cookie we used last time [17:20:04] _joe_: ^ [17:20:48] <_joe_> legoktm: well you can add the cookie to your browser instead [17:21:30] <_joe_> legoktm: my idea was to start working on traffic layer separation [17:21:34] RECOVERY - Host kubestage1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.98 ms [17:21:44] <_joe_> and after than, create a beta feature [17:21:57] <_joe_> until we get there, I think it's ok for us to set up a cookie manually [17:24:08] !log repooling labsdb1010 (T189158) [17:24:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:11] T189158: Change `image` view to properly expose the new `img_description_id` field - https://phabricator.wikimedia.org/T189158 [17:24:19] (03CR) 10Banyek: [C: 032] Revert "wiki replicas: depool lasbdb1010 for view changes" [puppet] - 10https://gerrit.wikimedia.org/r/472485 (owner: 10Bstorm) [17:24:33] (03PS2) 10Banyek: Revert "wiki replicas: depool lasbdb1010 for view changes" [puppet] - 10https://gerrit.wikimedia.org/r/472485 (owner: 10Bstorm) [17:24:38] (03CR) 10Banyek: [V: 032 C: 032] Revert "wiki replicas: depool lasbdb1010 for view changes" [puppet] - 10https://gerrit.wikimedia.org/r/472485 (owner: 10Bstorm) [17:28:34] PROBLEM - Host kubestage1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% 
[17:28:47] _joe_: can XWMD be set via a cookie too? I didn't realize that [17:29:07] !log depooling labsdb1009 (T189158) [17:29:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:11] <_joe_> legoktm: not XWMD, but choosing php over hhvm, yes [17:29:20] (03CR) 10Banyek: [C: 032] wiki replicas: depool labsdb1009 for updates [puppet] - 10https://gerrit.wikimedia.org/r/472492 (https://phabricator.wikimedia.org/T189158) (owner: 10Bstorm) [17:29:26] how? :) [17:29:32] (03PS2) 10Banyek: wiki replicas: depool labsdb1009 for updates [puppet] - 10https://gerrit.wikimedia.org/r/472492 (https://phabricator.wikimedia.org/T189158) (owner: 10Bstorm) [17:29:35] (03CR) 10Banyek: [V: 032 C: 032] wiki replicas: depool labsdb1009 for updates [puppet] - 10https://gerrit.wikimedia.org/r/472492 (https://phabricator.wikimedia.org/T189158) (owner: 10Bstorm) [17:29:35] <_joe_> PHP_ENGINE=php7 [17:33:54] RECOVERY - Host kubestage1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.88 ms [17:33:55] _joe_: does it only work on debug servers? I set the cookie and I'm still seeing x-powered-by: HHVM... 
[17:36:01] I'll play with it later, gotta head to school now [17:36:14] <_joe_> legoktm: yes, only on the debug servers for now [17:36:18] <_joe_> next week, dunno [17:38:35] (03PS3) 10Vgutierrez: certcentral: Stop using acme.client.poll_and_finalize() [software/certcentral] - 10https://gerrit.wikimedia.org/r/472487 (https://phabricator.wikimedia.org/T208967) [17:40:43] (03PS3) 10Effie Mouzeli: Reimage rdb2005/rdb2006 [puppet] - 10https://gerrit.wikimedia.org/r/472449 (https://phabricator.wikimedia.org/T206450) [17:40:54] PROBLEM - Host kubestage1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:46:20] (03CR) 10Cwhite: [C: 032] statsd_proxy: add socket_receive_bufsize parameter [puppet] - 10https://gerrit.wikimedia.org/r/472489 (https://phabricator.wikimedia.org/T196484) (owner: 10Cwhite) [17:46:28] (03PS4) 10Cwhite: statsd_proxy: add socket_receive_bufsize parameter [puppet] - 10https://gerrit.wikimedia.org/r/472489 (https://phabricator.wikimedia.org/T196484) [17:50:01] (03CR) 10Cwhite: [C: 031] icinga: fix path to retention.dat file on stretch [puppet] - 10https://gerrit.wikimedia.org/r/472352 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [17:59:00] (03CR) 10Alex Monk: certcentral: Evaluate order status after creation (031 comment) [software/certcentral] - 10https://gerrit.wikimedia.org/r/472188 (https://phabricator.wikimedia.org/T208948) (owner: 10Vgutierrez) [18:00:05] cscott, arlolra, subbu, halfak, and Amir1: My dear minions, it's time we take the moon! Just kidding. Time for Services – Graphoid / Parsoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181108T1800). 
[18:02:28] (03PS3) 10Niedzielski: Enable Wikibase PageRandomLookup unexpected page_random value logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472356 (https://phabricator.wikimedia.org/T208796) [18:05:13] (03CR) 10Vgutierrez: certcentral: Evaluate order status after creation (031 comment) [software/certcentral] - 10https://gerrit.wikimedia.org/r/472188 (https://phabricator.wikimedia.org/T208948) (owner: 10Vgutierrez) [18:09:50] (03CR) 10Alex Monk: certcentral: Evaluate order status after creation (031 comment) [software/certcentral] - 10https://gerrit.wikimedia.org/r/472188 (https://phabricator.wikimedia.org/T208948) (owner: 10Vgutierrez) [18:09:54] (03PS1) 1020after4: Change static IPs to host names [puppet] - 10https://gerrit.wikimedia.org/r/472507 (https://phabricator.wikimedia.org/T208262) [18:12:35] 10Operations, 10Wikimedia-Mailing-lists: Requesting creation of librarycard-dev mailing list - https://phabricator.wikimedia.org/T209081 (10Samwalton9) [18:13:27] 10Operations, 10Patch-For-Review, 10User-Joe, 10User-jijiki: Reorganize our redis rdb1/rdb2 clusters - https://phabricator.wikimedia.org/T206450 (10jijiki) [18:14:19] (03PS2) 1020after4: Change static IPs to host names [puppet] - 10https://gerrit.wikimedia.org/r/472507 (https://phabricator.wikimedia.org/T208262) [18:19:29] 10Operations, 10Patch-For-Review: Onboard Fabián Sellés Rosa to SRE - https://phabricator.wikimedia.org/T208715 (10MoritzMuehlenhoff) [18:19:30] !log update statsd-proxy to 0.0.9-2 on graphite1004 [18:19:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:39] 10Operations, 10Patch-For-Review: Onboard Fabián Sellés Rosa to SRE - https://phabricator.wikimedia.org/T208715 (10MoritzMuehlenhoff) Added to pwstore. 
[18:20:17] 10Operations, 10Gadgets, 10MediaWiki-Cache, 10MW-1.33-notes (1.33.0-wmf.4; 2018-11-13), and 4 others: Mcrouter periodically reports soft TKOs for mc[1,2]035 leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10elukey) I tried to compare the mc1035's bytes read rate with the big... [18:21:07] 10Operations, 10Cloud-VPS, 10netops, 10cloud-services-team (Kanban): dmz_cidr only includes some wikimedia public IP ranges, leading to some very strange behaviour - https://phabricator.wikimedia.org/T174596 (10ayounsi) [18:22:00] 10Operations, 10Patch-For-Review, 10User-Joe, 10User-jijiki: Reorganize our redis rdb1/rdb2 clusters - https://phabricator.wikimedia.org/T206450 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ``` ['rdb2005.codfw.wmnet'] ``` The log can be found in `/... [18:24:02] 10Operations, 10Patch-For-Review, 10User-Joe, 10User-jijiki: Reorganize our redis rdb1/rdb2 clusters - https://phabricator.wikimedia.org/T206450 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ``` ['rdb2006.codfw.wmnet'] ``` The log can be found in `/... 
[18:31:29] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS6939/IPv6: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:34:38] (03CR) 10BryanDavis: Add PHP version information to log entries (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472498 (https://phabricator.wikimedia.org/T209076) (owner: 10Legoktm) [18:37:23] (03CR) 10Dzahn: [C: 032] icinga: fix path to retention.dat file on stretch [puppet] - 10https://gerrit.wikimedia.org/r/472352 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [18:38:35] (03CR) 10Alex Monk: certcentral: Stop using acme.client.poll_and_finalize() (031 comment) [software/certcentral] - 10https://gerrit.wikimedia.org/r/472487 (https://phabricator.wikimedia.org/T208967) (owner: 10Vgutierrez) [18:39:40] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:40:19] (03CR) 10Alex Monk: certcentral: Stop using acme.client.poll_and_finalize() (031 comment) [software/certcentral] - 10https://gerrit.wikimedia.org/r/472487 (https://phabricator.wikimedia.org/T208967) (owner: 10Vgutierrez) [18:41:06] (03PS12) 10Dzahn: icinga: fix path to retention.dat file on stretch [puppet] - 10https://gerrit.wikimedia.org/r/472352 (https://phabricator.wikimedia.org/T202782) [18:48:06] (03PS3) 10Vgutierrez: certcentral: Evaluate order status after creation [software/certcentral] - 10https://gerrit.wikimedia.org/r/472188 (https://phabricator.wikimedia.org/T208948) [18:48:08] (03PS4) 10Vgutierrez: certcentral: Stop using acme.client.poll_and_finalize() [software/certcentral] - 10https://gerrit.wikimedia.org/r/472487 (https://phabricator.wikimedia.org/T208967) [18:48:55] (03CR) 10Vgutierrez: certcentral: Evaluate order status after creation (031 comment) [software/certcentral] - 10https://gerrit.wikimedia.org/r/472188 (https://phabricator.wikimedia.org/T208948) (owner: 10Vgutierrez) [18:52:16] 
10Operations: Migrate tests from nose to pytest - https://phabricator.wikimedia.org/T208783 (10Bstorm) I recently added python tests to one of my modules with a significant python script in it. I honestly don't see how a python script that isn't tested that does something complicated enough to merit being in py... [18:53:17] (03CR) 10Vgutierrez: certcentral: Stop using acme.client.poll_and_finalize() (031 comment) [software/certcentral] - 10https://gerrit.wikimedia.org/r/472487 (https://phabricator.wikimedia.org/T208967) (owner: 10Vgutierrez) [18:59:44] 10Operations: Migrate tests from nose to pytest - https://phabricator.wikimedia.org/T208783 (10Bstorm) To put it a different way, I'm uncomfortable deploying //untested// python to the environment in cloud and generally prefer to test infrastructure code in general (rspec/unittest or whatever). At least within... [19:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Time to snap out of that daydream and deploy Morning SWAT (Max 6 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181108T1900). [19:00:04] Thiemo_WMDE: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:23] * Thiemo_WMDE is here. Again. :) [19:00:30] 10Operations, 10Patch-For-Review, 10User-Joe, 10User-jijiki: Reorganize our redis rdb1/rdb2 clusters - https://phabricator.wikimedia.org/T206450 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['rdb2005.codfw.wmnet'] ``` and were **ALL** successful. 
[19:00:33] 10Operations, 10Release Pipeline, 10Release-Engineering-Team: Design pipeline image versioning scheme - https://phabricator.wikimedia.org/T209088 (10thcipriani) [19:02:48] 10Operations, 10Patch-For-Review, 10User-Joe, 10User-jijiki: Reorganize our redis rdb1/rdb2 clusters - https://phabricator.wikimedia.org/T206450 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['rdb2006.codfw.wmnet'] ``` and were **ALL** successful. [19:06:00] Who is doing the SWAT? [19:09:00] (03CR) 10Pmiazga: "I think we should prefix this channel to makr this is cleary a wikibase specific channel. There are three different wikibase channels:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472356 (https://phabricator.wikimedia.org/T208796) (owner: 10Niedzielski) [19:10:21] (03CR) 10Pmiazga: "sorry for typos in the previous message, I tried to write it clearly. I wanted to say there is no nomenclature how to name wikibase channe" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472356 (https://phabricator.wikimedia.org/T208796) (owner: 10Niedzielski) [19:10:51] addshore: Sorry to bother you again, do you know if the SWAT planned for right now will happen? [19:12:11] (03CR) 1020after4: "cherry-picked on beta" [puppet] - 10https://gerrit.wikimedia.org/r/472507 (https://phabricator.wikimedia.org/T208262) (owner: 1020after4) [19:14:33] (03PS12) 10Cwhite: ntp: move diamond::collector to where it will only apply to ntp servers [puppet] - 10https://gerrit.wikimedia.org/r/464866 (https://phabricator.wikimedia.org/T183454) [19:14:44] zeljkof: Uh, do you know who is doing this SWAT? [19:16:56] (03CR) 10Cwhite: [C: 032] ntp: move diamond::collector to where it will only apply to ntp servers [puppet] - 10https://gerrit.wikimedia.org/r/464866 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [19:18:36] MaxSem: RoanKattouw: Dereckson: Sorry to bother you. You are listed next to the SWAT window where I would like to have a config change deployed. 
[19:18:56] Which is *now*, according to the bot. [19:20:18] Am I doing something wrong? Am I on the wrong channel? [19:20:45] you do seem to be in the right channel, and I can hear you :) [19:21:01] give me a few minutes and I can SWAT [19:21:08] Thanks, at least one ping from one of the 235 people in this room. Thanks. :-) [19:21:23] * thcipriani doffs cap [19:21:32] not everyone here can SWAT :P [19:21:38] I know. :-) [19:24:18] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 239, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:25:03] okie doke so looks like we're reverting a revert [19:25:22] fix looks like it was swatted earlier [19:25:26] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472411 (https://phabricator.wikimedia.org/T205942) (owner: 10Thiemo Kreuz (WMDE)) [19:25:28] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 241, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:26:35] (03Abandoned) 10Niedzielski: Enable Wikibase PageRandomLookup unexpected page_random value logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472356 (https://phabricator.wikimedia.org/T208796) (owner: 10Niedzielski) [19:26:44] (03Merged) 10jenkins-bot: Revert "Disable wmgUseTwoColConflict everywhere" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472411 (https://phabricator.wikimedia.org/T205942) (owner: 10Thiemo Kreuz (WMDE)) [19:27:56] Thiemo_WMDE: your change is on mwdebug1002, check please [19:28:33] Checking right now. [19:29:46] thcipriani: Done. Works as expected. Please go on. 
[19:29:58] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 239, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:30:40] ok, going live everywhere [19:30:48] RECOVERY - Host kubestage1001.mgmt is UP: PING WARNING - Packet loss = 93%, RTA = 305.04 ms [19:33:20] (03CR) 10jenkins-bot: Revert "Disable wmgUseTwoColConflict everywhere" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472411 (https://phabricator.wikimedia.org/T205942) (owner: 10Thiemo Kreuz (WMDE)) [19:35:40] !log thcipriani@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:472411|Revert "Disable wmgUseTwoColConflict everywhere"]] T205942 T208840 T209012 T209036 (duration: 00m 54s) [19:35:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:47] T205942: Some edit submits get fatal InvalidArgumentException "The title does not refer to an existing page" from TwoColConflict extension - https://phabricator.wikimedia.org/T205942 [19:35:47] T209012: Preview on a new site breaks and leads to loose of the site content - https://phabricator.wikimedia.org/T209012 [19:35:48] T209036: The title "Foo" does not refer to an existing page - https://phabricator.wikimedia.org/T209036 [19:35:48] T208840: Monitor user reports about imaginary conflicts - https://phabricator.wikimedia.org/T208840 [19:35:49] ^ Thiemo_WMDE live everywhere now [19:36:03] Yes, I see it and can confirm. [19:36:07] Thanks a lot! [19:37:28] PROBLEM - Host kubestage1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:37:49] thanks for the patches, sorry for the delay: swat windows fall at weird times for folks listed as deployers on occasion. 
Usually someone has time to run SWAT, but unfortunately not always :( [19:38:04] (03CR) 10Alex Monk: [C: 032] certcentral: Stop using acme.client.poll_and_finalize() [software/certcentral] - 10https://gerrit.wikimedia.org/r/472487 (https://phabricator.wikimedia.org/T208967) (owner: 10Vgutierrez) [19:38:23] (03CR) 10Alex Monk: [C: 032] certcentral: Evaluate order status after creation [software/certcentral] - 10https://gerrit.wikimedia.org/r/472188 (https://phabricator.wikimedia.org/T208948) (owner: 10Vgutierrez) [19:39:53] (03PS1) 10Cwhite: icinga: add tmpfs options to stretch [puppet] - 10https://gerrit.wikimedia.org/r/472519 (https://phabricator.wikimedia.org/T202782) [19:39:57] thcipriani, yeah it appears to be a recurring thing that no deployers are immediately available at the beginning of swat windows [19:40:26] (03Merged) 10jenkins-bot: certcentral: Evaluate order status after creation [software/certcentral] - 10https://gerrit.wikimedia.org/r/472188 (https://phabricator.wikimedia.org/T208948) (owner: 10Vgutierrez) [19:40:28] (03Merged) 10jenkins-bot: certcentral: Stop using acme.client.poll_and_finalize() [software/certcentral] - 10https://gerrit.wikimedia.org/r/472487 (https://phabricator.wikimedia.org/T208967) (owner: 10Vgutierrez) [19:40:59] (03PS2) 10Cwhite: icinga: add tmpfs options to stretch [puppet] - 10https://gerrit.wikimedia.org/r/472519 (https://phabricator.wikimedia.org/T202782) [19:42:05] (03CR) 10jenkins-bot: certcentral: Evaluate order status after creation [software/certcentral] - 10https://gerrit.wikimedia.org/r/472188 (https://phabricator.wikimedia.org/T208948) (owner: 10Vgutierrez) [19:42:08] (03CR) 10jenkins-bot: certcentral: Stop using acme.client.poll_and_finalize() [software/certcentral] - 10https://gerrit.wikimedia.org/r/472487 (https://phabricator.wikimedia.org/T208967) (owner: 10Vgutierrez) [19:42:15] Krenair: indeed, happens with some frequency. 
Some part everyone being constantly busy, some part bystander effect :\ [19:44:30] probably mixed with a healthy smidgen of anxiety about breaking all the wikis :) [19:58:16] thcipriani, what were they thinking when they signed up to run swat then? xD [19:59:40] the question I mutter to myself as I compulsively refresh logstash in three tabs [19:59:48] (03CR) 10Faidon Liambotis: [C: 04-1] cloudvps: eqiad1: add cloudinstances2b virtual router FQDNs (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/460320 (https://phabricator.wikimedia.org/T202886) (owner: 10Arturo Borrero Gonzalez) [20:00:04] thcipriani: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - Americas version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181108T2000). [20:03:17] oh boy [20:03:22] * thcipriani does [20:03:50] (03CR) 10Thcipriani: [C: 031] "Only way I can think to achieve this outcome as well." [puppet] - 10https://gerrit.wikimedia.org/r/472018 (owner: 10Alex Monk) [20:04:16] lol [20:07:13] thcipriani, I was anxious about it at first [20:07:25] no one really walked me through the process [20:07:32] but after some time I was fine [20:07:48] and I think there are more protections around it now than there was then [20:08:02] it's gotten safer over time, I think [20:08:23] I hope [20:08:31] we didn't have the canary checks then [20:08:44] that came later [20:09:11] though apparently mediawiki-config tests are still limited to basic syntax? [20:11:16] apart from the tests we do syntax checks of php and json, goes to canaries we hit the canaries with some very basic requests/responses, we wait 20 seconds and we make sure logstash didn't explode, then we go everywhere. I think we've got this working for all syncs now. 
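The sync sequence described just above (lint, push to canaries, poke the canaries, wait 20 seconds, check that logstash did not explode, then go everywhere) can be sketched roughly as below. This is only an illustration of the ordering as stated in the channel, not the deployment tool's actual code: `staged_sync` and every callable passed to it are hypothetical names, and the real checks are assumed to be supplied by the caller.

```python
import time
from typing import Callable, Sequence


def staged_sync(
    lint: Callable[[], bool],
    sync_to: Callable[[Sequence[str]], None],
    canary_responds: Callable[[str], bool],
    error_rate_ok: Callable[[], bool],
    canaries: Sequence[str],
    all_hosts: Sequence[str],
    wait_seconds: float = 20.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Return "synced" on success, otherwise the name of the stage that failed.

    All callables are hypothetical stand-ins for the real checks.
    """
    # 1. Syntax checks (PHP / JSON in the description above).
    if not lint():
        return "lint"
    # 2. Push the change to the canary hosts only.
    sync_to(canaries)
    # 3. Hit each canary with a very basic request/response check.
    if not all(canary_responds(host) for host in canaries):
        return "canary-requests"
    # 4. Wait, then make sure the error logs did not spike.
    sleep(wait_seconds)
    if not error_rate_ok():
        return "error-rate"
    # 5. Only now sync to every host.
    sync_to(all_hosts)
    return "synced"
```

The point of the ordering is that a failure at any stage stops the rollout before the full fleet is touched; the manual mwdebug check mentioned for SWAT sits outside this sketch.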
[20:11:46] so sync-file/sync-dir/sync-wikiversions and just sync do that [20:12:00] 10Operations, 10ops-ulsfo: ulsfo: install new PDUs in racks / phase out APC loaner PDU use - https://phabricator.wikimedia.org/T209101 (10RobH) p:05Triage>03Normal [20:12:42] plus manual checks on mwdebug for swat which have probably caught more issues than anything else [20:13:26] 10Operations, 10ops-ulsfo: ulsfo: install new PDUs in racks / phase out APC loaner PDU use - https://phabricator.wikimedia.org/T209101 (10RobH) [20:23:21] (03CR) 10Dzahn: [C: 04-1] "it looks almost perfect but the check_result_path gets set to no value in seems:" [puppet] - 10https://gerrit.wikimedia.org/r/472519 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [20:25:10] (03PS1) 10Thcipriani: all wikis to 1.33.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472526 [20:25:12] (03CR) 10Thcipriani: [C: 032] all wikis to 1.33.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472526 (owner: 10Thcipriani) [20:26:12] (03CR) 10Dzahn: [C: 04-1] "..but i don't see a reason why that happens.. ehm.." [puppet] - 10https://gerrit.wikimedia.org/r/472519 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [20:27:07] (03Merged) 10jenkins-bot: all wikis to 1.33.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472526 (owner: 10Thcipriani) [20:27:12] (03PS3) 10Cwhite: icinga: add tmpfs options to stretch [puppet] - 10https://gerrit.wikimedia.org/r/472519 (https://phabricator.wikimedia.org/T202782) [20:27:16] 10Operations, 10ops-ulsfo: ulsfo: install new PDUs in racks / phase out APC loaner PDU use - https://phabricator.wikimedia.org/T209101 (10RobH) [20:28:25] (03CR) 10Cwhite: "> ..but i don't see a reason why that happens.. ehm.." 
[puppet] - 10https://gerrit.wikimedia.org/r/472519 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [20:28:53] !log thcipriani@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.33.0-wmf.3 [20:29:15] 10Operations, 10ops-ulsfo: ulsfo: install new PDUs in racks / phase out APC loaner PDU use - https://phabricator.wikimedia.org/T209101 (10RobH) [20:29:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:46] (03CR) 10jenkins-bot: all wikis to 1.33.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472526 (owner: 10Thcipriani) [20:33:10] (03CR) 10Dzahn: [C: 04-1] icinga: add tmpfs options to stretch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/472519 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [20:33:14] (03CR) 10Urbanecm: [C: 04-1] "I'm wondering if this will work with wmgUseGrowthExperiments still false. Per CommonSettings.php, if wmgUseGrowthExperiments is false, the" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/471792 (https://phabricator.wikimedia.org/T208773) (owner: 10Kosta Harlan) [20:33:44] (03CR) 10Dzahn: "yes, finally found it too :) cool" [puppet] - 10https://gerrit.wikimedia.org/r/472519 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [20:34:46] 10Operations, 10ops-ulsfo: ulsfo: install new PDUs in racks / phase out APC loaner PDU use - https://phabricator.wikimedia.org/T209101 (10RobH) [20:36:17] (03CR) 10Dzahn: [C: 032] "yes, all good now and noop on einsteinium: https://puppet-compiler.wmflabs.org/compiler1002/13414/" [puppet] - 10https://gerrit.wikimedia.org/r/472519 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [20:41:35] (03PS6) 10BBlack: Add Google Translate X-Analytics tagging [puppet] - 10https://gerrit.wikimedia.org/r/471257 (https://phabricator.wikimedia.org/T208795) (owner: 10Dr0ptp4kt) [20:42:01] (03CR) 10BBlack: [C: 032] Add Google Translate X-Analytics tagging [puppet] - 
10https://gerrit.wikimedia.org/r/471257 (https://phabricator.wikimedia.org/T208795) (owner: 10Dr0ptp4kt) [20:46:31] (03CR) 10Sbisson: "Understanding first day is in the WikimediaEvents extension, not GrowthExperiments." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/471792 (https://phabricator.wikimedia.org/T208773) (owner: 10Kosta Harlan) [20:48:37] RECOVERY - Host kubestage1001.mgmt is UP: PING WARNING - Packet loss = 93%, RTA = 0.88 ms [20:50:17] what's up with that specicif mgmt address there. came back more than once now [20:50:36] (03CR) 10Catrope: Switch on data collection for Understanding First Day project (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/471792 (https://phabricator.wikimedia.org/T208773) (owner: 10Kosta Harlan) [20:51:30] (03CR) 10Catrope: Switch on data collection for Understanding First Day project (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/471792 (https://phabricator.wikimedia.org/T208773) (owner: 10Kosta Harlan) [20:54:02] (03PS1) 10GTirloni: Revert "wiki replicas: depool labsdb1009 for updates" [puppet] - 10https://gerrit.wikimedia.org/r/472530 (https://phabricator.wikimedia.org/T189158) [20:55:37] PROBLEM - Host kubestage1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [20:56:27] (03PS2) 10Urbanecm: Switch on data collection for Understanding First Day project [mediawiki-config] - 10https://gerrit.wikimedia.org/r/471792 (https://phabricator.wikimedia.org/T208773) (owner: 10Kosta Harlan) [20:58:11] (03PS3) 10Urbanecm: Switch on data collection for Understanding First Day project [mediawiki-config] - 10https://gerrit.wikimedia.org/r/471792 (https://phabricator.wikimedia.org/T208773) (owner: 10Kosta Harlan) [20:59:22] (03CR) 10Urbanecm: "I still would like to know if wmgUseGrowthExperiments=false is intended." 
(031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/471792 (https://phabricator.wikimedia.org/T208773) (owner: 10Kosta Harlan) [21:06:37] RECOVERY - Host kubestage1001.mgmt is UP: PING WARNING - Packet loss = 86%, RTA = 0.85 ms [21:06:44] 10Operations, 10ops-eqiad: kubestage1001.mgmt down or flapping - https://phabricator.wikimedia.org/T209112 (10Dzahn) [21:07:16] 10Operations, 10ops-eqiad: kubestage1001.mgmt down or flapping - https://phabricator.wikimedia.org/T209112 (10Dzahn) 16:06 <+icinga-wm> RECOVERY - Host kubestage1001.mgmt is UP: PING WARNING - Packet loss = 86%, RTA = 0.85 ms ^ but it kept doing this more than once now [21:13:37] PROBLEM - Host kubestage1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:15:06] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Wikidata-Query-Service-Sprint: Cleanup WDQS logging configuration - https://phabricator.wikimedia.org/T206121 (10Smalyshev) 05Open>03Resolved a:03Smalyshev I think we have mostly achieved this. [21:15:32] ACKNOWLEDGEMENT - Host kubestage1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T209112 [21:21:51] (03PS1) 10Bstorm: sonofgridengine: the configurator needs to find the right dirs [puppet] - 10https://gerrit.wikimedia.org/r/472570 (https://phabricator.wikimedia.org/T200557) [21:29:06] (03CR) 10Bstorm: [C: 032] sonofgridengine: the configurator needs to find the right dirs [puppet] - 10https://gerrit.wikimedia.org/r/472570 (https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm) [21:30:42] PROBLEM - Disk space on contint1001 is CRITICAL: DISK CRITICAL - free space: /srv 50472 MB (5% inode=87%) [21:36:44] (03CR) 10Andrew Bogott: "This looks OK to me. Once it'sbeen tested with a cherry-pick let me know and I can merge." 
[puppet] - 10https://gerrit.wikimedia.org/r/472018 (owner: 10Alex Monk) [21:41:53] (03CR) 10Alex Monk: "lgtm, only real change:" [puppet] - 10https://gerrit.wikimedia.org/r/472018 (owner: 10Alex Monk) [21:43:19] (03PS10) 10Andrew Bogott: network::constants labs: move deployment-prep instances out into hiera data [puppet] - 10https://gerrit.wikimedia.org/r/472018 (owner: 10Alex Monk) [21:44:47] (03CR) 10Andrew Bogott: [C: 032] network::constants labs: move deployment-prep instances out into hiera data [puppet] - 10https://gerrit.wikimedia.org/r/472018 (owner: 10Alex Monk) [21:49:03] (03PS1) 10BBlack: Add globalsign 2018 unified certs [puppet] - 10https://gerrit.wikimedia.org/r/472578 [21:49:05] (03PS1) 10BBlack: Deploy inactive globalsign 2018 unified certs [puppet] - 10https://gerrit.wikimedia.org/r/472579 [21:54:17] ACKNOWLEDGEMENT - Host kubestage1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T209112 [21:56:23] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [22:01:07] 10Operations, 10Traffic: Renew GlobalSign Unified in 2018 - https://phabricator.wikimedia.org/T206804 (10BBlack) `Must-Staple` didn't turn out to be a realistic option for GlobalSign, we'll look at it again later/elsewhere! [22:01:17] (03PS2) 10BBlack: Add globalsign 2018 unified certs [puppet] - 10https://gerrit.wikimedia.org/r/472578 (https://phabricator.wikimedia.org/T206804) [22:01:21] (03PS2) 10BBlack: Deploy inactive globalsign 2018 unified certs [puppet] - 10https://gerrit.wikimedia.org/r/472579 (https://phabricator.wikimedia.org/T206804) [22:02:02] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [22:06:30] (03PS1) 10Bstorm: sonofgridengine: python 3.5 does subprocess.run differently [puppet] - 10https://gerrit.wikimedia.org/r/472582 (https://phabricator.wikimedia.org/T200557) [22:06:32] 10Operations, 10Traffic, 10Patch-For-Review: Renew GlobalSign Unified in 2018 - https://phabricator.wikimedia.org/T206804 (10BBlack) The dual RSA+ECDSA certs above have: ` Not Before: Nov 8 21:37:02 2018 GMT Not Before: Nov 8 21:21:04 2018 GMT ` Which leaves us plenty of room for clock skew on the deploy... [22:06:56] (03CR) 10BBlack: [C: 032] Add globalsign 2018 unified certs [puppet] - 10https://gerrit.wikimedia.org/r/472578 (https://phabricator.wikimedia.org/T206804) (owner: 10BBlack) [22:14:38] (03CR) 10BBlack: [C: 032] Deploy inactive globalsign 2018 unified certs [puppet] - 10https://gerrit.wikimedia.org/r/472579 (https://phabricator.wikimedia.org/T206804) (owner: 10BBlack) [22:22:14] (03PS2) 10Bstorm: sonofgridengine: python 3.5 does subprocess.run differently [puppet] - 10https://gerrit.wikimedia.org/r/472582 (https://phabricator.wikimedia.org/T200557) [22:24:42] (03CR) 10Bstorm: [C: 032] sonofgridengine: python 3.5 does subprocess.run differently [puppet] - 10https://gerrit.wikimedia.org/r/472582 (https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm) [22:37:15] !log gerrit - adding Lucas Werkmeister (WMDE) to 'wmf-deployment' group for +2 on mw-config for T208518 access request [22:37:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:37:45] misses a bot [22:37:49] T208518 [22:37:54] T1 [22:37:54] T1: Get puppet runs into logstash - https://phabricator.wikimedia.org/T1 [22:38:15] hmm.. that other one is also public.. 
[22:39:00] T208518 [22:39:05] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment for Lucas Werkmeister - https://phabricator.wikimedia.org/T208518 (10Dzahn) 05Resolved>03Open [22:40:06] T208518 [22:40:11] T666 [22:40:12] T666: Document product/release cycle - https://phabricator.wikimedia.org/T666 [22:40:15] weird [22:41:27] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment for Lucas Werkmeister - https://phabricator.wikimedia.org/T208518 (10Dzahn) 17:37 < mutante> !log gerrit - adding Lucas Werkmeister (WMDE) to 'wmf-deployment' group for +2 on mw-config for T208518 access request @Lucas... [22:44:10] huh [22:46:04] someone should check the stashbot exception logs [22:46:19] that ticket is public like any other and whether it's open or resolved made no difference, fwiw [22:46:24] it could be https://github.com/bd808/tools-stashbot/blob/master/stashbot/bot.py#L297 [22:47:33] RECOVERY - Disk space on contint1001 is OK: DISK OK [22:48:17] (03PS1) 10Bstorm: sonofgridengine: skip "dot" files in the config and backport tests [puppet] - 10https://gerrit.wikimedia.org/r/472588 (https://phabricator.wikimedia.org/T200557) [22:48:28] it looks like conduit phid.lookup returns correctly for that: [22:48:30] {"T208518": {"phid": "PHID-TASK-vnytbwbb6zijypghix6c", "uri": "https://phabricator.wikimedia.org/T208518", "typeName": "Task", "type": "TASK", "name": "T208518", "fullName": "T208518: Requesting access to deployment for Lucas Werkmeister", "status": "open"}} [22:48:59] !log gerrit - adding Thomas Arrow to 'wmf-deployment' group for +2 on mw-config for T208491 access request [22:49:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:49:30] same effect on this [22:49:39] both are access requests [22:50:12] they have the "REQUEST" tag on top [22:50:17] does that mean they are not "tasks" ? 
[22:50:46] the box next to "Public" that is [22:51:17] even if it did, stashbot doesn't appear to check for that [22:51:28] and phab's API does not indicate it anyway [22:52:08] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment for tarrow - https://phabricator.wikimedia.org/T208491 (10Dzahn) < mutante> !log gerrit - adding Thomas Arrow to 'wmf-deployment' group for +2 on mw-config for T208491 access request You should now also have +2 on Gerrit. [22:52:50] bd808, ^ [22:53:30] * bd808 reads backscroll [22:54:18] (03PS2) 10Bstorm: sonofgridengine: skip "dot" files in the config and backport tests [puppet] - 10https://gerrit.wikimedia.org/r/472588 (https://phabricator.wikimedia.org/T200557) [22:54:56] 10Operations, 10Wikimedia-SVG-rendering, 10Upstream: Update librsvg to ≥2.42.3 - https://phabricator.wikimedia.org/T193352 (10tstarling) paravoid explained to me that librsvg 2.44 was uploaded to sid on November 3. There was [[https://lists.debian.org/debian-devel/2018/11/msg00035.html|some consternation]] a... [22:55:06] (03CR) 10jerkins-bot: [V: 04-1] sonofgridengine: skip "dot" files in the config and backport tests [puppet] - 10https://gerrit.wikimedia.org/r/472588 (https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm) [22:56:20] bd808: the bot seems to ingore T208491 or T208518 while other tickets are not affected [22:56:26] they should also be normal public tickets [22:56:58] Thanks! [22:57:02] yeah, the both is logging them wiht notices like "Exception: Task T208518 is a security bug.", but I need to look at the code to see why [22:57:43] interesting. it must be related to them being access requests. 
but they also say Public on top [22:57:53] and they dont need to be hidden or security [22:57:56] (03PS3) 10Bstorm: sonofgridengine: skip "dot" files in the config and backport tests [puppet] - 10https://gerrit.wikimedia.org/r/472588 (https://phabricator.wikimedia.org/T200557) [22:59:01] bd808, the source doesn't appear to generate exceptions like that? [22:59:09] (03CR) 10Bstorm: [C: 032] sonofgridengine: skip "dot" files in the config and backport tests [puppet] - 10https://gerrit.wikimedia.org/r/472588 (https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm) [22:59:22] oh the repository I'm looking at is obsolete [22:59:46] it's linked in the bottom of https://tools.wmflabs.org/stashbot/ :/ [23:00:43] https://phabricator.wikimedia.org/diffusion/LTST/browse/master/stashbot/phab.py$56-57 [23:01:07] the paranoia code is firing here [23:01:59] Krenair, mutante: care to file a bug? I can look at it later tonight [23:02:34] bd808, it's because of this: [23:02:36] "std:maniphest:security_topic": "ops-access-request", [23:03:09] frankly it shouldn't even be checking that [23:03:33] people shouldn't be adding untrustworthy users to private tasks [23:03:37] sure, can file a bug [23:03:47] Krenair: well there was a bug where if stashbot got subscribed to a ticket that it would leak info [23:03:48] this thing is running in labs, a security task seen by it is compromised regardless of whether it ends up on IRC or not [23:04:30] the thing that triggered this code was a public task that was later marked as security [23:04:53] and whoever marked it for private security didn't audit the people with access? [23:05:14] not the bot's fault. 
[23:06:04] no, it wasn't, but we were trying to prevent similar issues in the future :) [23:06:13] this doesn't prevent the leak [23:06:15] it just covers it up [23:06:56] tarrow: oh yea, Gerrit should let you +2 now :) [23:07:09] by this point the details of the task have been processed in plain text by a labs instance and who knows what else has a copy of the API key to do the same [23:07:46] Krenair: I feel like you are ranting at me about the practice of risk mitigation being a false economy [23:07:58] T209124 [23:07:58] T209124: stashbot ignores some public access request tickets because it considers them to be security tickets - https://phabricator.wikimedia.org/T209124 [23:08:46] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment for Lucas Werkmeister - https://phabricator.wikimedia.org/T208518 (10Dzahn) 05Open>03Resolved [23:08:47] this isn't a rant [23:08:57] the bot simply should not be left with the security of security tasks [23:10:23] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [23:11:14] bstorm_: ^ [23:11:25] Oh! Sorry [23:11:32] Distracted [23:11:34] It's merging [23:11:48] no worries, just to let you know [23:11:53] thanks@ [23:12:00] s/@/! [23:12:42] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. [23:18:47] 10Operations, 10LDAP-Access-Requests: Add Michael Grosse to 'wmde' LDAP group - https://phabricator.wikimedia.org/T208722 (10colewhite) `migr` user added to `wmde` LDAP group. 
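For readers following the stashbot thread above (T209124): a minimal sketch of why a public access request can trip a "security bug" guard. It is not stashbot's actual code. It assumes the task's custom fields arrive as a plain dict, that "default" is the value meaning "no security topic", and the function names and the allowlist are hypothetical. Only the field name `std:maniphest:security_topic` and the value `ops-access-request` come from the discussion.

```python
SECURITY_FIELD = "std:maniphest:security_topic"

# Assumed values meaning "no security topic set" (an assumption, not confirmed above).
NOT_SECURITY = {None, "", "default"}


def is_flagged_strict(fields: dict) -> bool:
    """Treat any non-default security topic as a security task.

    With this rule, "ops-access-request" is flagged even on a public task,
    which matches the behaviour reported in the channel.
    """
    return fields.get(SECURITY_FIELD) not in NOT_SECURITY


def is_flagged_with_allowlist(
    fields: dict,
    public_topics: frozenset = frozenset({"ops-access-request"}),
) -> bool:
    """Same check, but skip topics known to be used on public tasks.

    The allowlist is a hypothetical mitigation, one option among several.
    """
    value = fields.get(SECURITY_FIELD)
    return value not in NOT_SECURITY and value not in public_topics
```

Under these assumptions, `is_flagged_strict({SECURITY_FIELD: "ops-access-request"})` is true while the allowlist variant is false. Whether the guard should exist at all is the separate question argued in the channel.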
[23:18:54] 10Operations, 10LDAP-Access-Requests: Add Michael Grosse to 'wmde' LDAP group - https://phabricator.wikimedia.org/T208722 (10colewhite) 05Open>03Resolved [23:29:04] (03Abandoned) 10Dzahn: confd: base::service_unit -> systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/458618 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [23:29:45] (03Abandoned) 10Tim Starling: Temporarily disable Special:GlobalRenameUser [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470956 (https://phabricator.wikimedia.org/T208083) (owner: 10Tim Starling) [23:43:17] (03PS1) 10Bstorm: sonofgridengine: force check=False [puppet] - 10https://gerrit.wikimedia.org/r/472595 (https://phabricator.wikimedia.org/T200557) [23:44:47] (03CR) 10Bstorm: [C: 032] sonofgridengine: force check=False [puppet] - 10https://gerrit.wikimedia.org/r/472595 (https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm) [23:48:05] (03PS2) 10Zoranzoki21: Fix adding vendor files by default for commiting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/471533 [23:49:05] (03PS3) 10Zoranzoki21: Fix adding vendor files by default for commiting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/471533 (https://phabricator.wikimedia.org/T207058) [23:52:16] (03PS1) 10Rush: AlarmCounterLogster: move matching to regex and yaml config [puppet] - 10https://gerrit.wikimedia.org/r/472597 (https://phabricator.wikimedia.org/T208611) [23:53:15] (03CR) 10jerkins-bot: [V: 04-1] AlarmCounterLogster: move matching to regex and yaml config [puppet] - 10https://gerrit.wikimedia.org/r/472597 (https://phabricator.wikimedia.org/T208611) (owner: 10Rush) [23:53:35] (03PS2) 10Rush: AlarmCounterLogster: move matching to regex and yaml config [puppet] - 10https://gerrit.wikimedia.org/r/472597 (https://phabricator.wikimedia.org/T208611) [23:54:22] (03CR) 10jerkins-bot: [V: 04-1] AlarmCounterLogster: move matching to regex and yaml config [puppet] - 10https://gerrit.wikimedia.org/r/472597 
(https://phabricator.wikimedia.org/T208611) (owner: 10Rush)