[00:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor My software never has bugs. It just develops random features. Rise for Evening SWAT (Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190206T0000).
[00:00:04] kemayo and tgr: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:01:01] The patch I'm having deployed is, indeed, a fix for a bug I introduced. So. :D
[00:03:01] I can SWAT.
[00:03:39] twentyafterfour: are you finished with the train?
[00:05:15] I don't see anything in the sal.
[00:05:16] he is not logged in on tin so I guess that's a yes?
[00:05:34] just wanted to make sure since it sounded like there was some major breakage
[00:06:00] Yeah I don't think the train ran today. James_F do you know if it's safe to swat?
[00:06:09] tgr: The train hasn't happened at all, but it's safe to SWAT.
[00:07:05] (CR) Niharika29: [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/488070 (https://phabricator.wikimedia.org/T215126) (owner: Anomie)
[00:08:14] (Merged) jenkins-bot: Preserve Composer's include paths [mediawiki-config] - https://gerrit.wikimedia.org/r/488070 (https://phabricator.wikimedia.org/T215126) (owner: Anomie)
[00:09:58] tgr: That change^ is on mwdebug1002.
[00:12:10] (CR) Niharika29: [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/487987 (https://phabricator.wikimedia.org/T215350) (owner: Gergő Tisza)
[00:13:17] Niharika: verified
[00:13:27] (Merged) jenkins-bot: Add PHP version to MediaWiki logs [mediawiki-config] - https://gerrit.wikimedia.org/r/487987 (https://phabricator.wikimedia.org/T215350) (owner: Gergő Tisza)
[00:14:00] tgr: That change^ is on mwdebug1002 too.
[00:14:11] Syncing your first one.
[00:14:30] (CR) jenkins-bot: Preserve Composer's include paths [mediawiki-config] - https://gerrit.wikimedia.org/r/488070 (https://phabricator.wikimedia.org/T215126) (owner: Anomie)
[00:14:33] (CR) jenkins-bot: Add PHP version to MediaWiki logs [mediawiki-config] - https://gerrit.wikimedia.org/r/487987 (https://phabricator.wikimedia.org/T215350) (owner: Gergő Tisza)
[00:16:51] Niharika: also verified. Thanks! The third one is documentation-only.
[00:16:52] !log niharika29@deploy1001 Synchronized wmf-config/CommonSettings.php: Preserve Composer's include paths - T215126, T215224 (duration: 01m 40s)
[00:17:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:17:09] T215224: PEAR PHP classes are loaded from system packages instead of Composer packages in WMF production - https://phabricator.wikimedia.org/T215224
[00:17:09] T215126: PHP warning on some Echo email sending attempts due to mismatching PEAR file versions - https://phabricator.wikimedia.org/T215126
[00:18:03] tgr: Sounds good.
[00:18:30] !log niharika29@deploy1001 Synchronized wmf-config/logging.php: Add PHP version to MW logs T215350 (duration: 00m 46s)
[00:18:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:18:33] T215350: Wikimedia PHP log entries should include the version of PHP used - https://phabricator.wikimedia.org/T215350
[00:19:49] tgr: For https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/483339/, does it matter which of the two files is synced first?
[00:20:03] (PS3) Niharika29: Demistify $wmgMonologChannels Logstash debug level behavior [mediawiki-config] - https://gerrit.wikimedia.org/r/483339 (owner: Gergő Tisza)
[00:20:14] (CR) Niharika29: [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/483339 (owner: Gergő Tisza)
[00:20:31] Niharika: no, they are docstring changes
[00:20:48] Got it.
[00:21:18] (Merged) jenkins-bot: Demistify $wmgMonologChannels Logstash debug level behavior [mediawiki-config] - https://gerrit.wikimedia.org/r/483339 (owner: Gergő Tisza)
[00:23:45] !log niharika29@deploy1001 Synchronized wmf-config/logging.php: Demystify Logstash debug level behavior (duration: 00m 46s)
[00:23:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:24:53] !log niharika29@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Demystify Logstash debug level behavior (duration: 00m 51s)
[00:24:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:25:31] thanks Niharika!
[00:25:40] You're welcome!
[00:26:04] (CR) jenkins-bot: Demistify $wmgMonologChannels Logstash debug level behavior [mediawiki-config] - https://gerrit.wikimedia.org/r/483339 (owner: Gergő Tisza)
[00:27:14] Kemayo: Your change is on mwdebug1002 now.
[00:28:14] (PS1) Bstorm: toolforge: apply black formatter before adding more python packages [puppet] - https://gerrit.wikimedia.org/r/488206 (https://phabricator.wikimedia.org/T210116)
[00:29:57] PROBLEM - HHVM rendering on mwdebug1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:30:48] Uh oh. I wonder if that's related.
[00:31:07] RECOVERY - HHVM rendering on mwdebug1002 is OK: HTTP OK: HTTP/1.1 200 OK - 76873 bytes in 0.258 second response time
[00:31:28] Niharika: My patch does seem to have worked, and I think there's no plausible way it could be causing any slowdowns itself.
[00:31:52] Kemayo: Okay. Good to sync?
[00:31:53] (CR) Bstorm: [C: +2] toolforge: apply black formatter before adding more python packages [puppet] - https://gerrit.wikimedia.org/r/488206 (https://phabricator.wikimedia.org/T210116) (owner: Bstorm)
[00:31:56] (Insofar as it's pure client-side rendering stuff.)
[00:32:06] Niharika: Yup, sync away.
[00:32:50] Operations, Analytics, Research, Article-Recommendation, and 3 others: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (Nuria) Any further thoughts on this, i think we agree that best solution is to run and deploy these scripts from some vir...
[00:33:52] !log niharika29@deploy1001 Synchronized php-1.33.0-wmf.14/extensions/MobileFrontend/: EditorOverlay: captcha/abusefilter weren't being shown correctly T215101, T202374 (duration: 00m 50s)
[00:33:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:33:56] T215101: CAPTCHAs are no longer served to mobile users, causing a generic error instead - https://phabricator.wikimedia.org/T215101
[00:33:57] T202374: Cleanup Wikitext editor error handling - https://phabricator.wikimedia.org/T202374
[00:34:17] Kemayo: Sync done.
[00:34:42] Niharika: Thanks!
[00:34:57] And that wraps up the swat. :)
[00:36:39] Niharika: probably related to T203664?
[00:36:40] T203664: scap timeout checking index.php/api.php mwdebug1001 / mwdebug1002 - https://phabricator.wikimedia.org/T203664
[00:36:55] tgr: I finally got the branch cut but haven't sync'd it
[00:37:21] also mwdebug tends to time out occasionally when you test it with X-WM-D
[00:37:29] tgr: That looks like it yeah.
[00:37:50] I guess I'll push out wmf.16 as soon as swat is all done?
[00:38:11] twentyafterfour: It's all done.
[00:44:36] (PS1) 20after4: testwikis wikis to 1.33.0-wmf.16 refs T206670 [mediawiki-config] - https://gerrit.wikimedia.org/r/488207
[00:44:38] (CR) 20after4: [C: +2] testwikis wikis to 1.33.0-wmf.16 refs T206670 [mediawiki-config] - https://gerrit.wikimedia.org/r/488207 (owner: 20after4)
[00:46:03] (Merged) jenkins-bot: testwikis wikis to 1.33.0-wmf.16 refs T206670 [mediawiki-config] - https://gerrit.wikimedia.org/r/488207 (owner: 20after4)
[00:46:59] Operations, Core Platform Team (PHP7 (TEC4)), Core Platform Team Kanban (Doing), HHVM, and 3 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (Tgr)
[00:47:42] !log twentyafterfour@deploy1001 Started scap: testwikis wikis to 1.33.0-wmf.16 refs T206670
[00:47:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:47:45] T206670: 1.33.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T206670
[00:49:35] (CR) jenkins-bot: testwikis wikis to 1.33.0-wmf.16 refs T206670 [mediawiki-config] - https://gerrit.wikimedia.org/r/488207 (owner: 20after4)
[00:53:00] (PS1) Bstorm: toolforge: shuffle some packages into and around genpp [puppet] - https://gerrit.wikimedia.org/r/488208 (https://phabricator.wikimedia.org/T210116)
[00:56:01] Operations, hardware-requests: requesting WMF7426 as phabricator system in eqiad - https://phabricator.wikimedia.org/T215335 (RobH) So the original phab1002 was requested on T195623, but then @dzahn advised (via discussion with @20after4) that it needed 64GB, not the 32GB it has. That leaves us with all...
[00:56:09] Operations, hardware-requests: requesting WMF7426 as phabricator system in eqiad - https://phabricator.wikimedia.org/T215335 (RobH) a:faidon→Dzahn
[01:07:52] !log add maintenance and rollback to junos operations class
[01:07:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:22:01] !log remove peering4/6 prefix-list from routers
[01:22:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:54:53] (PS1) 20after4: group0 wikis to 1.33.0-wmf.16 refs T206670 [mediawiki-config] - https://gerrit.wikimedia.org/r/488219
[01:54:55] (CR) 20after4: [C: +2] group0 wikis to 1.33.0-wmf.16 refs T206670 [mediawiki-config] - https://gerrit.wikimedia.org/r/488219 (owner: 20after4)
[01:56:12] (Merged) jenkins-bot: group0 wikis to 1.33.0-wmf.16 refs T206670 [mediawiki-config] - https://gerrit.wikimedia.org/r/488219 (owner: 20after4)
[01:57:25] (CR) jenkins-bot: group0 wikis to 1.33.0-wmf.16 refs T206670 [mediawiki-config] - https://gerrit.wikimedia.org/r/488219 (owner: 20after4)
[02:02:37] !log twentyafterfour@deploy1001 scap failed: average error rate on 7/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/db09a36be5ed3e81155041f7d46ad040 for details)
[02:02:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:08:27] (PS1) 20after4: all wikis to 1.33.0-wmf.14 refs T206670 [mediawiki-config] - https://gerrit.wikimedia.org/r/488224
[02:08:29] (CR) 20after4: [C: +2] all wikis to 1.33.0-wmf.14 refs T206670 [mediawiki-config] - https://gerrit.wikimedia.org/r/488224 (owner: 20after4)
[02:10:11] (Merged) jenkins-bot: all wikis to 1.33.0-wmf.14 refs T206670 [mediawiki-config] - https://gerrit.wikimedia.org/r/488224 (owner: 20after4)
[02:10:32] Well, that's not good.
[02:12:47] James_F: yeah, not good
[02:13:02] Fatal error: Invalid operand type was used: cannot perform this operation with arrays in /srv/mediawiki/php-1.33.0-wmf.16/languages/Language.php on line 519
[02:13:20] https://gerrit.wikimedia.org/r/c/mediawiki/core/+/487345 changed languages/Language.php but it didn't change the + and that's a valid operator on arrays.
[02:13:25] http://php.net/manual/en/language.operators.array.php
[02:14:04] yeah it's a strange error, I don't quite know what to make of it
[02:14:23] "includes/cache/localisation/LocalisationCache.php: No localisation cache found for English. Please run maintenance/rebuildLocalisationCache.php" is more concerning
[02:14:38] Anyway, it's late. File tasks and leave be?
[02:15:27] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[02:16:57] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.33.0-wmf.14 refs T206670
[02:17:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:17:00] T206670: 1.33.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T206670
[02:18:19] James_F: I'm gonna try rebuilding localisation cache and then test on mwdebug.... task incoming
[02:18:40] Hmm.
[02:19:10] weird that error happened a bunch more and then subsided
[02:20:00] (CR) jenkins-bot: all wikis to 1.33.0-wmf.14 refs T206670 [mediawiki-config] - https://gerrit.wikimedia.org/r/488224 (owner: 20after4)
[02:21:57] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[02:25:08] !log twentyafterfour@deploy1001 Started scap: sync and update localization for 1.33.0-wmf.16
[02:25:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:26:45] PROBLEM - Disk space on actinium is CRITICAL: DISK CRITICAL - free space: / 337 MB (3% inode=90%)
[02:27:49] !log actinium - apt-get clean for 8% more disk space after icinga alert
[02:27:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:28:03] RECOVERY - Disk space on actinium is OK: DISK OK
[02:32:26] !log push firewall rule to pfw3-eqiad - T215364
[02:32:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:32:29] T215364: Deploy pfw policy to allow https to frmon.frdev.wikimedia.org - https://phabricator.wikimedia.org/T215364
[02:32:47] (CR) 20after4: [C: +2] Improve commit message for scap update-interwiki-cache [mediawiki-config] - https://gerrit.wikimedia.org/r/472074 (owner: Dereckson)
[02:33:28] Operations, fundraising-tech-ops, netops: Deploy pfw policy to allow https to frmon.frdev.wikimedia.org - https://phabricator.wikimedia.org/T215364 (ayounsi) a:ayounsi Done. No diff in codfw, 1 new rule in eqiad.
[02:33:50] (Merged) jenkins-bot: Improve commit message for scap update-interwiki-cache [mediawiki-config] - https://gerrit.wikimedia.org/r/472074 (owner: Dereckson)
[02:40:58] !log twentyafterfour@deploy1001 Finished scap: sync and update localization for 1.33.0-wmf.16 (duration: 15m 50s)
[02:40:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:42:44] (CR) jenkins-bot: Improve commit message for scap update-interwiki-cache [mediawiki-config] - https://gerrit.wikimedia.org/r/472074 (owner: Dereckson)
[02:47:40] !log actinium - blocking a bad domain and restarting squid3
[02:47:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:52:41] Operations, Scap, Release-Engineering-Team (Watching / External): mwdebug1001 and mwdebug1002 are reliably the last two hosts to finish scap-cdb-rebuild - https://phabricator.wikimedia.org/T203625 (Krinkle)
[02:52:48] Operations, serviceops, vm-requests, Release-Engineering-Team (Watching / External): Increase mwdebugXXXX hosts CPU and memory(?) - https://phabricator.wikimedia.org/T212955 (Krinkle)
[02:58:03] Operations, serviceops, vm-requests, Release-Engineering-Team (Watching / External): Increase mwdebugXXXX hosts CPU and memory(?) - https://phabricator.wikimedia.org/T212955 (Krinkle)
[03:01:07] !log twentyafterfour@deploy1001 Synchronized scap/plugins/updateinterwikicache.py: (no justification provided) (duration: 00m 55s)
[03:01:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:05:55] !log actinium - gzipping and rotating some access logs
[03:05:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:06:50] (PS1) 20after4: testwikis wikis to 1.33.0-wmf.16 refs T206670 [mediawiki-config] - https://gerrit.wikimedia.org/r/488225
[03:06:52] (CR) 20after4: [C: +2] testwikis wikis to 1.33.0-wmf.16 refs T206670 [mediawiki-config] - https://gerrit.wikimedia.org/r/488225 (owner: 20after4)
[03:07:24] late night deploys? ^.^
[03:07:58] (Merged) jenkins-bot: testwikis wikis to 1.33.0-wmf.16 refs T206670 [mediawiki-config] - https://gerrit.wikimedia.org/r/488225 (owner: 20after4)
[03:09:08] legoktm: yeah trying to get the new branch working at least on testwikis
[03:09:41] !log twentyafterfour@deploy1001 Started scap: testwikis wikis to 1.33.0-wmf.16 refs T206670
[03:09:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:09:44] T206670: 1.33.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T206670
[03:10:36] Operations, fundraising-tech-ops, netops: Deploy pfw policy to allow https to frmon.frdev.wikimedia.org - https://phabricator.wikimedia.org/T215364 (cwdent) Open→Resolved Working, thanks @ayounsi !
[03:11:53] (PS9) Dzahn: gerrit: add icinga https check for actual JSON content [puppet] - https://gerrit.wikimedia.org/r/487901 (https://phabricator.wikimedia.org/T215033)
[03:13:59] !log twentyafterfour@deploy1001 Finished scap: testwikis wikis to 1.33.0-wmf.16 refs T206670 (duration: 04m 18s)
[03:14:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:16:20] Operations, Wikimedia-Mailing-lists, User-herron: Ban recurrent spam to Wikimedia mailing lists (January 2019) - https://phabricator.wikimedia.org/T215251 (herron) Sadly I'm seeing unexpected backscatter since merging https://gerrit.wikimedia.org/r/488022. Going to revert this for now while looking...
[03:17:09] (CR) jenkins-bot: testwikis wikis to 1.33.0-wmf.16 refs T206670 [mediawiki-config] - https://gerrit.wikimedia.org/r/488225 (owner: 20after4)
[03:17:20] (CR) Herron: [C: +2] "seeing unexpected backscatter since merging this. reverting." [puppet] - https://gerrit.wikimedia.org/r/488022 (https://phabricator.wikimedia.org/T215251) (owner: MarcoAurelio)
[03:17:44] (PS1) Herron: Revert "lists: reject recurrent spam based on subject as stopgap" [puppet] - https://gerrit.wikimedia.org/r/488226 (https://phabricator.wikimedia.org/T215251)
[03:18:01] (CR) Dzahn: gerrit: add icinga https check for actual JSON content (2 comments) [puppet] - https://gerrit.wikimedia.org/r/487901 (https://phabricator.wikimedia.org/T215033) (owner: Dzahn)
[03:18:34] (PS2) Herron: Revert "lists: reject recurrent spam based on subject as stopgap" [puppet] - https://gerrit.wikimedia.org/r/488226 (https://phabricator.wikimedia.org/T215251)
[03:18:53] herron: letting you merge first to avoid rebase war
[03:19:12] ha! thanks :)
[03:19:39] (CR) Herron: [C: +2] Revert "lists: reject recurrent spam based on subject as stopgap" [puppet] - https://gerrit.wikimedia.org/r/488226 (https://phabricator.wikimedia.org/T215251) (owner: Herron)
[03:20:38] (PS10) Dzahn: gerrit: add icinga https check for actual JSON content [puppet] - https://gerrit.wikimedia.org/r/487901 (https://phabricator.wikimedia.org/T215033)
[03:20:38] mutante: submitted and puppet-merged now, good to go
[03:20:55] thanks! doing
[03:21:08] (CR) Dzahn: [C: +2] gerrit: add icinga https check for actual JSON content [puppet] - https://gerrit.wikimedia.org/r/487901 (https://phabricator.wikimedia.org/T215033) (owner: Dzahn)
[03:30:42] Operations, MediaWiki-Email, Composer: PEAR PHP classes are loaded from system packages instead of Composer packages in WMF production - https://phabricator.wikimedia.org/T215224 (Tgr) Open→Resolved a:Anomie > (Also, those PEAR packages should probably be fixed the way other packages hav...
[03:31:46] Operations, Analytics, Research, serviceops, and 4 others: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (Dzahn)
[03:37:38] Operations, Core Platform Team (PHP7 (TEC4)), Core Platform Team Kanban (Doing), HHVM, and 3 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (Tgr)
[03:45:28] Operations, Analytics, Research, serviceops, and 4 others: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (Dzahn) @Nuria I agree it seems the most likely solution is using a Ganeti VM though but due to allhands we still did not have an SRE m...
[03:46:31] Operations, PHP 7.0 support, Patch-For-Review: Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 (Tgr) >>! In T211488#4926068, @Krinkle wrote: > Presumably, this is due to the `display_errors` INI settings from PHP7 being set incorrectly. Yup....
[03:58:59] (CR) Dzahn: "the new check command has been added on the icinga server but the monitoring check and virtual host don't get realized since they are in a" [puppet] - https://gerrit.wikimedia.org/r/487901 (https://phabricator.wikimedia.org/T215033) (owner: Dzahn)
[03:59:35] Operations, Wikimedia-General-or-Unknown, PHP 7.2 support: mwscript dies on mwmaint with PHP=php7.2 due to php-redis missing - https://phabricator.wikimedia.org/T215376 (Reedy)
[04:04:15] (CR) BryanDavis: [C: +1] toolforge: shuffle some packages into and around genpp [puppet] - https://gerrit.wikimedia.org/r/488208 (https://phabricator.wikimedia.org/T210116) (owner: Bstorm)
[04:14:24] Operations, Core Platform Team (PHP7 (TEC4)), Core Platform Team Kanban (Doing), HHVM, and 3 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (Tgr)
[04:17:05] (PS1) Dzahn: icinga/gerrit: move gerrit monitoring to icinga module [puppet] - https://gerrit.wikimedia.org/r/488232 (https://phabricator.wikimedia.org/T215033)
[04:21:45] (CR) Dzahn: [C: +2] "without this icinga gets reloaded on each puppet run :/" [puppet] - https://gerrit.wikimedia.org/r/488232 (https://phabricator.wikimedia.org/T215033) (owner: Dzahn)
[04:22:08] (PS2) Dzahn: icinga/gerrit: move gerrit monitoring to icinga module [puppet] - https://gerrit.wikimedia.org/r/488232 (https://phabricator.wikimedia.org/T215033)
[04:26:09] (CR) Dzahn: "follow-up https://gerrit.wikimedia.org/r/c/operations/puppet/+/488232 was needed or Icinga removed and added the check and reloaded Icing" [puppet] - https://gerrit.wikimedia.org/r/487901 (https://phabricator.wikimedia.org/T215033) (owner: Dzahn)
[04:39:15] !log reloaded icinga service, cant find new check command definition
[04:39:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:41:12] (CR) Dzahn: "icinga can't find the new check command "The command defined for service Gerrit JSON does not exist" though i can confirm with grep it's n" [puppet] - https://gerrit.wikimedia.org/r/487901 (https://phabricator.wikimedia.org/T215033) (owner: Dzahn)
[04:42:13] Operations, Gerrit, Icinga, monitoring, and 2 others: improve Gerrit monitoring (was: Investigate why icinga did not report high cpu/load for gerrit) - https://phabricator.wikimedia.org/T215033 (Dzahn) new virtual "service" host: (so also ping check for gerrit.wikimedia.org as opposed to cobalt o...
[04:48:27] (PS1) CRusnov: Add reports element to reports path in netbox config [puppet] - https://gerrit.wikimedia.org/r/488235
[04:58:04] Operations, Wikimedia-General-or-Unknown, PHP 7.2 support: mwscript dies on mwmaint with PHP=php7.2 due to php-redis missing - https://phabricator.wikimedia.org/T215376 (Reedy) Looks like we have `3.1.1-1` installed, not `4.1.1-1+0~20180819152012.3+stretch~1.gbpd95942` (which is available)
[05:25:24] Reedy: my understanding is that CLI is still using 7.0 and hasn't switched to 7.2 yet
[05:25:54] Maybe.. But presumably we still need the newer package where we are using 7.2?
[05:26:05] Or does redis magically work due to other reasons?
[05:26:38] these packages are mostly all magic
[05:27:09] I think but am not 100% sure that the 4.x package can support both 7.0 and 7.2
[05:27:18] would need to check the provides
[05:28:56] (CR) CRusnov: [C: +1] "trivial, so looks good. Language nit inline." (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/488204 (https://phabricator.wikimedia.org/T205884) (owner: Volans)
[05:49:33] (CR) Elukey: ">" (6 comments) [puppet] - https://gerrit.wikimedia.org/r/488077 (https://phabricator.wikimedia.org/T212824) (owner: Elukey)
[06:05:53] (CR) Elukey: "> > These limits are pretty high. But beware that the limits will be" [puppet] - https://gerrit.wikimedia.org/r/488078 (https://phabricator.wikimedia.org/T212824) (owner: Elukey)
[06:07:34] (CR) Elukey: "I am also wondering if it would be worth to, by default, add specific user slices for all the opsens without any limits like root, so we s" [puppet] - https://gerrit.wikimedia.org/r/488078 (https://phabricator.wikimedia.org/T212824) (owner: Elukey)
[06:12:36] (CR) Elukey: Introduce systemd::slice::user (1 comment) [puppet] - https://gerrit.wikimedia.org/r/488077 (https://phabricator.wikimedia.org/T212824) (owner: Elukey)
[06:13:37] (PS4) Elukey: Introduce systemd::slice::all_users [puppet] - https://gerrit.wikimedia.org/r/488077 (https://phabricator.wikimedia.org/T212824)
[06:13:39] (PS4) Elukey: Introduce profile::analytics::cluster::limits::statistics [puppet] - https://gerrit.wikimedia.org/r/488078 (https://phabricator.wikimedia.org/T212824)
[06:13:58] (Abandoned) Elukey: profile::toolforge::bastion: use systemd::slice::user [puppet] - https://gerrit.wikimedia.org/r/488079 (owner: Elukey)
[06:24:38] (PS1) Marostegui: production-m5.sql.erb: Revoke access to testreduce from ruthenium [puppet] - https://gerrit.wikimedia.org/r/488238 (https://phabricator.wikimedia.org/T214740)
[06:25:15] (PS1) Marostegui: db-eqiad.php: Depool db1097:3314 [mediawiki-config] - https://gerrit.wikimedia.org/r/488239 (https://phabricator.wikimedia.org/T210713)
[06:27:05] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1097:3314 [mediawiki-config] - https://gerrit.wikimedia.org/r/488239 (https://phabricator.wikimedia.org/T210713) (owner: Marostegui)
[06:28:06] (Merged) jenkins-bot: db-eqiad.php: Depool db1097:3314 [mediawiki-config] - https://gerrit.wikimedia.org/r/488239 (https://phabricator.wikimedia.org/T210713) (owner: Marostegui)
[06:29:02] PROBLEM - puppet last run on mw1305 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/mwrepl]
[06:29:33] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1097:3314 (duration: 01m 06s)
[06:29:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:30:02] PROBLEM - puppet last run on mw1300 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/cgroup-mediawiki-clean]
[06:32:04] PROBLEM - puppet last run on ms-be1035 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/sudoers]
[06:33:19] (CR) DCausse: [C: +1] mwgrep: Query all search clusters [puppet] - https://gerrit.wikimedia.org/r/487924 (https://phabricator.wikimedia.org/T215199) (owner: EBernhardson)
[06:34:50] PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/field.sh]
[06:36:31] (CR) jenkins-bot: db-eqiad.php: Depool db1097:3314 [mediawiki-config] - https://gerrit.wikimedia.org/r/488239 (https://phabricator.wikimedia.org/T210713) (owner: Marostegui)
[06:43:36] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 239, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:46:14] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 241, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:48:10] <_joe_> the cr1-eqiad alerts are Telia's ongoing maintenance I guess
[06:56:34] RECOVERY - puppet last run on mw1300 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[06:58:26] RECOVERY - puppet last run on ms-be1035 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[07:00:38] RECOVERY - puppet last run on mw1305 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[07:01:14] RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[07:05:30] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1097:3314" [mediawiki-config] - https://gerrit.wikimedia.org/r/488240
[07:07:29] (CR) Marostegui: [C: +2] Revert "db-eqiad.php: Depool db1097:3314" [mediawiki-config] - https://gerrit.wikimedia.org/r/488240 (owner: Marostegui)
[07:08:18] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1097:3314" [mediawiki-config] - https://gerrit.wikimedia.org/r/488240 (owner: Marostegui)
[07:09:23] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1097:3314 (duration: 00m 56s)
[07:09:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:09:27] (PS1) Marostegui: db-eqiad.php: Depool db1103:3314 [mediawiki-config] - https://gerrit.wikimedia.org/r/488241 (https://phabricator.wikimedia.org/T210713)
[07:09:29] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1097:3314" [mediawiki-config] - https://gerrit.wikimedia.org/r/488240 (owner: Marostegui)
[07:10:38] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1103:3314 [mediawiki-config] - https://gerrit.wikimedia.org/r/488241 (https://phabricator.wikimedia.org/T210713) (owner: Marostegui)
[07:10:52] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:11:39] (Merged) jenkins-bot: db-eqiad.php: Depool db1103:3314 [mediawiki-config] - https://gerrit.wikimedia.org/r/488241 (https://phabricator.wikimedia.org/T210713) (owner: Marostegui)
[07:13:04] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1103:3314 (duration: 00m 53s)
[07:13:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:13:09] !log Deploy schema change on db1103:3314 (db1097:3314 was also done previously) - T210713
[07:13:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:13:12] T210713: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713
[07:14:51] !log Stop 's4' slave on dbstore1002
[07:14:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:45] !log Deploy schema change on wikitech T210713
[07:19:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:47] T210713: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713
[07:20:28] (CR) jenkins-bot: db-eqiad.php: Depool db1103:3314 [mediawiki-config] - https://gerrit.wikimedia.org/r/488241 (https://phabricator.wikimedia.org/T210713) (owner: Marostegui)
[07:34:19] Operations, Mail, OTRS: OTRS receiving flood of emails - https://phabricator.wikimedia.org/T214604 (akosiaris) Open→Resolved Graphs in codfw mail[1] and eqiad mail[2] point out that this behavior has not reemerged since Jan 25, so I 'll tentatively close this as resolved. Feel free to reopen...
[07:35:43] (PS14) Giuseppe Lavagetto: mediawiki::common: add proxy for services [puppet] - https://gerrit.wikimedia.org/r/483789 (https://phabricator.wikimedia.org/T210717)
[07:40:12] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1103:3314" [mediawiki-config] - https://gerrit.wikimedia.org/r/488242
[07:45:14] (CR) Marostegui: [C: +2] Revert "db-eqiad.php: Depool db1103:3314" [mediawiki-config] - https://gerrit.wikimedia.org/r/488242 (owner: Marostegui)
[07:46:17] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1103:3314" [mediawiki-config] - https://gerrit.wikimedia.org/r/488242 (owner: Marostegui)
[07:47:23] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1103:3314 (duration: 00m 54s)
[07:47:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:47:54] (PS1) Marostegui: db-eqiad.php: Depool db1121 [mediawiki-config] - https://gerrit.wikimedia.org/r/488243 (https://phabricator.wikimedia.org/T210713)
[07:49:00] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1121 [mediawiki-config] - https://gerrit.wikimedia.org/r/488243 (https://phabricator.wikimedia.org/T210713) (owner: Marostegui)
[07:50:02] (Merged) jenkins-bot: db-eqiad.php: Depool db1121 [mediawiki-config] - https://gerrit.wikimedia.org/r/488243 (https://phabricator.wikimedia.org/T210713) (owner: Marostegui)
[07:51:11] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1121 (duration: 00m 54s)
[07:51:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:51:13] !log Deploy schema change on db1121 - this will generate lag on s4 labs - also upgrade MySQL on db1121 T210713
[07:51:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:51:16] T210713: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713
[07:54:00] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1103:3314" [mediawiki-config] - https://gerrit.wikimedia.org/r/488242 (owner: Marostegui)
[07:54:02] (CR) jenkins-bot: db-eqiad.php: Depool db1121 [mediawiki-config] - https://gerrit.wikimedia.org/r/488243 (https://phabricator.wikimedia.org/T210713) (owner: Marostegui)
[08:01:10] Operations, Analytics, SRE-Access-Requests: Allow Erik Bernhardson to have root access on stat1005 for GPU testing - https://phabricator.wikimedia.org/T215384 (elukey) p:Triage→Normal
[08:01:49] (PS2) Muehlenhoff: lxc: Remove support for trusty [puppet] - https://gerrit.wikimedia.org/r/486455
[08:02:33] (CR) Muehlenhoff: [C: +2] lxc: Remove support for trusty [puppet] - https://gerrit.wikimedia.org/r/486455 (owner: Muehlenhoff)
[08:04:34] (PS31) DCausse: [cirrus] Cleanup transitional states [mediawiki-config] - https://gerrit.wikimedia.org/r/476273 (https://phabricator.wikimedia.org/T210381)
[08:10:50] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:10:57] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1121" [mediawiki-config] - https://gerrit.wikimedia.org/r/488251
[08:14:42] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:15:16] Operations, netops, Performance-Team (Radar): Stop prioritizing peering over transit - https://phabricator.wikimedia.org/T204281 (Gilles) I've updated the Google Spreadsheet with the figures up to yesterday. It seems like nothing changed from the end users' perspective in terms of median time-to-firs...
[08:17:24] (PS2) Muehlenhoff: role::graphite::base: Unconditionally use systemd [puppet] - https://gerrit.wikimedia.org/r/487874
[08:18:54] (CR) Muehlenhoff: [C: +2] role::graphite::base: Unconditionally use systemd [puppet] - https://gerrit.wikimedia.org/r/487874 (owner: Muehlenhoff)
[08:20:10] RECOVERY - EDAC syslog messages on thumbor1004 is OK: (C)4 ge (W)2 ge 1 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor1004&var-datasource=eqiad+prometheus/ops
[08:22:21] (CR) Marostegui: [C: +2] Revert "db-eqiad.php: Depool db1121" [mediawiki-config] - https://gerrit.wikimedia.org/r/488251 (owner: Marostegui)
[08:23:25] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1121" [mediawiki-config] - https://gerrit.wikimedia.org/r/488251 (owner: Marostegui)
[08:24:34] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1121 (duration: 00m 53s)
[08:24:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:13] (CR) Volans: administrative: add owner getter to Reason class (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/488204 (https://phabricator.wikimedia.org/T205884) (owner: Volans)
[08:27:53] (PS4) Muehlenhoff: statsite/statsd: Unconditionally use systemd [puppet] - https://gerrit.wikimedia.org/r/487878
[08:28:06] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1121" [mediawiki-config] - https://gerrit.wikimedia.org/r/488251 (owner: Marostegui)
[08:31:42] RECOVERY - Memory correctable errors -EDAC- on thumbor1004 is OK: (C)4 ge (W)2 ge 1 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor1004&var-datasource=eqiad+prometheus/ops
[08:33:29] (CR) Muehlenhoff: [C:
03+2] statsite/statsd: Unconditionally use systemd [puppet] - 10https://gerrit.wikimedia.org/r/487878 (owner: 10Muehlenhoff) [08:34:31] 10Operations, 10Operations-Software-Development, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): Create a cookbook to copy data between WDQS servers - https://phabricator.wikimedia.org/T213401 (10Mathew.onipe) [08:36:07] (03PS1) 10Mathew.onipe: Add wdqs data transfer cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/488256 (https://phabricator.wikimedia.org/T213401) [08:38:32] PROBLEM - Request latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [08:39:40] (03PS2) 10Muehlenhoff: contint::packages::php: Remove support for trusty [puppet] - 10https://gerrit.wikimedia.org/r/488005 [08:39:50] RECOVERY - Request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [08:43:24] (03PS3) 10Mathew.onipe: admin: create new system groups for cloudelastic nodes [puppet] - 10https://gerrit.wikimedia.org/r/487040 (https://phabricator.wikimedia.org/T214922) [08:43:34] (03CR) 10Mathew.onipe: admin: create new system groups for cloudelastic nodes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/487040 (https://phabricator.wikimedia.org/T214922) (owner: 10Mathew.onipe) [08:44:59] (03PS1) 10Marostegui: db-eqiad.php: Depool db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488257 (https://phabricator.wikimedia.org/T210713) [08:46:31] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488257 (https://phabricator.wikimedia.org/T210713) (owner: 10Marostegui) [08:47:18] (03CR) 10DCausse: cloudelastic: Add cloudelastic configs (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/487129 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [08:47:34] (03Merged) 10jenkins-bot: db-eqiad.php: 
Depool db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488257 (https://phabricator.wikimedia.org/T210713) (owner: 10Marostegui) [08:48:45] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1091 (duration: 00m 53s) [08:48:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:59] !log Deploy schema change on db1091 - T210713 [08:49:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:01] T210713: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713 [08:50:20] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488257 (https://phabricator.wikimedia.org/T210713) (owner: 10Marostegui) [08:51:10] 10Operations, 10Performance-Team, 10Traffic, 10media-storage: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (10Gilles) The simplest architecture really is to touch swift objects on every retrieval, which can be done async, but the unknown is how much extra lo... 
[08:53:39] (03CR) 10Addshore: [C: 03+1] "IS.php needs syncing first" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488114 (https://phabricator.wikimedia.org/T215147) (owner: 10Ladsgroup) [08:58:31] (03PS5) 10Marostegui: analytics-dbstore.sql: Initial research user role [puppet] - 10https://gerrit.wikimedia.org/r/487000 (https://phabricator.wikimedia.org/T214469) [08:59:52] (03CR) 10Marostegui: [C: 03+2] analytics-dbstore.sql: Initial research user role [puppet] - 10https://gerrit.wikimedia.org/r/487000 (https://phabricator.wikimedia.org/T214469) (owner: 10Marostegui) [09:00:35] !log Create research_role on dbstore1003-1005 on all instances - T214469 [09:00:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:53] (03PS1) 10Elukey: role::analytics_test_cluster::hadoop::worker: deploy analytics users [puppet] - 10https://gerrit.wikimedia.org/r/488260 (https://phabricator.wikimedia.org/T212256) [09:02:33] (03PS1) 10ArielGlenn: misc dumps: report names of most recent failed wikis if we bail out [dumps] - 10https://gerrit.wikimedia.org/r/488261 [09:03:11] (03CR) 10Elukey: [C: 03+2] role::analytics_test_cluster::hadoop::worker: deploy analytics users [puppet] - 10https://gerrit.wikimedia.org/r/488260 (https://phabricator.wikimedia.org/T212256) (owner: 10Elukey) [09:03:13] (03CR) 10Filippo Giunchedi: [C: 04-1] Bump helm to 2.12.2 for security and features (032 comments) [debs/helm] (debian/stretch-wikimedia) - 10https://gerrit.wikimedia.org/r/488089 (https://phabricator.wikimedia.org/T215244) (owner: 10Fsero) [09:03:38] (03CR) 10Giuseppe Lavagetto: "https://puppet-compiler.wmflabs.org/compiler1002/14542/ seems to DTRT" [puppet] - 10https://gerrit.wikimedia.org/r/483789 (https://phabricator.wikimedia.org/T210717) (owner: 10Giuseppe Lavagetto) [09:03:43] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki::common: add proxy for services [puppet] - 10https://gerrit.wikimedia.org/r/483789 (https://phabricator.wikimedia.org/T210717) (owner: 
10Giuseppe Lavagetto) [09:03:52] (03CR) 10Filippo Giunchedi: [C: 03+1] aptrepo: add prometheus-node-exporter components for all dists [puppet] - 10https://gerrit.wikimedia.org/r/486493 (https://phabricator.wikimedia.org/T213708) (owner: 10Cwhite) [09:04:23] 10Operations, 10Wikimedia-General-or-Unknown, 10serviceops, 10PHP 7.2 support, 10User-jijiki: mwscript dies on mwmaint with PHP=php7.2 due to php-redis missing - https://phabricator.wikimedia.org/T215376 (10jijiki) [09:04:41] 10Operations, 10Wikimedia-General-or-Unknown, 10serviceops, 10PHP 7.2 support, 10User-jijiki: mwscript dies on mwmaint with PHP=php7.2 due to php-redis missing - https://phabricator.wikimedia.org/T215376 (10jijiki) p:05Triage→03Normal a:03jijiki [09:07:39] (03PS1) 10Elukey: Fix system::role for analytics_test_cluster's roles [puppet] - 10https://gerrit.wikimedia.org/r/488265 (https://phabricator.wikimedia.org/T212256) [09:08:17] (03CR) 10Elukey: [C: 03+2] Fix system::role for analytics_test_cluster's roles [puppet] - 10https://gerrit.wikimedia.org/r/488265 (https://phabricator.wikimedia.org/T212256) (owner: 10Elukey) [09:08:30] (03CR) 10Filippo Giunchedi: [C: 03+1] "Haven't tested it but LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/486192 (https://phabricator.wikimedia.org/T213708) (owner: 10Cwhite) [09:14:43] (03PS15) 10Giuseppe Lavagetto: mediawiki::common: add proxy for services [puppet] - 10https://gerrit.wikimedia.org/r/483789 (https://phabricator.wikimedia.org/T210717) [09:15:42] !log swift codfw-prod: more weight for ms-be2047 - T209395 T209921 [09:15:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:46] T209395: rack/setup/install new ms-be servers ms-be204[4-9] ,ms-be2050 - https://phabricator.wikimedia.org/T209395 [09:15:47] T209921: ms-be2047 spontaneous reboots - https://phabricator.wikimedia.org/T209921 [09:20:14] (03PS1) 10Elukey: role::analytics_test_cluster::hadoop::worker: deploy analytics-search-users [puppet] - 10https://gerrit.wikimedia.org/r/488266 (https://phabricator.wikimedia.org/T212256) [09:20:48] (03CR) 10Elukey: [C: 03+2] role::analytics_test_cluster::hadoop::worker: deploy analytics-search-users [puppet] - 10https://gerrit.wikimedia.org/r/488266 (https://phabricator.wikimedia.org/T212256) (owner: 10Elukey) [09:20:54] (03PS2) 10Elukey: role::analytics_test_cluster::hadoop::worker: deploy analytics-search-users [puppet] - 10https://gerrit.wikimedia.org/r/488266 (https://phabricator.wikimedia.org/T212256) [09:20:56] (03CR) 10Elukey: [V: 03+2 C: 03+2] role::analytics_test_cluster::hadoop::worker: deploy analytics-search-users [puppet] - 10https://gerrit.wikimedia.org/r/488266 (https://phabricator.wikimedia.org/T212256) (owner: 10Elukey) [09:22:29] 10Operations, 10Gerrit, 10Icinga, 10monitoring, and 2 others: improve Gerrit monitoring (was: Investigate why icinga did not report high cpu/load for gerrit) - https://phabricator.wikimedia.org/T215033 (10Paladox) Is the script on cobalt? 
[09:23:33] (03PS1) 10Marostegui: dbstore-grants: Add research user and fixing styling [puppet] - 10https://gerrit.wikimedia.org/r/488267 (https://phabricator.wikimedia.org/T214469) [09:25:10] (03PS2) 10Marostegui: dbstore-grants: Add research user and fixing styling [puppet] - 10https://gerrit.wikimedia.org/r/488267 (https://phabricator.wikimedia.org/T214469) [09:27:10] (03CR) 10Muehlenhoff: [C: 03+1] aptrepo: add prometheus-node-exporter components for all dists [puppet] - 10https://gerrit.wikimedia.org/r/486493 (https://phabricator.wikimedia.org/T213708) (owner: 10Cwhite) [09:32:59] (03CR) 10Jcrespo: [C: 03+1] "Needs manual deploy" [puppet] - 10https://gerrit.wikimedia.org/r/488238 (https://phabricator.wikimedia.org/T214740) (owner: 10Marostegui) [09:33:04] !log Remove wikiuser from dbstore1003-dbstore1005 T210478 [09:33:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:08] T210478: Migrate dbstore1002 to a multi instance setup on dbstore100[3-5] - https://phabricator.wikimedia.org/T210478 [09:34:02] (03CR) 10Muehlenhoff: Bump helm to 2.12.2 for security and features (034 comments) [debs/helm] (debian/stretch-wikimedia) - 10https://gerrit.wikimedia.org/r/488089 (https://phabricator.wikimedia.org/T215244) (owner: 10Fsero) [09:35:10] (03CR) 10Muehlenhoff: admin: create new system groups for cloudelastic nodes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/487040 (https://phabricator.wikimedia.org/T214922) (owner: 10Mathew.onipe) [09:37:07] (03CR) 10Muehlenhoff: [C: 04-1] admins: create user with analytics-privatedata access for juliaglen (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/488120 (https://phabricator.wikimedia.org/T214623) (owner: 10Dzahn) [09:41:35] (03CR) 10Filippo Giunchedi: "> Patch Set 12:" [puppet] - 10https://gerrit.wikimedia.org/r/486051 (https://phabricator.wikimedia.org/T187987) (owner: 10Filippo Giunchedi) [09:42:51] 10Operations, 10Traffic: esams cache layer mangles downloads of 
specific url - https://phabricator.wikimedia.org/T215389 (10akosiaris) [09:43:02] 10Operations, 10Traffic: esams cache layer mangles downloads of specific url - https://phabricator.wikimedia.org/T215389 (10akosiaris) p:05Triage→03Normal [09:44:32] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1091" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488273 [09:48:45] (03CR) 10Gehel: [C: 04-1] elasticsearch_cluster: fix issues from test result (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/486858 (https://phabricator.wikimedia.org/T207920) (owner: 10Mathew.onipe) [09:49:17] (03PS13) 10Filippo Giunchedi: prometheus: add feature flag for v2 compat [puppet] - 10https://gerrit.wikimedia.org/r/486051 (https://phabricator.wikimedia.org/T187987) [09:49:19] (03PS10) 10Filippo Giunchedi: WIP: hieradata: use v2 for prometheus1003 [puppet] - 10https://gerrit.wikimedia.org/r/486059 [09:49:36] 10Operations, 10Traffic: esams cache layer mangles downloads of specific url - https://phabricator.wikimedia.org/T215389 (10akosiaris) [09:50:16] (03CR) 10jerkins-bot: [V: 04-1] prometheus: add feature flag for v2 compat [puppet] - 10https://gerrit.wikimedia.org/r/486051 (https://phabricator.wikimedia.org/T187987) (owner: 10Filippo Giunchedi) [09:53:37] (03CR) 10Filippo Giunchedi: [C: 03+2] "PCC https://puppet-compiler.wmflabs.org/compiler1002/14545/" [puppet] - 10https://gerrit.wikimedia.org/r/486051 (https://phabricator.wikimedia.org/T187987) (owner: 10Filippo Giunchedi) [09:53:51] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] prometheus: add feature flag for v2 compat [puppet] - 10https://gerrit.wikimedia.org/r/486051 (https://phabricator.wikimedia.org/T187987) (owner: 10Filippo Giunchedi) [09:56:51] (03CR) 10Muehlenhoff: [C: 03+1] prometheus: upgrade to node-exporter 0.17 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/486192 (https://phabricator.wikimedia.org/T213708) (owner: 10Cwhite) [09:58:15] PROBLEM - puppet last run on analytics1032 is 
CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[analytics-privatedata-users_ensure_members] [09:59:30] (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad.php: Depool db1091" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488273 (owner: 10Marostegui) [10:00:01] 10Operations, 10Traffic: esams cache layer mangles downloads of specific url - https://phabricator.wikimedia.org/T215389 (10akosiaris) cp3030 seems to be in some trouble since approximately 04:30 [1] [1] https://grafana.wikimedia.org/d/000000352/varnish-failed-fetches?orgId=1&var-datasource=esams%20prometheus... [10:00:37] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1091" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488273 (owner: 10Marostegui) [10:00:45] 10Operations, 10Traffic: esams cache layer mangles downloads of specific url - https://phabricator.wikimedia.org/T215389 (10akosiaris) p:05Normal→03High [10:01:38] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1091 (duration: 00m 52s) [10:01:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:51] !log restart varnish-frontend on cp3030 T215389 [10:01:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:54] T215389: esams cache layer mangles downloads of specific url - https://phabricator.wikimedia.org/T215389 [10:03:55] 10Operations, 10Wiki-Loves-Love, 10Wikimedia-Mailing-lists: Reset password for wll mailling list - https://phabricator.wikimedia.org/T215390 (10Peachey88) [10:04:09] !log reimaging graphite2002 to buster [10:04:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:33] 10Operations, 10CirrusSearch, 10serviceops, 10Discovery-Search (Current work): Find an alternative to HHVM curl connection pooling for PHP 7 - https://phabricator.wikimedia.org/T210717 (10dcausse) a:03Joe Thanks @Joe! 
I'll follow-up on this and prepare mw-config patches to use these new entries. [10:05:21] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [10:05:34] 10Operations, 10Patch-For-Review: ferm: Log dropped packets - https://phabricator.wikimedia.org/T116011 (10fgiunchedi) >>! In T116011#4927466, @MoritzMuehlenhoff wrote: >>>! In T116011#4927275, @jbond wrote: >> I have created a simple module for configuereing ulogd2 avalible in https://gerrit.wikimedia.org/r/#... [10:07:57] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [10:08:11] 10Operations, 10Traffic: esams cache layer mangles downloads of specific url - https://phabricator.wikimedia.org/T215389 (10akosiaris) p:05High→03Low The restart of varnish-frontend on cp3030 indeed resolved the issue. I 'll lower priority but leave task open. Feel free to resolve however. [10:08:53] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1091" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488273 (owner: 10Marostegui) [10:25:48] 10Operations, 10Wikimedia-Mailing-lists, 10Patch-For-Review, 10User-herron: Ban recurrent spam to Wikimedia mailing lists (January 2019) - https://phabricator.wikimedia.org/T215251 (10MarcoAurelio) Hi @herron - I am sorry to learn it was causing problems. It did worked in preventing spam sent to `meta-over... [10:34:09] 10Operations, 10Wikimedia-Mailing-lists, 10Patch-For-Review, 10User-herron: Ban recurrent spam to Wikimedia mailing lists (January 2019) - https://phabricator.wikimedia.org/T215251 (10jcrespo) This may be related- there is ongoing spam to mailing lists coming from spoofs of existing central adresses-- this... 
[10:38:36] (03PS2) 10Marostegui: production-m5.sql.erb: Revoke access to testreduce from ruthenium [puppet] - 10https://gerrit.wikimedia.org/r/488238 (https://phabricator.wikimedia.org/T214740) [10:39:36] (03CR) 10Marostegui: [C: 03+2] production-m5.sql.erb: Revoke access to testreduce from ruthenium [puppet] - 10https://gerrit.wikimedia.org/r/488238 (https://phabricator.wikimedia.org/T214740) (owner: 10Marostegui) [10:41:24] !log Revoke access to testreduce from ruthenium on m5 - https://phabricator.wikimedia.org/T214740 [10:41:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:55] 10Operations, 10Performance-Team, 10Traffic, 10media-storage: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (10fgiunchedi) IIRC object expiration was considered years ago (i.e. https://wikitech.wikimedia.org/wiki/Swift/ObjectExpiration) and at the time consid... [10:42:04] (03PS2) 10Gehel: admins: create user with analytics-privatedata access for juliaglen [puppet] - 10https://gerrit.wikimedia.org/r/488120 (https://phabricator.wikimedia.org/T214623) (owner: 10Dzahn) [10:42:47] (03CR) 10Gehel: admins: create user with analytics-privatedata access for juliaglen (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/488120 (https://phabricator.wikimedia.org/T214623) (owner: 10Dzahn) [10:45:15] PROBLEM - Memory correctable errors -EDAC- on mw2206 is CRITICAL: 4 ge 4 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=mw2206&var-datasource=codfw+prometheus/ops [10:45:39] 10Operations, 10Parsoid, 10Patch-For-Review: rack/setup/install scandium.eqiad.wmnet (parsoid test box) - https://phabricator.wikimedia.org/T201366 (10Marostegui) [10:47:49] (03CR) 10Arturo Borrero Gonzalez: "> I am also wondering if it would be worth to, by default, add" [puppet] - 10https://gerrit.wikimedia.org/r/488078 (https://phabricator.wikimedia.org/T212824) (owner: 10Elukey) [10:54:13] (03CR) 10Arturo Borrero 
Gonzalez: "Thanks for your work Luca :-)" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/488077 (https://phabricator.wikimedia.org/T212824) (owner: 10Elukey) [10:58:17] 10Operations, 10Patch-For-Review: ferm: Log dropped packets - https://phabricator.wikimedia.org/T116011 (10jbond) > Do we have a sense of volume of logs? ulog has been running on jbond-puppet-client.puppet.eqiad.wmflabs since 2019-02-05 12:58:43, so roughly 20 hours. -rw-r--r-- 1 root root 116K Feb 6 10... [11:02:13] <_joe_> !log restarting nginx safely across the appserver fleets in order to be able to run puppet without errors [11:02:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:58] 10Puppet, 10cloud-services-team (Kanban): ops/puppet: generalize systemd resource control for users - https://phabricator.wikimedia.org/T215401 (10aborrero) [11:03:08] 10Puppet, 10cloud-services-team (Kanban): ops/puppet: generalize systemd resource control for users - https://phabricator.wikimedia.org/T215401 (10aborrero) p:05Triage→03Normal [11:18:58] PROBLEM - DPKG on mwmaint1002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:19:38] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:19:49] 10Operations, 10cloud-services-team, 10monitoring, 10Patch-For-Review, 10User-fgiunchedi: Upgrade Prometheus to 2.6 in deployment-prep and tools - https://phabricator.wikimedia.org/T215272 (10fgiunchedi) [11:20:14] RECOVERY - DPKG on mwmaint1002 is OK: All packages OK [11:20:56] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational [11:24:27] 10Operations, 10RESTBase, 10RESTBase-Cassandra, 10Core Platform Team Backlog (Watching / External), and 2 others: Memory error on restbase1016 - https://phabricator.wikimedia.org/T212418 (10MoritzMuehlenhoff) >>! In T212418#4904109, @mobrovac wrote: >>>! In T212418#4895809, @fgiunchedi wrote: >>>>! 
In T212... [11:28:32] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:31:10] (03PS1) 10GTirloni: role::wmcs::monitoring - Include profile::wmcs::monitoring [puppet] - 10https://gerrit.wikimedia.org/r/488332 (https://phabricator.wikimedia.org/T215399) [11:32:04] (03CR) 10jerkins-bot: [V: 04-1] role::wmcs::monitoring - Include profile::wmcs::monitoring [puppet] - 10https://gerrit.wikimedia.org/r/488332 (https://phabricator.wikimedia.org/T215399) (owner: 10GTirloni) [11:32:09] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] "modules/profile/manifests/labs/monitoring.pp should be renamed to wmcs/monitoring.pp" [puppet] - 10https://gerrit.wikimedia.org/r/488332 (https://phabricator.wikimedia.org/T215399) (owner: 10GTirloni) [11:34:35] (03PS2) 10GTirloni: role::wmcs::monitoring - Include profile::wmcs::monitoring [puppet] - 10https://gerrit.wikimedia.org/r/488332 (https://phabricator.wikimedia.org/T215399) [11:38:15] (03PS18) 10Jbond: Create module for managing ulogd [puppet] - 10https://gerrit.wikimedia.org/r/486513 (https://phabricator.wikimedia.org/T116011) [11:39:13] (03CR) 10jerkins-bot: [V: 04-1] Create module for managing ulogd [puppet] - 10https://gerrit.wikimedia.org/r/486513 (https://phabricator.wikimedia.org/T116011) (owner: 10Jbond) [11:40:19] (03PS1) 10Filippo Giunchedi: prometheus: use v2 rules for beta [puppet] - 10https://gerrit.wikimedia.org/r/488344 (https://phabricator.wikimedia.org/T215272) [11:41:09] (03CR) 10GTirloni: [C: 03+2] role::wmcs::monitoring - Include profile::wmcs::monitoring [puppet] - 10https://gerrit.wikimedia.org/r/488332 (https://phabricator.wikimedia.org/T215399) (owner: 10GTirloni) [11:43:21] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: use v2 rules for beta [puppet] - 10https://gerrit.wikimedia.org/r/488344 (https://phabricator.wikimedia.org/T215272) (owner: 
10Filippo Giunchedi) [11:43:29] (03PS2) 10Filippo Giunchedi: prometheus: use v2 rules for beta [puppet] - 10https://gerrit.wikimedia.org/r/488344 (https://phabricator.wikimedia.org/T215272) [11:53:09] (03PS1) 10Filippo Giunchedi: prometheus: add experimental alerts for beta [puppet] - 10https://gerrit.wikimedia.org/r/488348 [11:53:47] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: add experimental alerts for beta [puppet] - 10https://gerrit.wikimedia.org/r/488348 (owner: 10Filippo Giunchedi) [11:53:55] (03PS2) 10Filippo Giunchedi: prometheus: add experimental alerts for beta [puppet] - 10https://gerrit.wikimedia.org/r/488348 [11:56:15] (03PS1) 10GTirloni: profile::wmcs::monitoring - Use openstack clientpackages [puppet] - 10https://gerrit.wikimedia.org/r/488354 (https://phabricator.wikimedia.org/T215399) [11:56:19] (03PS1) 10Filippo Giunchedi: prometheus: use .yml for v2 rules [puppet] - 10https://gerrit.wikimedia.org/r/488355 [11:56:38] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: use .yml for v2 rules [puppet] - 10https://gerrit.wikimedia.org/r/488355 (owner: 10Filippo Giunchedi) [11:56:43] !log restarting varnish-fe in cp3042 - T215389 [11:56:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:46] T215389: esams cache layer mangles downloads of specific url - https://phabricator.wikimedia.org/T215389 [11:56:47] (03PS2) 10Filippo Giunchedi: prometheus: use .yml for v2 rules [puppet] - 10https://gerrit.wikimedia.org/r/488355 [11:57:05] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp3042.esams.wmnet [11:57:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:50] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp3042.esams.wmnet [11:58:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:35] (03CR) 10GTirloni: [C: 03+2] profile::wmcs::monitoring - Use openstack clientpackages [puppet] - 
10https://gerrit.wikimedia.org/r/488354 (https://phabricator.wikimedia.org/T215399) (owner: 10GTirloni) [11:59:43] (03PS2) 10GTirloni: profile::wmcs::monitoring - Use openstack clientpackages [puppet] - 10https://gerrit.wikimedia.org/r/488354 (https://phabricator.wikimedia.org/T215399) [12:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: That opportune time is upon us again. Time for a European Mid-day SWAT(Max 6 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190206T1200). [12:00:04] dcausse and Amir1: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:20] o/ [12:00:24] Mine is not estable [12:00:29] (03PS19) 10Jbond: Create module for managing ulogd [puppet] - 10https://gerrit.wikimedia.org/r/486513 (https://phabricator.wikimedia.org/T116011) [12:00:35] IS needs to be synced first [12:01:29] (03CR) 10jerkins-bot: [V: 04-1] Create module for managing ulogd [puppet] - 10https://gerrit.wikimedia.org/r/486513 (https://phabricator.wikimedia.org/T116011) (owner: 10Jbond) [12:01:58] PROBLEM - Check systemd state on ms-be2032 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:05:23] (03PS1) 10GTirloni: profile::wmcs::monitoring - Fix missing package [puppet] - 10https://gerrit.wikimedia.org/r/488364 (https://phabricator.wikimedia.org/T215399) [12:08:50] (03CR) 10GTirloni: [C: 03+2] profile::wmcs::monitoring - Fix missing package [puppet] - 10https://gerrit.wikimedia.org/r/488364 (https://phabricator.wikimedia.org/T215399) (owner: 10GTirloni) [12:09:48] RECOVERY - Check systemd state on ms-be2032 is OK: OK - running: The system is fully operational [12:10:03] It seems there's no one [12:10:18] I can do SWAT I guess [12:10:24] dcausse: Around? 
[12:11:33] (03PS2) 10Ladsgroup: Use separate DB connection for ID insertions on testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488114 (https://phabricator.wikimedia.org/T215147) [12:11:51] (03PS1) 10Elukey: role::analytics_test_cluster::hadoop::worker: add hadoop users [puppet] - 10https://gerrit.wikimedia.org/r/488367 (https://phabricator.wikimedia.org/T212256) [12:12:32] PROBLEM - Check systemd state on ms-be2020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:12:37] (03CR) 10Elukey: [C: 03+2] role::analytics_test_cluster::hadoop::worker: add hadoop users [puppet] - 10https://gerrit.wikimedia.org/r/488367 (https://phabricator.wikimedia.org/T212256) (owner: 10Elukey) [12:12:57] (03CR) 10Ladsgroup: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488114 (https://phabricator.wikimedia.org/T215147) (owner: 10Ladsgroup) [12:14:04] (03Merged) 10jenkins-bot: Use separate DB connection for ID insertions on testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488114 (https://phabricator.wikimedia.org/T215147) (owner: 10Ladsgroup) [12:14:54] (03PS1) 10GTirloni: profile::wmcs::monitoring - Fix IPv6 lookup [puppet] - 10https://gerrit.wikimedia.org/r/488370 (https://phabricator.wikimedia.org/T215399) [12:16:23] (03PS2) 10GTirloni: profile::wmcs::monitoring - Fix IPv6 lookup [puppet] - 10https://gerrit.wikimedia.org/r/488370 (https://phabricator.wikimedia.org/T215399) [12:17:19] (03PS20) 10Jbond: Create module for managing ulogd [puppet] - 10https://gerrit.wikimedia.org/r/486513 (https://phabricator.wikimedia.org/T116011) [12:17:30] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:488114|Use separate DB connection for ID insertions on testwikidatawiki (T215147)]], Part I (duration: 00m 55s) [12:17:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:34] T215147: Deploy separate DB connection for 
ID insertions on wikidata.org - https://phabricator.wikimedia.org/T215147 [12:17:45] (03CR) 10GTirloni: [C: 03+2] profile::wmcs::monitoring - Fix IPv6 lookup [puppet] - 10https://gerrit.wikimedia.org/r/488370 (https://phabricator.wikimedia.org/T215399) (owner: 10GTirloni) [12:18:13] (03CR) 10jerkins-bot: [V: 04-1] Create module for managing ulogd [puppet] - 10https://gerrit.wikimedia.org/r/486513 (https://phabricator.wikimedia.org/T116011) (owner: 10Jbond) [12:18:49] 10Operations, 10Wikibase-Containers, 10Wikidata, 10serviceops, and 2 others: Create a wmf production ready nginx image - https://phabricator.wikimedia.org/T209292 (10Addshore) >>! In T209292#4920883, @Ladsgroup wrote: > I would be in favor of not using nginx and turning WDQS gui to a proper nodejs applicat... [12:19:31] !log ladsgroup@deploy1001 Synchronized wmf-config/Wikibase.php: SWAT: [[gerrit:488114|Use separate DB connection for ID insertions on testwikidatawiki (T215147)]], Part II (duration: 00m 54s) [12:19:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:48] 10Operations, 10Release Pipeline, 10Core Platform Team Backlog (Watching / External), 10Release-Engineering-Team (Watching / External), 10Services (watching): Track and install additional npm packages for all service container images - https://phabricator.wikimedia.org/T205911 (10akosiaris) Assuming we g... 
[12:19:58] (03CR) 10jenkins-bot: Use separate DB connection for ID insertions on testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488114 (https://phabricator.wikimedia.org/T215147) (owner: 10Ladsgroup) [12:20:56] !log restarting varnish-fe safely across esams/text cluster - T215389 [12:20:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:58] T215389: esams cache layer mangles downloads of specific url - https://phabricator.wikimedia.org/T215389 [12:21:28] 10Operations, 10Availability (MediaWiki-MultiDC), 10Patch-For-Review, 10Performance-Team (Radar): Deploy mcrouter to production as a wancache backend - https://phabricator.wikimedia.org/T192370 (10Joe) 05Open→03Resolved [12:21:36] RECOVERY - Check systemd state on ms-be2020 is OK: OK - running: The system is fully operational [12:21:36] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp3041.esams.wmnet [12:21:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:01] (03PS1) 10Giuseppe Lavagetto: role::deployment_server: add the services proxy profile [puppet] - 10https://gerrit.wikimedia.org/r/488373 (https://phabricator.wikimedia.org/T210717) [12:22:39] Logs seems clean [12:22:47] !log EU SWAT is done [12:22:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:54] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp3041.esams.wmnet [12:22:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:07] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp3040.esams.wmnet [12:24:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:39] (03CR) 10Giuseppe Lavagetto: [C: 03+2] role::deployment_server: add the services proxy profile [puppet] - 10https://gerrit.wikimedia.org/r/488373 (https://phabricator.wikimedia.org/T210717) (owner: 10Giuseppe 
Lavagetto) [12:24:51] (03PS1) 10Elukey: role::analytics_test_cluster::hadoop: remove ssh keys for analytics-search users [puppet] - 10https://gerrit.wikimedia.org/r/488375 (https://phabricator.wikimedia.org/T212256) [12:24:58] RECOVERY - puppet last run on analytics1032 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:25:06] (03PS2) 10Giuseppe Lavagetto: role::deployment_server: add the services proxy profile [puppet] - 10https://gerrit.wikimedia.org/r/488373 (https://phabricator.wikimedia.org/T210717) [12:25:12] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp3040.esams.wmnet [12:25:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:25] (03PS21) 10Jbond: Create module for managing ulogd [puppet] - 10https://gerrit.wikimedia.org/r/486513 (https://phabricator.wikimedia.org/T116011) [12:26:42] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp3033.esams.wmnet [12:26:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:32] (03PS2) 10Elukey: role::analytics_test_cluster::hadoop: remove ssh keys for analytics-search users [puppet] - 10https://gerrit.wikimedia.org/r/488375 (https://phabricator.wikimedia.org/T212256) [12:28:09] (03CR) 10Elukey: [C: 03+2] role::analytics_test_cluster::hadoop: remove ssh keys for analytics-search users [puppet] - 10https://gerrit.wikimedia.org/r/488375 (https://phabricator.wikimedia.org/T212256) (owner: 10Elukey) [12:28:15] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp3033.esams.wmnet [12:28:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:16] (03PS3) 10Elukey: role::analytics_test_cluster::hadoop: remove ssh keys for analytics-search users [puppet] - 10https://gerrit.wikimedia.org/r/488375 (https://phabricator.wikimedia.org/T212256) [12:28:18] (03CR) 10Elukey: [V: 03+2 C: 03+2] 
role::analytics_test_cluster::hadoop: remove ssh keys for analytics-search users [puppet] - 10https://gerrit.wikimedia.org/r/488375 (https://phabricator.wikimedia.org/T212256) (owner: 10Elukey) [12:29:32] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp3032.esams.wmnet [12:29:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:51] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp3032.esams.wmnet [12:30:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:30] PROBLEM - DPKG on deploy2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:32:02] <_joe_> that's me ^^ [12:32:14] PROBLEM - Check systemd state on deploy2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:33:32] RECOVERY - Check systemd state on deploy2001 is OK: OK - running: The system is fully operational [12:33:49] 10Operations, 10Traffic: esams cache layer mangles downloads of specific url - https://phabricator.wikimedia.org/T215389 (10Vgutierrez) Checking the rest of the text cluster in esams from bast3002 showed that all of them were affected. After restarting varnish-frontend the issue is gone. I'll leave the task o... [12:34:08] RECOVERY - DPKG on deploy2001 is OK: All packages OK [12:38:33] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks. 
I 'll merge" [puppet] - 10https://gerrit.wikimedia.org/r/483432 (owner: 10Alexandros Kosiaris) [12:38:41] (03PS2) 10Alexandros Kosiaris: mx/otrs/lists: Move spamassasin to aggregate_networks [puppet] - 10https://gerrit.wikimedia.org/r/483432 [12:39:57] (03PS2) 10Alexandros Kosiaris: phab::exim: Move to aggregate_networks [puppet] - 10https://gerrit.wikimedia.org/r/483440 [12:40:07] (03CR) 10Alexandros Kosiaris: [C: 03+2] phab::exim: Move to aggregate_networks [puppet] - 10https://gerrit.wikimedia.org/r/483440 (owner: 10Alexandros Kosiaris) [12:40:51] (03PS1) 10Elukey: role::analytics_cluster::hadoop::master|standby: remove more ssh keys [puppet] - 10https://gerrit.wikimedia.org/r/488382 (https://phabricator.wikimedia.org/T212256) [12:41:27] (03PS2) 10Elukey: role::analytics_cluster::hadoop::master|standby: remove more ssh keys [puppet] - 10https://gerrit.wikimedia.org/r/488382 (https://phabricator.wikimedia.org/T212256) [12:42:11] (03CR) 10Elukey: [C: 03+2] role::analytics_cluster::hadoop::master|standby: remove more ssh keys [puppet] - 10https://gerrit.wikimedia.org/r/488382 (https://phabricator.wikimedia.org/T212256) (owner: 10Elukey) [12:51:20] (03CR) 10Alexandros Kosiaris: "PCC at https://puppet-compiler.wmflabs.org/compiler1002/14554/mw1241.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/483430 (owner: 10Alexandros Kosiaris) [12:52:55] (03PS2) 10Alexandros Kosiaris: hhvm: Switch to using domain_networks [puppet] - 10https://gerrit.wikimedia.org/r/483430 [12:54:57] (03PS5) 10Alexandros Kosiaris: mathoid: Update prometheus-stats.conf [deployment-charts] - 10https://gerrit.wikimedia.org/r/486396 [12:56:56] (03CR) 10Jbond: aptrepo: add prometheus-node-exporter components for all dists (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/486493 (https://phabricator.wikimedia.org/T213708) (owner: 10Cwhite) [12:57:43] (03PS1) 10Giuseppe Lavagetto: profile::services_proxy: require nginx_bootstrap [puppet] - 10https://gerrit.wikimedia.org/r/488400 
[12:59:22] (03CR) 10Giuseppe Lavagetto: [C: 03+1] hhvm: Switch to using domain_networks [puppet] - 10https://gerrit.wikimedia.org/r/483430 (owner: 10Alexandros Kosiaris) [13:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190206T1300) [13:02:30] PROBLEM - EDAC syslog messages on thumbor1004 is CRITICAL: 4.001 ge 4 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor1004&var-datasource=eqiad+prometheus/ops [13:02:39] (03CR) 10Jbond: prometheus: upgrade to node-exporter 0.17 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/486192 (https://phabricator.wikimedia.org/T213708) (owner: 10Cwhite) [13:02:47] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] "thanks! merging" [deployment-charts] - 10https://gerrit.wikimedia.org/r/486396 (owner: 10Alexandros Kosiaris) [13:04:05] (03CR) 10Jbond: prometheus: upgrade to node-exporter 0.17 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/486192 (https://phabricator.wikimedia.org/T213708) (owner: 10Cwhite) [13:07:03] (03CR) 10Muehlenhoff: [C: 03+1] prometheus: upgrade to node-exporter 0.17 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/486192 (https://phabricator.wikimedia.org/T213708) (owner: 10Cwhite) [13:09:03] (03CR) 10Elukey: [C: 04-1] role::wmcs::openstack::main::labweb: add mcrouter config (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/487889 (https://phabricator.wikimedia.org/T214275) (owner: 10Elukey) [13:09:19] <_joe_> elukey: don't [13:09:54] <_joe_> using mcrouter for labweb is like trying to kill a mosquito with a bazooka [13:10:38] <_joe_> we should have better options [13:11:13] _joe_ ah ok, I thought it was the natural replacement for nutcracker [13:11:16] that's already there [13:11:24] (03PS1) 10Jcrespo: mariadb: Depool db2055 for performance testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488410 (https://phabricator.wikimedia.org/T93564) [13:11:36] <_joe_> 
elukey: unless you leave out all the replication code [13:12:04] <_joe_> and you just set mcrouter-aware = false in the configuration of the cache [13:12:19] <_joe_> so no tls, no replica across clusters, no mcrouter-aware setup [13:12:50] could wikitech be served by the appservers? [13:13:03] <_joe_> not right now, no [13:13:15] all right, will abandon the change [13:13:38] (03CR) 10Jcrespo: "WARNING: I will do some alters on enwiki.abuse_filter_log" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488410 (https://phabricator.wikimedia.org/T93564) (owner: 10Jcrespo) [13:13:47] (03Abandoned) 10Elukey: role::wmcs::openstack::main::labweb: add mcrouter config [puppet] - 10https://gerrit.wikimedia.org/r/487889 (https://phabricator.wikimedia.org/T214275) (owner: 10Elukey) [13:14:07] <_joe_> elukey: wait, I was looking at your change now [13:14:15] <_joe_> it would need minimal mangling AIUI [13:14:36] PROBLEM - Memory correctable errors -EDAC- on thumbor1004 is CRITICAL: 4.001 ge 4 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor1004&var-datasource=eqiad+prometheus/ops [13:16:14] _joe_ I was about to ask to you how to properly override the main mcrouter::shards config, the one that I added doesn't get applied (common/mcrouter.yaml seems to get always the priority) [13:16:35] anyway, we can chat in the task what's best [13:16:38] <_joe_> elukey: oh that's sad [13:16:52] (03PS1) 10Filippo Giunchedi: prometheus: fix prometheus::rule beta invocation [puppet] - 10https://gerrit.wikimedia.org/r/488411 [13:21:12] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: fix prometheus::rule beta invocation [puppet] - 10https://gerrit.wikimedia.org/r/488411 (owner: 10Filippo Giunchedi) [13:21:21] (03PS2) 10Filippo Giunchedi: prometheus: fix prometheus::rule beta invocation [puppet] - 10https://gerrit.wikimedia.org/r/488411 [13:23:56] (03PS2) 10Mark Bergsma: Move Attribute constants from attributes to constants [debs/pybal] - 
10https://gerrit.wikimedia.org/r/447808 [13:23:58] (03PS2) 10Mark Bergsma: Use absolute imports for all BGP modules [debs/pybal] - 10https://gerrit.wikimedia.org/r/447809 [13:24:00] (03PS1) 10Mark Bergsma: Split off static BGP parse/encode methods into separate modules [debs/pybal] - 10https://gerrit.wikimedia.org/r/488412 [13:24:39] (03CR) 10jerkins-bot: [V: 04-1] Use absolute imports for all BGP modules [debs/pybal] - 10https://gerrit.wikimedia.org/r/447809 (owner: 10Mark Bergsma) [13:35:41] 10Operations, 10ops-eqiad, 10Thumbor, 10serviceops: thumbor1004 memory errors - https://phabricator.wikimedia.org/T215411 (10jijiki) [13:35:53] 10Operations, 10ops-eqiad, 10Thumbor, 10serviceops: thumbor1004 memory errors - https://phabricator.wikimedia.org/T215411 (10jijiki) p:05Triage→03Normal [13:36:10] (03PS1) 10Filippo Giunchedi: prometheus: reuse exec from ::server in ::rule [puppet] - 10https://gerrit.wikimedia.org/r/488420 [13:36:39] (03CR) 10jerkins-bot: [V: 04-1] prometheus: reuse exec from ::server in ::rule [puppet] - 10https://gerrit.wikimedia.org/r/488420 (owner: 10Filippo Giunchedi) [13:36:56] 10Operations, 10ops-eqiad, 10Thumbor, 10serviceops: thumbor1004 memory errors - https://phabricator.wikimedia.org/T215411 (10jijiki) [13:37:43] ACKNOWLEDGEMENT - EDAC syslog messages on thumbor1004 is CRITICAL: 4.001 ge 4 Effie Mouzeli Server has memory errors - T215411 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor1004&var-datasource=eqiad+prometheus/ops [13:38:47] (03PS2) 10Filippo Giunchedi: prometheus: reuse exec from ::server in ::rule [puppet] - 10https://gerrit.wikimedia.org/r/488420 [13:38:49] (03PS11) 10Filippo Giunchedi: WIP: hieradata: use v2 for prometheus1003 [puppet] - 10https://gerrit.wikimedia.org/r/486059 [13:42:50] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: reuse exec from ::server in ::rule [puppet] - 10https://gerrit.wikimedia.org/r/488420 (owner: 10Filippo Giunchedi) [13:45:30] (03CR) 10Marostegui: [C: 03+1] 
mariadb: Depool db2055 for performance testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488410 (https://phabricator.wikimedia.org/T93564) (owner: 10Jcrespo) [13:46:54] 10Operations, 10Analytics, 10Research-management, 10User-Elukey: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843 (10elukey) a:05Halfak→03elukey [13:47:26] (03PS1) 10GTirloni: shinken: Remove high iowait alert [puppet] - 10https://gerrit.wikimedia.org/r/488426 (https://phabricator.wikimedia.org/T215412) [13:49:46] (03PS2) 10GTirloni: shinken: Remove high iowait alert [puppet] - 10https://gerrit.wikimedia.org/r/488426 (https://phabricator.wikimedia.org/T215412) [13:51:07] (03CR) 10GTirloni: [C: 03+2] shinken: Remove high iowait alert [puppet] - 10https://gerrit.wikimedia.org/r/488426 (https://phabricator.wikimedia.org/T215412) (owner: 10GTirloni) [13:52:06] (03PS3) 10Mark Bergsma: Use absolute imports for all BGP modules [debs/pybal] - 10https://gerrit.wikimedia.org/r/447809 [13:53:20] (03Abandoned) 10Mark Bergsma: Split off static BGP parse/encode methods into separate modules [debs/pybal] - 10https://gerrit.wikimedia.org/r/488412 (owner: 10Mark Bergsma) [13:53:34] (03PS1) 10Filippo Giunchedi: prometheus: fix beta v2 rules [puppet] - 10https://gerrit.wikimedia.org/r/488427 [13:54:17] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: fix beta v2 rules [puppet] - 10https://gerrit.wikimedia.org/r/488427 (owner: 10Filippo Giunchedi) [13:55:01] (03PS2) 10Filippo Giunchedi: prometheus: fix beta v2 rules [puppet] - 10https://gerrit.wikimedia.org/r/488427 [13:58:26] PROBLEM - EDAC syslog messages on mw2206 is CRITICAL: 4.008 ge 4 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=mw2206&var-datasource=codfw+prometheus/ops [14:00:05] Deploy window MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190206T1400) [14:00:41] chooo chooo [14:02:04] 🚂 [14:02:13] (03CR) 10Mark Bergsma: [C: 03+1] Move 
Attribute constants from attributes to constants [debs/pybal] - 10https://gerrit.wikimedia.org/r/447808 (owner: 10Mark Bergsma) [14:02:33] 10Operations, 10ops-codfw, 10serviceops, 10User-jijiki: mw2206.codfw.wmnet memory issues - https://phabricator.wikimedia.org/T215415 (10jijiki) p:05Triage→03Normal [14:02:48] is the train going to be deployed only later on?? [14:02:56] (I am looking forward to a metawiki change) [14:03:57] (03PS2) 10Alexandros Kosiaris: dnsrecursor: Switch to using aggregate_networks [puppet] - 10https://gerrit.wikimedia.org/r/483431 [14:04:53] ACKNOWLEDGEMENT - Memory correctable errors -EDAC- on mw2206 is CRITICAL: 5.001 ge 4 Effie Mouzeli Server has memory errors - T215415 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=mw2206&var-datasource=codfw+prometheus/ops [14:05:23] ACKNOWLEDGEMENT - EDAC syslog messages on mw2206 is CRITICAL: 4.008 ge 4 Effie Mouzeli Server has memory errors - T215415 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=mw2206&var-datasource=codfw+prometheus/ops [14:06:57] (03CR) 10Alexandros Kosiaris: [C: 03+2] "thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/483430 (owner: 10Alexandros Kosiaris) [14:06:58] ACKNOWLEDGEMENT - Memory correctable errors -EDAC- on thumbor1004 is CRITICAL: 4.001 ge 4 Effie Mouzeli Server has memory errors - T215411 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor1004&var-datasource=eqiad+prometheus/ops [14:07:04] (03PS3) 10Alexandros Kosiaris: hhvm: Switch to using domain_networks [puppet] - 10https://gerrit.wikimedia.org/r/483430 [14:07:33] (03PS4) 10Mark Bergsma: Use absolute imports for all BGP modules [debs/pybal] - 10https://gerrit.wikimedia.org/r/447809 [14:07:44] (03CR) 10Alexandros Kosiaris: mathoid: Update prometheus-stats.conf (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/486396 (owner: 10Alexandros Kosiaris) [14:09:28] (03PS2) 10Zoranzoki21: IS.php: Add wgProofreadPagePageJoiner, set it per default on '-' and at zhwikisource on __PAGEJOIN__ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482502 (https://phabricator.wikimedia.org/T205826) [14:11:05] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Per https://puppet-compiler.wmflabs.org/compiler1002/14560/dns2001.wikimedia.org/ essentially a noop, merging" [puppet] - 10https://gerrit.wikimedia.org/r/483431 (owner: 10Alexandros Kosiaris) [14:11:23] (03PS3) 10Alexandros Kosiaris: dnsrecursor: Switch to using aggregate_networks [puppet] - 10https://gerrit.wikimedia.org/r/483431 [14:11:34] (03PS4) 10Zoranzoki21: Add categories for all Croatian projects at wmgBabelMainCategory [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482548 [14:12:23] (03PS3) 10Zoranzoki21: Add category at wgGettingStartedExcludedCategories for srwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482534 [14:13:14] (03PS2) 10Zoranzoki21: Changed wgImportSources for srwikinews to w:sr instead of no which is unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486538 (https://phabricator.wikimedia.org/T214562) [14:13:22] (03PS3) 10Zoranzoki21: 
Removed namespace Коментар, added namespace Портал on srwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486536 (https://phabricator.wikimedia.org/T214561) [14:17:18] (03PS2) 10Alexandros Kosiaris: networks: Remove old and deprecated all_networks var [puppet] - 10https://gerrit.wikimedia.org/r/483443 [14:17:22] (03CR) 10Alexandros Kosiaris: [C: 03+2] networks: Remove old and deprecated all_networks var [puppet] - 10https://gerrit.wikimedia.org/r/483443 (owner: 10Alexandros Kosiaris) [14:17:57] (03PS2) 10Muehlenhoff: varnishkafka: Remove support for upstart [puppet] - 10https://gerrit.wikimedia.org/r/487883 [14:18:17] (03PS5) 10Mark Bergsma: Use absolute imports for all BGP modules [debs/pybal] - 10https://gerrit.wikimedia.org/r/447809 [14:21:21] (03CR) 10Muehlenhoff: [C: 03+2] varnishkafka: Remove support for upstart [puppet] - 10https://gerrit.wikimedia.org/r/487883 (owner: 10Muehlenhoff) [14:24:17] 10Operations, 10Security: update libgd2 - https://phabricator.wikimedia.org/T215418 (10jbond) [14:25:58] 10Operations, 10Security: update libgd2 - https://phabricator.wikimedia.org/T215418 (10jbond) no bugs reported since update released [14:26:14] 10Operations, 10Security: update libgd2 - https://phabricator.wikimedia.org/T215418 (10jbond) [14:26:34] PROBLEM - puppet last run on cp3030 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:29:49] !log add term mysql-dbstore to analytics-in4/6 on cr1/2-eqiad to allow tcp connections to dbstore100[3-5] - T210478 [14:29:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:52] T210478: Migrate dbstore1002 to a multi instance setup on dbstore100[3-5] - https://phabricator.wikimedia.org/T210478 [14:29:54] XioNoX: --^ [14:30:10] I haven't added any ipv6 rule since I didn't find any for the mysql hosts [14:30:19] but we can discuss to add them or not, lemme know :) [14:30:25] I added the diff in the task [14:31:50] RECOVERY - puppet last run on cp3030 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [14:34:56] 10Operations, 10Security: update libgd2 - https://phabricator.wikimedia.org/T215418 (10jbond) ` diff -Nru libgd2-2.1.0/debian/changelog libgd2-2.1.0/debian/changelog --- libgd2-2.1.0/debian/changelog 2017-08-31 12:31:50.000000000 +0000 +++ libgd2-2.1.0/debian/changelog 2019-01-30 18:03:02.000000000... 
[14:35:05] (03PS1) 10Muehlenhoff: redis: Stop supporting trusty/upstart [puppet] - 10https://gerrit.wikimedia.org/r/488436 [14:36:59] !log draining restbase1017 for eventual reboot for kernel security update (bundled with Java update) [14:37:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:45] (03PS1) 10Alexandros Kosiaris: WIP: Move evaluation of wikimedia_trust/nets to puppet [puppet] - 10https://gerrit.wikimedia.org/r/488445 (https://phabricator.wikimedia.org/T213475) [14:48:38] (03PS6) 10Mark Bergsma: Use absolute imports for all BGP modules [debs/pybal] - 10https://gerrit.wikimedia.org/r/447809 [14:48:43] (03PS3) 10Alexandros Kosiaris: networks: Remove old and deprecated all_networks var [puppet] - 10https://gerrit.wikimedia.org/r/483443 [14:48:48] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] networks: Remove old and deprecated all_networks var [puppet] - 10https://gerrit.wikimedia.org/r/483443 (owner: 10Alexandros Kosiaris) [14:49:36] (03PS7) 10Gehel: icinga: enable check for psi and omega clusters [puppet] - 10https://gerrit.wikimedia.org/r/484679 (https://phabricator.wikimedia.org/T212850) (owner: 10Mathew.onipe) [14:50:04] 10Operations, 10vm-requests: Site: 1 VM request for recommender-systems - https://phabricator.wikimedia.org/T215421 (10bmansurov) [14:50:21] 10Operations, 10Analytics, 10Research, 10serviceops, and 4 others: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (10bmansurov) [14:50:25] 10Operations, 10vm-requests: Site: 1 VM request for recommender-systems - https://phabricator.wikimedia.org/T215421 (10bmansurov) [14:50:25] !log draining restbase1018 for eventual reboot for kernel security update (bundled with Java update) [14:50:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:52] 10Operations, 10Analytics, 10Research, 10serviceops, and 4 others: Transferring data from Hadoop to production MySQL database - 
https://phabricator.wikimedia.org/T213566 (10bmansurov) Thanks, @Dzahn for the info. I've this task: {T215421}. [14:51:40] (03CR) 10Gehel: [C: 03+2] icinga: enable check for psi and omega clusters [puppet] - 10https://gerrit.wikimedia.org/r/484679 (https://phabricator.wikimedia.org/T212850) (owner: 10Mathew.onipe) [14:52:14] 10Operations, 10Cloud-VPS, 10Toolforge, 10Traffic, 10Patch-For-Review: Wikimedia varnish rules no longer exempt all Cloud VPS/Toolforge IPs from rate limits (HTTP 429 response) - https://phabricator.wikimedia.org/T213475 (10Cyberpower678) This patch doesn't seem to actually fix the issue, just restructur... [14:52:51] 10Operations, 10Security: update libgd2 - https://phabricator.wikimedia.org/T215418 (10jbond) debian/patches/0032-CVE-2019-6978.patch relates to the suggested [[ https://github.com/libgd/libgd/commit/553702980ae89c83f2d6e254d62cf82e204956d0 | patch ]] debian/patches/0033-CVE-2019-6977.patch relates to the sugg... [14:53:08] 10Operations, 10Security: update libgd2 - https://phabricator.wikimedia.org/T215418 (10jbond) [14:55:11] (03PS1) 10Marostegui: check_mariadb.py: Add staging port [puppet] - 10https://gerrit.wikimedia.org/r/488451 (https://phabricator.wikimedia.org/T210478) [14:57:52] 10Operations, 10Security: update libgd2 - https://phabricator.wikimedia.org/T215418 (10jbond) 2.2.4-2+deb9u3 485 -> 2.2.4-2+deb9u4 ` diff -Nru libgd2-2.2.4/debian/changelog libgd2-2.2.4/debian/changelog --- libgd2-2.2.4/debian/changelog 2018-09-07 15:30:40.000000000 +0000 +++ libgd2-2.2.4/debian/changelo... 
[15:00:59] (03PS1) 10Gehel: elasticsearch: fixed duplicated check description [puppet] - 10https://gerrit.wikimedia.org/r/488453 (https://phabricator.wikimedia.org/T212850) [15:01:05] (03PS1) 10Marostegui: dbstore1003: Increase numbre of instances [puppet] - 10https://gerrit.wikimedia.org/r/488454 (https://phabricator.wikimedia.org/T210478) [15:01:23] (03CR) 10Jcrespo: [C: 03+1] "It should be changed on a lot of places, too- backups, mysql.py, etc. as wmfmariadb is not productionized." [puppet] - 10https://gerrit.wikimedia.org/r/488451 (https://phabricator.wikimedia.org/T210478) (owner: 10Marostegui) [15:02:06] (03PS2) 10Marostegui: dbstore1003: Increase number mysql of instances [puppet] - 10https://gerrit.wikimedia.org/r/488454 (https://phabricator.wikimedia.org/T210478) [15:02:24] (03CR) 10Marostegui: "> It should be changed on a lot of places, too- backups, mysql.py," [puppet] - 10https://gerrit.wikimedia.org/r/488451 (https://phabricator.wikimedia.org/T210478) (owner: 10Marostegui) [15:02:27] (03CR) 10Marostegui: [C: 03+2] check_mariadb.py: Add staging port [puppet] - 10https://gerrit.wikimedia.org/r/488451 (https://phabricator.wikimedia.org/T210478) (owner: 10Marostegui) [15:02:27] 10Operations, 10Security: update libgd2 - https://phabricator.wikimedia.org/T215418 (10jbond) 2.2.4-2+deb9u3 485 -> 2.2.4-2+deb9u4 Fix-492-Potential-double-free-in-gdImage-Ptr.patch relates to the suggested [[ https://github.com/libgd/libgd/commit/553702980ae89c83f2d6e254d62cf82e204956d0 | patch ]] debian/pat... 
[15:03:33] 10Operations, 10Security: update libgd2 - https://phabricator.wikimedia.org/T215418 (10jbond) [15:04:12] (03CR) 10Jcrespo: [C: 03+2] mariadb: Depool db2055 for performance testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488410 (https://phabricator.wikimedia.org/T93564) (owner: 10Jcrespo) [15:04:45] (03CR) 10Gehel: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/488204 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [15:05:15] (03Merged) 10jenkins-bot: mariadb: Depool db2055 for performance testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488410 (https://phabricator.wikimedia.org/T93564) (owner: 10Jcrespo) [15:07:30] PROBLEM - ElasticSearch health check for shards on search.svc.codfw.wmnet is CRITICAL: CRITICAL - elasticsearch http://10.2.1.30:9600/_cluster/health error while fetching: HTTPConnectionPool(host=10.2.1.30, port=9600): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(requests.packages.urllib3.connection.HTTPConnection object at 0x7fc162f1c510: Failed to establish a new connection: [Errno 111] Connecti [15:08:22] (03CR) 10jenkins-bot: mariadb: Depool db2055 for performance testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488410 (https://phabricator.wikimedia.org/T93564) (owner: 10Jcrespo) [15:08:56] !log jynus@deploy1001 Synchronized wmf-config/db-codfw.php: Depool db2055 for performance testing T93564 (duration: 00m 55s) [15:08:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:01] T93564: Addition of last hit date to Special:AbuseFilter table - https://phabricator.wikimedia.org/T93564 [15:09:02] ^ elastic error above is a new check, the issue is probably the check itself (I'm on it) [15:10:58] PROBLEM - ElasticSearch health check for shards on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - elasticsearch http://10.2.2.30:9600/_cluster/health error while fetching: HTTPConnectionPool(host=10.2.2.30, 
port=9600): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(requests.packages.urllib3.connection.HTTPConnection object at 0x7fe2c599d510: Failed to establish a new connection: [Errno 111] Connecti [15:11:01] !log installing libav security updates [15:11:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:10] (03PS1) 10Gehel: Revert "icinga: enable check for psi and omega clusters" [puppet] - 10https://gerrit.wikimedia.org/r/488465 [15:11:20] (03PS2) 10Gehel: Revert "icinga: enable check for psi and omega clusters" [puppet] - 10https://gerrit.wikimedia.org/r/488465 [15:12:09] (03CR) 10Gehel: [C: 03+2] Revert "icinga: enable check for psi and omega clusters" [puppet] - 10https://gerrit.wikimedia.org/r/488465 (owner: 10Gehel) [15:12:28] starting at 14:16 we got some very weird mysql errors [15:13:00] they seem to come from mwdebug1002, so maybe some testing? [15:13:21] but it was as if someone was running unit testing on production [15:13:33] or some very broken mediawiki installation [15:13:52] ^ CC _joe_ in case it was some PHP testing [15:14:16] ^ CC greg-g in case it was some weird deploy [15:16:23] 10Operations, 10Security: update libgd2 - https://phabricator.wikimedia.org/T215418 (10jbond) reverse dependencies | package | restart | note | | dvipng | no | cli tool| | graphviz | no | cli tool| | libgvc6 | no | only reverse dep is graphviz | | libnginx-mod-http-image-filter | no | library not used by ngin... [15:16:27] maybe check who was logged into that box at the time? 
[15:17:03] well, anyone can send queries from that machine [15:17:15] (03PS2) 10Alexandros Kosiaris: WIP: Move evaluation of wikimedia_trust/nets to puppet [puppet] - 10https://gerrit.wikimedia.org/r/488445 (https://phabricator.wikimedia.org/T213475) [15:17:43] 10Operations, 10Security: update libgd2 - https://phabricator.wikimedia.org/T215418 (10jbond) [15:19:36] 10Operations, 10Security: update libgd2 - https://phabricator.wikimedia.org/T215418 (10jbond) [15:23:27] 10Operations, 10Security: update libgd2 - https://phabricator.wikimedia.org/T215418 (10jbond) ` source: libgd2 comment: T215418 update_type: library fixes: stretch: 2.2.4-2+deb9u4 jessie: 2.1.0-5+deb8u12 trusty: libraries: - libgd ` [15:24:49] 10Operations, 10Analytics, 10Research, 10serviceops, and 4 others: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (10Ottomata) Marco's suggestion of using mwmaint1002 is not a bad idea... [15:24:58] 10Operations, 10Security: update libgd2 - https://phabricator.wikimedia.org/T215418 (10jbond) [15:27:15] 10Operations, 10Security: update libgd2 - https://phabricator.wikimedia.org/T215418 (10jbond) [15:27:20] looking at the logs, they seem to come from normal queries, just they seem to have gone crazy for some reason [15:27:56] (03CR) 10Gehel: [C: 04-1] Add wdqs data transfer cookbook (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/488256 (https://phabricator.wikimedia.org/T213401) (owner: 10Mathew.onipe) [15:30:29] (03CR) 10Vgutierrez: [C: 03+1] Move Attribute constants from attributes to constants [debs/pybal] - 10https://gerrit.wikimedia.org/r/447808 (owner: 10Mark Bergsma) [15:32:36] (03CR) 10Anomie: Preserve Composer's include paths (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488070 (https://phabricator.wikimedia.org/T215126) (owner: 10Anomie) [15:36:01] 10Operations, 10Performance-Team, 10Traffic, 10media-storage: Automatically clean up unused thumbnails in Swift - 
https://phabricator.wikimedia.org/T211661 (10ori) >>! In T211661#4931056, @fgiunchedi wrote: > And indeed I share the concerns already mentioned, namely making sure we're able to have a bound o... [15:36:20] (03CR) 10Dzahn: "ah of course, i knew the date but forgot about adding it. Erika provided it at https://phabricator.wikimedia.org/T214623#4910053 it's 6/3" [puppet] - 10https://gerrit.wikimedia.org/r/488120 (https://phabricator.wikimedia.org/T214623) (owner: 10Dzahn) [15:41:55] !log rebooting cloudvirt1015 to make sure that nothing drastic changes once libguestfs is installed T215423 [15:41:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:59] T215423: Install libguestfs-tools on cloudvirts? - https://phabricator.wikimedia.org/T215423 [15:42:08] !log installing spice security updates [15:42:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:55] !log powering down thumbor2002 for disk replacement [15:44:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:23] 10Operations, 10Security: update libgd2 - https://phabricator.wikimedia.org/T215418 (10jbond) [15:46:03] (03CR) 10CRusnov: [C: 03+1] "Another option inline." 
(031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/488204 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [15:48:16] PROBLEM - Host thumbor2002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:48:45] ^ papaul is replacing a disk probably [15:50:36] yeah, he log it a few lines above [15:50:44] *logged [15:51:41] we need to add some colours tou "Logged" [15:51:44] to [15:52:06] (03PS1) 10Andrew Bogott: nova: install libguestfs-tools on compute nodes [puppet] - 10https://gerrit.wikimedia.org/r/488480 (https://phabricator.wikimedia.org/T215423) [15:53:40] RECOVERY - Host thumbor2002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 40.32 ms [15:53:47] 10Operations, 10ops-codfw, 10serviceops, 10User-jijiki: Degraded RAID on thumbor2002 - https://phabricator.wikimedia.org/T214813 (10Papaul) a:05Papaul→03jijiki Disk replaced, server didn't boot up. [15:59:26] 10Operations, 10ops-codfw, 10serviceops, 10User-jijiki: Degraded RAID on thumbor2002 - https://phabricator.wikimedia.org/T214813 (10jijiki) @papaul Thank you! 
I will reimage this server, no need to spend more time on it [16:03:45] 10Operations, 10monitoring: Evaluate/integrate rasdaemon as a replacement for mcelog - https://phabricator.wikimedia.org/T205396 (10jbond) a:03jbond [16:08:09] (03CR) 10Gehel: "minor comments in line" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/487924 (https://phabricator.wikimedia.org/T215199) (owner: 10EBernhardson) [16:14:23] (03PS1) 10Jcrespo: Revert "mariadb: Depool db2055 for performance testing" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488483 [16:14:39] !log gehel@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=kartotherian,name=eqiad [16:14:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:08] (03PS3) 10Gehel: admins: create user with analytics-privatedata access for juliaglen [puppet] - 10https://gerrit.wikimedia.org/r/488120 (https://phabricator.wikimedia.org/T214623) (owner: 10Dzahn) [16:19:18] !log running alter table on db2055 T93564 [16:19:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:20] T93564: Addition of last hit date to Special:AbuseFilter table - https://phabricator.wikimedia.org/T93564 [16:21:02] RECOVERY - ElasticSearch health check for shards on search.svc.codfw.wmnet is OK: OK - elasticsearch status production-search-codfw: status: green, number_of_nodes: 30, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 1021, task_max_waiting_in_queue_millis: 0, cluster_name: production-search-codfw, relocating_shards: 0, active_shards_percent_as_number: 100.0, [16:21:02] 047, initializing_shards: 0, number_of_data_nodes: 30, delayed_unassigned_shards: 0 [16:22:42] RECOVERY - ElasticSearch health check for shards on search.svc.eqiad.wmnet is OK: OK - elasticsearch status production-search-eqiad: status: green, number_of_nodes: 35, unassigned_shards: 0, number_of_pending_tasks: 0, 
number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 1031, task_max_waiting_in_queue_millis: 0, cluster_name: production-search-eqiad, relocating_shards: 6, active_shards_percent_as_number: 100.0, [16:22:42] 109, initializing_shards: 0, number_of_data_nodes: 35, delayed_unassigned_shards: 0 [16:24:24] jijiki: probably could (add colors) for the logmsgbot !logs, at least [16:24:37] !log reimaging graphite2002 to buster [16:24:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:52] (03CR) 10Gehel: [C: 03+1] "Very minor comment inline, but otherwise LGTM, feel free to ignore the comment or merge without further review after correction." (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/486529 (https://phabricator.wikimedia.org/T205885) (owner: 10Volans) [16:25:03] (03PS1) 10Mathew.onipe: icinga: enable check for psi and omega clusters [puppet] - 10https://gerrit.wikimedia.org/r/488485 (https://phabricator.wikimedia.org/T212850) [16:25:13] greg-g: yeah, and maybe repeat the the nickname [16:26:29] puppet/hiera question: given a puppet role. Is there a way to know beforehand all hiera lookups that will happen when building the catalog? [16:27:08] 10Operations, 10MediaWiki-Cache, 10serviceops, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 2 others: Use a multi-dc aware store for ObjectCache's MainStash if needed. - https://phabricator.wikimedia.org/T212129 (10EvanProdromou) I did some analysis of how we're using... 
[16:27:17] (03CR) 10Gehel: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/486530 (owner: 10Volans) [16:28:42] (03CR) 10Hashar: [V: 03+2 C: 03+2] Plugins: Add healthcheck plugin jar [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/488101 (https://phabricator.wikimedia.org/T214326) (owner: 10Thcipriani) [16:29:18] (03CR) 10Gehel: [C: 04-1] "minor comment inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/488485 (https://phabricator.wikimedia.org/T212850) (owner: 10Mathew.onipe) [16:30:05] greg-g: who should I ping for this? [16:30:55] jijiki: I think the logmsgbot code is in operations/puppet, as it's some bash(?) script that's installed on our servers [16:31:08] ok ok tx [16:35:25] (03CR) 10Gehel: [C: 04-1] "Minor comments inline" (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/487982 (https://phabricator.wikimedia.org/T205886) (owner: 10Volans) [16:36:27] (03CR) 10Gehel: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/487981 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [16:37:10] (03PS1) 10Papaul: DHCP: Fix fixed-address name [puppet] - 10https://gerrit.wikimedia.org/r/488496 (https://phabricator.wikimedia.org/T214448) [16:38:11] (03CR) 10Papaul: [V: 03+2 C: 03+2] DHCP: Fix fixed-address name [puppet] - 10https://gerrit.wikimedia.org/r/488496 (https://phabricator.wikimedia.org/T214448) (owner: 10Papaul) [16:43:55] (03CR) 10Cwhite: prometheus: upgrade to node-exporter 0.17 (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/486192 (https://phabricator.wikimedia.org/T213708) (owner: 10Cwhite) [16:45:52] 10Operations, 10MediaWiki-Cache, 10serviceops, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 2 others: Use a multi-dc aware store for ObjectCache's MainStash if needed. - https://phabricator.wikimedia.org/T212129 (10EvanProdromou) Actually, it looks like we've got some... 
[16:46:37] 10Operations, 10Patch-For-Review: ferm: Log dropped packets - https://phabricator.wikimedia.org/T116011 (10herron) While not necessarily optimal, it is possible to ingest a file with rsyslog. So, if left with no other option we may be able to ingest json this way for forwarding on. At the same time I think h... [16:46:58] 10Operations, 10Performance-Team, 10Traffic, 10media-storage: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (10Gilles) It's true that even if we only clean up a portion of thumbnails, we're already in a good place. The operational goal is to free up space at... [16:49:11] 10Operations, 10Performance-Team, 10Traffic, 10media-storage: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (10Gilles) What we can do is if we see that the thumbnail already has a X-Delete-After header on get, we update it. If it doesn't have the header, we r... [17:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Morning SWAT (Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190206T1700). [17:00:04] MatmaRex and Zoranzoki21: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[17:01:31] 10Operations, 10Discovery-Search, 10Elasticsearch: Convert check_elasticsearch_shards.py icinga plugin to py3 - https://phabricator.wikimedia.org/T215439 (10Mathew.onipe) [17:01:43] hi [17:02:18] if possible, i would like to go first, i have a meeting later [17:02:25] 10Operations, 10Discovery-Search, 10Elasticsearch: Convert check_elasticsearch_shards.py icinga plugin to py3 - https://phabricator.wikimedia.org/T215439 (10Mathew.onipe) p:05Triage→03Normal [17:02:25] anyone swatting? :) [17:03:38] 10Operations, 10hardware-requests, 10Patch-For-Review: codfw spare pool system for partman testing - https://phabricator.wikimedia.org/T215301 (10Papaul) a:05Papaul→03CDanis - Remove sda from the server - boot the server - server boot without a problem [17:03:56] 10Operations, 10Wikimedia-General-or-Unknown, 10serviceops, 10PHP 7.2 support, 10User-jijiki: mwscript dies on mwmaint with PHP=php7.2 due to php-redis missing - https://phabricator.wikimedia.org/T215376 (10Dzahn) Do we have to install the php7.2-redis package? and / or https://stackoverflow.com/quest... [17:04:24] 10Operations, 10hardware-requests, 10Patch-For-Review: codfw spare pool system for partman testing - https://phabricator.wikimedia.org/T215301 (10Papaul) Put back sda in the server. 
[17:09:12] I suppose I can SWAT (although I am nominally in a meeting :)) [17:09:53] thanks :o [17:10:20] (03CR) 10Hashar: "Random comments :-] Thanks a ton for adding tests to the standard module (which probably be converted to profiles but that is a different " (0312 comments) [puppet] - 10https://gerrit.wikimedia.org/r/487882 (https://phabricator.wikimedia.org/T215275) (owner: 10Jbond) [17:10:24] !log setting db1111 in read-write mode [17:10:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:31] addshore^ [17:10:40] jynus: ty [17:11:12] try now [17:14:49] in a call right now but will be able to try after :) [17:16:35] (03CR) 10Gehel: [C: 04-1] icinga: enable check for psi and omega clusters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/488485 (https://phabricator.wikimedia.org/T212850) (owner: 10Mathew.onipe) [17:16:56] (03PS1) 10Jcrespo: mariadb: Set read_only monitoring for core_test hosts [puppet] - 10https://gerrit.wikimedia.org/r/488504 (https://phabricator.wikimedia.org/T172489) [17:17:51] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Set read_only monitoring for core_test hosts [puppet] - 10https://gerrit.wikimedia.org/r/488504 (https://phabricator.wikimedia.org/T172489) (owner: 10Jcrespo) [17:19:10] (03CR) 10Jcrespo: "Note we could set read_only off by default too-- it shouldn't be on production hosts, but it should be fine for test ones." [puppet] - 10https://gerrit.wikimedia.org/r/488504 (https://phabricator.wikimedia.org/T172489) (owner: 10Jcrespo) [17:20:13] godog: are you the person to talk to about grafana? 
[17:20:29] some really essential grafana boards have gone missing (https://grafana.wikimedia.org/dashboard/db/reading-web-page-previews and https://grafana.wikimedia.org/dashboard/db/reading-web-dashboard?orgId=1) [17:20:48] this is UNBREAK NOW since I need these boards to validate a potential bug that's in production right now [17:21:12] getting "dashboard not found" [17:21:46] (03PS2) 10Jcrespo: mariadb: Set read_only monitoring for core_test hosts [puppet] - 10https://gerrit.wikimedia.org/r/488504 (https://phabricator.wikimedia.org/T172489) [17:22:26] jdlrobson: looks like they moved a bit? https://grafana.wikimedia.org/d/000000340/page-previews?refresh=1m&orgId=1 [17:22:29] jdlrobson https://grafana.wikimedia.org/dashboards/f/8GFIViXmz/readers-web [17:22:45] phew thanks paladox addshore that's reassuring! [17:23:00] jdlrobson: that may have been the url change on upgrade? [17:23:12] strange, because it redirected for me automatically [17:23:19] on previous links I had [17:24:00] it could be that maybe someone edited them and didn't communicate them properly [17:24:09] I think someone changed the names of these when organizing them into folders, so not just the update, the update itself didn't break them / change the names [17:24:14] yupp [17:25:11] MatmaRex: can both of your MobileFrontend changes go out at the same time? Or do I need to do them one-at-a-time? [17:25:24] thcipriani: they should be fine to go out together [17:25:31] great :) [17:26:10] jynus: i can confirm that I can write to that db host now, thanks! [17:26:32] not super urgent, but I sent a review your way [17:26:41] not for the technical details [17:26:50] but to handle read-only monitoring [17:26:59] 10Operations, 10Cloud-Services, 10Patch-For-Review: rack/setup/install cloudcontrol2001-dev & cloudvirt200[123]-dev - https://phabricator.wikimedia.org/T214448 (10Papaul) [17:27:03] e.g. should it be read only when codfw is active, etc. 
[17:27:13] ^addshore [17:27:45] sorry for the problems caused, defaulting to read_only is the safe option for production [17:28:22] thanks for the quick response :D [17:28:53] thcipriani: ooh, i just noticed that wmf.16 is not deployed anywhere yet? mw.org is on wmf.14 [17:29:13] MatmaRex: it's on testwiki only currently [17:29:16] (this is fine by me, it just means we have no way to confirm the fixes in production. but i'm confident they will be fine) [17:29:19] ah, okay [17:29:20] able to test there? [17:29:52] MatmaRex: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MobileFrontend/+/488492 is live on mwdebug1002 if you can test on testwiki, FYI [17:30:01] still waiting on jenkins for the other'n [17:30:26] (03CR) 10Nuria: "Contract is not signed yet. Contract expires on May 31, 2019. we can update puppet but let's wait for contract to be signed before merging" [puppet] - 10https://gerrit.wikimedia.org/r/488120 (https://phabricator.wikimedia.org/T214623) (owner: 10Dzahn) [17:30:57] MatmaRex: correction, *now* it's on mwdebug1002 (sorry) [17:32:17] looking [17:37:38] thcipriani: (that one looks good btw) [17:37:55] cool, I'll go ahead and sync that one and prep the other [17:39:59] !log thcipriani@deploy1001 Synchronized php-1.33.0-wmf.16/extensions/MobileFrontend: SWAT: [[gerrit:488492|EditorOverlay: Pass constructor of itself to VisualEditorOverlay, not instance]] T215408 (duration: 00m 57s) [17:40:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:02] T215408: Exception when switching from mobile VE to wikitext: "Uncaught TypeError: e is not a constructor" - https://phabricator.wikimedia.org/T215408 [17:40:22] Hi [17:40:31] Sorry for big lating [17:40:42] Do we have SWAT? [17:40:46] Currently? [17:41:08] MatmaRex: 2nd change is on mwdebug1002, check please [17:41:09] Zoranzoki21: yeah. you're not late yet ;) [17:41:24] ^ [17:41:29] OMG excellent! 
[17:42:07] 10Operations, 10Cloud-Services, 10Patch-For-Review: rack/setup/install cloudcontrol2001-dev & cloudvirt200[123]-dev - https://phabricator.wikimedia.org/T214448 (10Papaul) [17:42:58] thcipriani: it's super slow, i'm trying to test [17:43:11] but things are taking forever to load [17:43:20] (03CR) 10Thcipriani: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/487291 (owner: 10Zoranzoki21) [17:43:35] y'all need to turn that server off and on again or something [17:44:06] thcipriani: 487291 no needs testing [17:44:15] eventually you can check logs :) [17:44:28] (03Merged) 10jenkins-bot: dblists/s3.dblist: Fix sorting of list of wikis per alphabetical order [mediawiki-config] - 10https://gerrit.wikimedia.org/r/487291 (owner: 10Zoranzoki21) [17:44:35] thcipriani: does this page load for you when using mwdebug1002? https://test.m.wikipedia.org/w/index.php?title=User:Matma_Rex/sandbox [17:44:44] i tried to refresh it and now it's refusing to load [17:44:46] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal, 10WMF-NDA-Requests: Request to add alaasarhan to the wmde LDAP group - https://phabricator.wikimedia.org/T215066 (10RStallman-legalteam) The NDA is signed and on file now. All set to move forward. Thanks! [17:45:16] MatmaRex: hhvm certainly appears to be struggling on that box [17:45:42] (03PS3) 10Alexandros Kosiaris: WIP: Move evaluation of wikimedia_trust/nets to puppet [puppet] - 10https://gerrit.wikimedia.org/r/488445 (https://phabricator.wikimedia.org/T213475) [17:45:43] oh, it went through this time [17:45:47] 29 seconds [17:46:25] thcipriani: looks good! [17:46:31] It is really slow, I tried to go at https://test.m.wikipedia.org/w/index.php?title=User:Matma_Rex/sandbox [17:46:38] MatmaRex: great, will sync [17:46:47] testwiki doesn't get a lot of hits, guessing the hhvm cache is still cold without mw.org on wmf.16 [17:47:30] hhvm will be changed to php7? 
[17:48:25] Soon™ is my understand [17:48:27] ing [17:48:35] (03CR) 10jenkins-bot: dblists/s3.dblist: Fix sorting of list of wikis per alphabetical order [mediawiki-config] - 10https://gerrit.wikimedia.org/r/487291 (owner: 10Zoranzoki21) [17:49:47] !log thcipriani@deploy1001 Synchronized php-1.33.0-wmf.16/extensions/MobileFrontend: SWAT: [[gerrit:488494|VE: Load HTML in parallel with modules]] T209052 (duration: 00m 57s) [17:49:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:49] T209052: Load page content in parallel with VE code on Mobile with ArticleTargetLoader - https://phabricator.wikimedia.org/T209052 [17:49:54] ^ MatmaRex live now [17:50:22] Zoranzoki21: ok, on to your patches, but I have to leave in 10 minutes so I may not get to all of them :( [17:50:23] (03PS1) 10Giuseppe Lavagetto: tlsproxy::instance: move under profile namespace [puppet] - 10https://gerrit.wikimedia.org/r/488509 [17:50:53] thcipriani: Patches which are priority are 486536 and 486538 [17:51:18] (03CR) 10Dzahn: [C: 04-1] "@Nuria we have 2 conflicting end dates 6/30 vs 5/31 https://phabricator.wikimedia.org/T214623#4910053 and yes, this is definitely waiti" [puppet] - 10https://gerrit.wikimedia.org/r/488120 (https://phabricator.wikimedia.org/T214623) (owner: 10Dzahn) [17:51:20] thcipriani: Other patches I can move for next.. 
You can synchronize 487291 [17:51:33] (03CR) 10Thcipriani: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486538 (https://phabricator.wikimedia.org/T214562) (owner: 10Zoranzoki21) [17:51:37] k [17:52:38] (03Merged) 10jenkins-bot: Changed wgImportSources for srwikinews to w:sr instead of no which is unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486538 (https://phabricator.wikimedia.org/T214562) (owner: 10Zoranzoki21) [17:53:08] !log thcipriani@deploy1001 Synchronized dblists/s3.dblist: SWAT: [[gerrit:487291|dblists/s3.dblist: Fix sorting of list of wikis per alphabetical order]] (duration: 00m 54s) [17:53:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:22] ^ Zoranzoki21 there's the dblist one [17:53:28] (03PS1) 10Mathew.onipe: icinga: remove check_elasticsearch_shard command [puppet] - 10https://gerrit.wikimedia.org/r/488511 [17:53:39] thcipriani: ok [17:53:52] 10Operations, 10SRE-Access-Requests, 10Discovery-Search (Current work), 10Patch-For-Review: Analytics query access for search platform NLP contractor @Julia.glen - https://phabricator.wikimedia.org/T214623 (10Dzahn) >>! In T214623#4910053, @EBjune wrote: > The official contract end date in our system is 6/... 
[17:53:56] (03CR) 10Nuria: "From Contract: "remain in full force and effect beginning on January 8, 2019 and" [puppet] - 10https://gerrit.wikimedia.org/r/488120 (https://phabricator.wikimedia.org/T214623) (owner: 10Dzahn) [17:54:10] Zoranzoki21: 486538 is live on mwdebug1002, check please [17:54:29] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal, 10WMF-NDA-Requests: Request to add alaasarhan to the wmde LDAP group - https://phabricator.wikimedia.org/T215066 (10Dzahn) a:03Dzahn [17:54:36] (03CR) 10Thcipriani: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486536 (https://phabricator.wikimedia.org/T214561) (owner: 10Zoranzoki21) [17:54:41] (03Abandoned) 10Paladox: gerrit: Set zuulUrl for plugin zuul-status [puppet] - 10https://gerrit.wikimedia.org/r/487619 (https://phabricator.wikimedia.org/T214068) (owner: 10Paladox) [17:54:50] thcipriani: testing [17:54:54] (03PS2) 10Gehel: icinga: enable check for psi and omega clusters [puppet] - 10https://gerrit.wikimedia.org/r/488485 (https://phabricator.wikimedia.org/T212850) (owner: 10Mathew.onipe) [17:56:17] thcipriani: works [17:56:27] Zoranzoki21: ok, thanks for checking, going live [17:57:31] !log thcipriani@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:486538|Changed wgImportSources for srwikinews]] T214562 (duration: 00m 53s) [17:57:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:36] T214562: Enable import from Serbian Wikipedia on srwikinews - https://phabricator.wikimedia.org/T214562 [17:57:36] ^ Zoranzoki21 live now [17:57:59] thcipriani: OK, works. 
Now 486536 [17:58:45] thcipriani: Maintenance script namespaceDupes.php no needed now [17:58:51] Namespaces are empty [17:59:44] (03PS4) 10Thcipriani: Removed namespace Коментар, added namespace Портал on srwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486536 (https://phabricator.wikimedia.org/T214561) (owner: 10Zoranzoki21) [17:59:49] (03CR) 10Thcipriani: Removed namespace Коментар, added namespace Портал on srwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486536 (https://phabricator.wikimedia.org/T214561) (owner: 10Zoranzoki21) [17:59:55] (03CR) 10Thcipriani: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486536 (https://phabricator.wikimedia.org/T214561) (owner: 10Zoranzoki21) [17:59:57] (03CR) 10jenkins-bot: Changed wgImportSources for srwikinews to w:sr instead of no which is unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486538 (https://phabricator.wikimedia.org/T214562) (owner: 10Zoranzoki21) [18:00:03] (03PS4) 10Alexandros Kosiaris: Move evaluation of wikimedia_trust/nets to puppet [puppet] - 10https://gerrit.wikimedia.org/r/488445 (https://phabricator.wikimedia.org/T213475) [18:00:05] (03PS1) 10Alexandros Kosiaris: varnish: Add new WMCS IP space as trusted [puppet] - 10https://gerrit.wikimedia.org/r/488516 (https://phabricator.wikimedia.org/T213475) [18:00:42] (03CR) 10Alexandros Kosiaris: "PCC at https://puppet-compiler.wmflabs.org/compiler1002/14564/cp3030.esams.wmnet/ says it's effectively a noop" [puppet] - 10https://gerrit.wikimedia.org/r/488445 (https://phabricator.wikimedia.org/T213475) (owner: 10Alexandros Kosiaris) [18:01:07] (03Merged) 10jenkins-bot: Removed namespace Коментар, added namespace Портал on srwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486536 (https://phabricator.wikimedia.org/T214561) (owner: 10Zoranzoki21) [18:01:31] !log LDAP - adding alaasarhan to wmde (T215066) [18:01:34] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:34] T215066: Request to add alaasarhan to the wmde LDAP group - https://phabricator.wikimedia.org/T215066 [18:01:37] Zoranzoki21: live on mwdebug1002, check please [18:01:50] thcipriani: Let me do it [18:02:01] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal, 10WMF-NDA-Requests: Request to add alaasarhan to the wmde LDAP group - https://phabricator.wikimedia.org/T215066 (10Dzahn) [18:02:40] thcipriani: LGTM [18:02:46] Zoranzoki21: going live now [18:02:56] 10Operations, 10SRE-Access-Requests, 10Discovery-Search (Current work), 10Patch-For-Review: Analytics query access for search platform NLP contractor @Julia.glen - https://phabricator.wikimedia.org/T214623 (10Nuria) As of today: - contract needs to be signed by a c-level (@EBjune) - contract lists: "remai... [18:03:11] (03CR) 10Gehel: [C: 04-1] icinga: enable check for psi and omega clusters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/488485 (https://phabricator.wikimedia.org/T212850) (owner: 10Mathew.onipe) [18:03:30] 10Operations, 10netops: Spike of multicast traffic - https://phabricator.wikimedia.org/T212273 (10ayounsi) My guess so far is that the recabling triggered a bug in Junos VCF which caused a multicast storm that got propagated to all listeners, filling up links and exhausting resources. After some research, ther... 
[18:03:32] 10Operations, 10SRE-Access-Requests, 10Discovery-Search (Current work), 10Patch-For-Review: Analytics query access for search platform NLP contractor @Julia.glen - https://phabricator.wikimedia.org/T214623 (10Nuria) Final contract might move those dates, if so I will let everyone know when i see it [18:03:55] !log thcipriani@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:486536|Removed namespace Коментар, added namespace Портал on srwikinews]] T214561 T214563 (duration: 00m 53s) [18:03:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:59] ^ Zoranzoki21 live now [18:03:59] T214563: Add namespace Портал on srwikinews - https://phabricator.wikimedia.org/T214563 [18:03:59] T214561: Remove namespace Коментар from srwikinews - https://phabricator.wikimedia.org/T214561 [18:05:02] thcipriani: Thanks! [18:05:41] Everything works [18:05:43] (03PS1) 10Dzahn: admins: add Alaa Sarhan to ldap_only admins [puppet] - 10https://gerrit.wikimedia.org/r/488520 (https://phabricator.wikimedia.org/T215066) [18:07:00] (03PS18) 10Cwhite: prometheus: upgrade to node-exporter 0.17 [puppet] - 10https://gerrit.wikimedia.org/r/486192 (https://phabricator.wikimedia.org/T213708) [18:07:13] (03CR) 10Dzahn: [C: 03+2] admins: add Alaa Sarhan to ldap_only admins [puppet] - 10https://gerrit.wikimedia.org/r/488520 (https://phabricator.wikimedia.org/T215066) (owner: 10Dzahn) [18:08:44] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal, 10WMF-NDA-Requests, 10Patch-For-Review: Request to add alaasarhan to the wmde LDAP group - https://phabricator.wikimedia.org/T215066 (10Dzahn) 05Open→03Resolved Thanks @RStallman-legalteam , going ahead. @alaa_wmde Done, i added you. Things should... 
[18:09:01] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal, 10WMF-NDA-Requests: Request to add alaasarhan to the wmde LDAP group - https://phabricator.wikimedia.org/T215066 (10Dzahn) [18:12:22] (03CR) 10jenkins-bot: Removed namespace Коментар, added namespace Портал on srwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486536 (https://phabricator.wikimedia.org/T214561) (owner: 10Zoranzoki21) [18:16:58] (03PS2) 10Mathew.onipe: maps: migrate maps2004 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/487360 (https://phabricator.wikimedia.org/T198622) [18:17:29] (03CR) 10Mathew.onipe: maps: migrate maps2004 to stretch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/487360 (https://phabricator.wikimedia.org/T198622) (owner: 10Mathew.onipe) [18:24:24] (03CR) 10Cwhite: [C: 03+2] aptrepo: add prometheus-node-exporter components for all dists [puppet] - 10https://gerrit.wikimedia.org/r/486493 (https://phabricator.wikimedia.org/T213708) (owner: 10Cwhite) [18:24:35] (03PS4) 10Cwhite: aptrepo: add prometheus-node-exporter components for all dists [puppet] - 10https://gerrit.wikimedia.org/r/486493 (https://phabricator.wikimedia.org/T213708) [18:28:01] (03CR) 10Jcrespo: [C: 03+1] "blocked on SWAT, but ready to deploy any time now (alter cleaned up)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488483 (owner: 10Jcrespo) [18:32:51] (03PS1) 10Krinkle: mediawiki: Remove beta-cluster specific auto_prepend_file override [puppet] - 10https://gerrit.wikimedia.org/r/488524 (https://phabricator.wikimedia.org/T176370) [18:33:23] (03PS3) 10Krinkle: PhpAutoPrepend: Remove PhpAutoPrepend-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486177 [18:38:02] (03PS1) 10Andrew Bogott: Remove unused role::wmcs::web_interfaces [puppet] - 10https://gerrit.wikimedia.org/r/488526 (https://phabricator.wikimedia.org/T215443) [18:40:58] (03PS2) 10Andrew Bogott: Remove unused role::wmcs::web_interfaces [puppet] - 10https://gerrit.wikimedia.org/r/488526 
(https://phabricator.wikimedia.org/T215443) [18:42:28] (03CR) 10Andrew Bogott: [C: 03+2] Remove unused role::wmcs::web_interfaces [puppet] - 10https://gerrit.wikimedia.org/r/488526 (https://phabricator.wikimedia.org/T215443) (owner: 10Andrew Bogott) [18:45:29] (03PS1) 10Cwhite: prometheus: make rules and alerts configuration backwards compatible in beta [puppet] - 10https://gerrit.wikimedia.org/r/488530 [18:45:45] (03PS2) 10Andrew Bogott: nova: install libguestfs-tools on compute nodes [puppet] - 10https://gerrit.wikimedia.org/r/488480 (https://phabricator.wikimedia.org/T215423) [18:46:59] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "A combination of this and of a hard timeout at the php-fpm level should work decently well in reproducing the wall-clock timeout we get in" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/487069 (owner: 10Tim Starling) [18:47:01] (03CR) 10EBernhardson: mwgrep: Query all search clusters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/487924 (https://phabricator.wikimedia.org/T215199) (owner: 10EBernhardson) [18:47:03] (03PS5) 10EBernhardson: mwgrep: Query all search clusters [puppet] - 10https://gerrit.wikimedia.org/r/487924 (https://phabricator.wikimedia.org/T215199) [18:47:28] 10Operations, 10Wikimedia-General-or-Unknown, 10serviceops, 10PHP 7.2 support, 10User-jijiki: mwscript dies on mwmaint with PHP=php7.2 due to php-redis missing - https://phabricator.wikimedia.org/T215376 (10Tgr) IIRC (we ran into similar issues on Vagrant in {T213016}) there is no php7.2-redis, just a si... 
[18:48:07] (03CR) 10Andrew Bogott: [C: 03+2] nova: install libguestfs-tools on compute nodes [puppet] - 10https://gerrit.wikimedia.org/r/488480 (https://phabricator.wikimedia.org/T215423) (owner: 10Andrew Bogott) [18:48:49] (03PS4) 10Cwhite: role: add backwards-compatibility rules to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/485889 (https://phabricator.wikimedia.org/T213708) [18:55:57] (03PS5) 10Cwhite: role: add backwards-compatibility rules to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/485889 (https://phabricator.wikimedia.org/T213708) [18:56:12] 10Operations, 10Wikimedia-General-or-Unknown, 10serviceops, 10PHP 7.2 support, 10User-jijiki: mwscript dies on mwmaint with PHP=php7.2 due to php-redis missing - https://phabricator.wikimedia.org/T215376 (10Reedy) In `modules/contint/manifests/packages/php.pp` we're doing `ensure => latest` ` reedy@depl... [18:59:20] (03PS1) 10Elukey: Add staging-db-analytics.eqiad.wmnet CNAME to dbstore1003 [dns] - 10https://gerrit.wikimedia.org/r/488535 (https://phabricator.wikimedia.org/T210478) [19:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190206T1900) [19:01:10] (03PS1) 10Dzahn: admins: add Hana Worku to ldap_only admins [puppet] - 10https://gerrit.wikimedia.org/r/488536 (https://phabricator.wikimedia.org/T215352) [19:03:07] !log jforrester@deploy1001 Synchronized php-1.33.0-wmf.14/extensions/Flow/includes/Conversion/Utils.php: I405dd193 Update Parsoid Accept header to 2.0.0 so service can deploy (duration: 00m 56s) [19:03:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:09] (03PS1) 10Dzahn: admins: add Eric Gardner to ldap_only admins [puppet] - 10https://gerrit.wikimedia.org/r/488537 (https://phabricator.wikimedia.org/T214654) [19:04:32] !log jforrester@deploy1001 Synchronized php-1.33.0-wmf.16/extensions/Flow/includes/Conversion/Utils.php: I405dd193 Update Parsoid Accept header to 2.0.0 so 
service can deploy (duration: 00m 54s) [19:04:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:41] (03PS22) 10Jbond: Create module for managing ulogd [puppet] - 10https://gerrit.wikimedia.org/r/486513 (https://phabricator.wikimedia.org/T116011) [19:06:43] (03CR) 10Eric Gardner: [C: 03+1] "This all looks correct to me, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/488537 (https://phabricator.wikimedia.org/T214654) (owner: 10Dzahn) [19:07:39] (03PS1) 10Herron: logstash: curator: re-order replica prune to occur before forcemerge [puppet] - 10https://gerrit.wikimedia.org/r/488538 (https://phabricator.wikimedia.org/T213078) [19:09:46] !log thcipriani@deploy1001 Started deploy [gerrit/gerrit@3272a46]: Add healthcheck plugin (no restart) gerrit2001 first [19:09:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:50] !log LDAP - adding afandian2 and toddleroux to nda (T214727) [19:09:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:53] T214727: LDAP nda access for afandian2 and toddleroux - https://phabricator.wikimedia.org/T214727 [19:09:57] !log thcipriani@deploy1001 Finished deploy [gerrit/gerrit@3272a46]: Add healthcheck plugin (no restart) gerrit2001 first (duration: 00m 10s) [19:09:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:02] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: curator: re-order replica prune to occur before forcemerge [puppet] - 10https://gerrit.wikimedia.org/r/488538 (https://phabricator.wikimedia.org/T213078) (owner: 10Herron) [19:11:27] !log thcipriani@deploy1001 Started deploy [gerrit/gerrit@3272a46]: Add healthcheck plugin (no restart) cobalt T214326 [19:11:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:30] T214326: Install "healthcheck" plugin on gerrit - https://phabricator.wikimedia.org/T214326 [19:11:36] !log thcipriani@deploy1001 Finished deploy 
[gerrit/gerrit@3272a46]: Add healthcheck plugin (no restart) cobalt T214326 (duration: 00m 09s) [19:11:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:45] (03PS2) 10Herron: logstash: curator: re-order replica prune to occur before forcemerge [puppet] - 10https://gerrit.wikimedia.org/r/488538 (https://phabricator.wikimedia.org/T213078) [19:11:52] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: upgrade to node-exporter 0.17 [puppet] - 10https://gerrit.wikimedia.org/r/486192 (https://phabricator.wikimedia.org/T213708) (owner: 10Cwhite) [19:11:56] 10Operations, 10ops-ulsfo, 10decommission: decommission/replace bast4001.wikimedia.org - https://phabricator.wikimedia.org/T178592 (10RobH) [19:12:03] !log milimetric@deploy1001 Started deploy [analytics/refinery@cd413dd]: Small bug fix for history checker [19:12:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:27] !!log taking cp4026 offline to flash firmware and reseat dimm for testing on T214516 [19:13:28] T214516: cp4026 correctable dimm error - https://phabricator.wikimedia.org/T214516 [19:13:36] bahhhh, too many ! [19:13:39] !log taking cp4026 offline to flash firmware and reseat dimm for testing on T214516 [19:13:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:32] 10Operations, 10ops-codfw, 10decommission: Decom mw2213 - https://phabricator.wikimedia.org/T203434 (10Papaul) [19:16:35] (03PS23) 10Jbond: Create module for managing ulogd [puppet] - 10https://gerrit.wikimedia.org/r/486513 (https://phabricator.wikimedia.org/T116011) [19:16:47] 10Operations, 10ops-ulsfo, 10Traffic: cp4026 correctable dimm error - https://phabricator.wikimedia.org/T214516 (10RobH) ` robh@cp4026:~$ sudo ipmi-sel ID | Date | Time | Name | Type | Event 1 | Apr-23-2017 | 23:39:37 | SEL | Event Logging Disabled... 
[19:17:07] 10Operations, 10Cloud-Services, 10Kubernetes: etcd config depends on puppet certs, but puppet doesn't know - https://phabricator.wikimedia.org/T169287 (10Bstorm) I think I see what's up here. In [[ https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/etcd/manifest... [19:18:36] 10Operations, 10Wikimedia-General-or-Unknown, 10serviceops, 10PHP 7.2 support, 10User-jijiki: mwscript dies on mwmaint with PHP=php7.2 due to php-redis missing - https://phabricator.wikimedia.org/T215376 (10Dzahn) >>! In T215376#4932577, @Reedy wrote: > In `modules/contint/manifests/packages/php.pp` we'r... [19:19:12] (03PS24) 10Jbond: Create module for managing ulogd [puppet] - 10https://gerrit.wikimedia.org/r/486513 (https://phabricator.wikimedia.org/T116011) [19:19:37] (03CR) 10Filippo Giunchedi: [C: 03+1] role: add backwards-compatibility rules to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/485889 (https://phabricator.wikimedia.org/T213708) (owner: 10Cwhite) [19:20:09] (03CR) 10Herron: [C: 03+2] logstash: curator: re-order replica prune to occur before forcemerge [puppet] - 10https://gerrit.wikimedia.org/r/488538 (https://phabricator.wikimedia.org/T213078) (owner: 10Herron) [19:22:53] 10Operations, 10ops-eqiad, 10ops-eqsin, 10netops: Deploy cr2-eqsin - https://phabricator.wikimedia.org/T213121 (10ayounsi) >>! In T205487#4932700, @Cmjohnson wrote: > @ayounsi This is the contents that I am shipping to eqsin. Please confirm that is all you need > > 12 SFP-10GLR Transceivers > (2) 3M LC-... 
[19:24:49] !log milimetric@deploy1001 Finished deploy [analytics/refinery@cd413dd]: Small bug fix for history checker (duration: 12m 45s) [19:24:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:51] PROBLEM - IPsec on cp2022 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp4026_v4, cp4026_v6 [19:24:53] PROBLEM - IPsec on cp1090 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp4026_v4, cp4026_v6 [19:24:59] PROBLEM - IPsec on cp1086 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp4026_v4, cp4026_v6 [19:25:03] PROBLEM - IPsec on cp2014 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp4026_v4, cp4026_v6 [19:25:05] PROBLEM - IPsec on cp2026 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp4026_v4, cp4026_v6 [19:25:09] PROBLEM - IPsec on cp2018 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp4026_v4, cp4026_v6 [19:25:11] PROBLEM - IPsec on cp2020 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp4026_v4, cp4026_v6 [19:25:13] PROBLEM - IPsec on cp1082 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp4026_v4, cp4026_v6 [19:25:21] PROBLEM - IPsec on cp1084 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp4026_v4, cp4026_v6 [19:25:23] PROBLEM - IPsec on cp2005 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp4026_v4, cp4026_v6 [19:25:23] PROBLEM - IPsec on cp2002 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp4026_v4, cp4026_v6 [19:25:25] PROBLEM - IPsec on cp1076 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp4026_v4, cp4026_v6 [19:25:29] PROBLEM - IPsec on cp2008 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp4026_v4, cp4026_v6 [19:25:29] PROBLEM - IPsec on cp2017 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp4026_v4, cp4026_v6 [19:25:30] (03PS25) 10Jbond: Create module for managing ulogd [puppet] - 10https://gerrit.wikimedia.org/r/486513 (https://phabricator.wikimedia.org/T116011) [19:25:37] PROBLEM - IPsec on cp2024 is CRITICAL: Strongswan CRITICAL - ok: 62 
not-conn: cp4026_v4, cp4026_v6 [19:25:39] PROBLEM - IPsec on cp2025 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp4026_v4, cp4026_v6 [19:25:39] PROBLEM - IPsec on cp2011 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp4026_v4, cp4026_v6 [19:25:52] that is expected [19:26:02] I took down cp4026, it will ipsec alert [19:26:05] PROBLEM - IPsec on cp1080 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp4026_v4, cp4026_v6 [19:26:05] PROBLEM - IPsec on cp1088 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp4026_v4, cp4026_v6 [19:26:07] PROBLEM - IPsec on cp1078 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp4026_v4, cp4026_v6 [19:26:15] it's better to have it spam with that host being down than miss others [19:47:32] 10Operations, 10ops-eqiad: WMF7426 fails to accept racadm powercycle commands - https://phabricator.wikimedia.org/T215338 (10Cmjohnson) pulled the power to do a hard reset but the function still does not work. Attempting to update idrac to latest version [19:50:53] !log updated firmware on cp4026 and re-seated (already well seated) dimm b3. 
errors have cleared for now T214516 [19:50:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:56] T214516: cp4026 correctable dimm error - https://phabricator.wikimedia.org/T214516 [19:53:17] RECOVERY - IPsec on cp2024 is OK: Strongswan OK - 64 ESP OK [19:53:17] RECOVERY - IPsec on cp2025 is OK: Strongswan OK - 64 ESP OK [19:53:17] RECOVERY - IPsec on cp2011 is OK: Strongswan OK - 64 ESP OK [19:53:39] RECOVERY - IPsec on cp1080 is OK: Strongswan OK - 72 ESP OK [19:53:39] RECOVERY - IPsec on cp1088 is OK: Strongswan OK - 72 ESP OK [19:53:41] RECOVERY - IPsec on cp1078 is OK: Strongswan OK - 72 ESP OK [19:53:47] RECOVERY - IPsec on cp1090 is OK: Strongswan OK - 72 ESP OK [19:53:49] RECOVERY - IPsec on cp2022 is OK: Strongswan OK - 64 ESP OK [19:53:53] RECOVERY - IPsec on cp1086 is OK: Strongswan OK - 72 ESP OK [19:54:01] RECOVERY - IPsec on cp2014 is OK: Strongswan OK - 64 ESP OK [19:54:03] RECOVERY - IPsec on cp2026 is OK: Strongswan OK - 64 ESP OK [19:54:07] RECOVERY - IPsec on cp1082 is OK: Strongswan OK - 72 ESP OK [19:54:07] RECOVERY - IPsec on cp2018 is OK: Strongswan OK - 64 ESP OK [19:54:09] RECOVERY - IPsec on cp2020 is OK: Strongswan OK - 64 ESP OK [19:54:13] RECOVERY - IPsec on cp1084 is OK: Strongswan OK - 72 ESP OK [19:54:19] RECOVERY - IPsec on cp1076 is OK: Strongswan OK - 72 ESP OK [19:54:21] RECOVERY - IPsec on cp2005 is OK: Strongswan OK - 64 ESP OK [19:54:21] RECOVERY - IPsec on cp2002 is OK: Strongswan OK - 64 ESP OK [19:54:25] RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 64 ESP OK [19:54:27] RECOVERY - IPsec on cp2017 is OK: Strongswan OK - 64 ESP OK [19:55:33] 10Operations, 10ops-ulsfo, 10Traffic: cp4026 correctable dimm error - https://phabricator.wikimedia.org/T214516 (10RobH) 05Open→03Resolved Ok, things I did to fix this system so far: * set system and services/mgmt to maint mode for 2 hours * updated task with full SEL log output * powered off system * u... 
[19:55:43] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1073 - https://phabricator.wikimedia.org/T215050 (10Cmjohnson) The disk has been replaced but I also a bad disk on slot 6. leaving this open until tomorrow and will replace it [19:55:46] 10Operations, 10TechCom, 10Core Platform Team (Session Management Service (CDP2)), 10Core Platform Team Backlog (Later), and 4 others: Establish an SLA for session storage - https://phabricator.wikimedia.org/T211721 (10EvanProdromou) a:05Eevans→03EvanProdromou So, I'm going to try to get some numbers o... [20:00:04] twentyafterfour: That opportune time is upon us again. Time for a MediaWiki train - Americas version deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190206T2000). [20:03:16] !log Resuming the MediaWiki train for version 1.33.0-wmf.16. Will deploy Group0 wikis first and then catch up to group1 after a few minutes monitoring logs for stability. [20:03:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:44] (03PS1) 1020after4: group0 wikis to 1.33.0-wmf.16 refs T206670 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488555 [20:12:46] (03CR) 1020after4: [C: 03+2] group0 wikis to 1.33.0-wmf.16 refs T206670 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488555 (owner: 1020after4) [20:13:58] (03Merged) 10jenkins-bot: group0 wikis to 1.33.0-wmf.16 refs T206670 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488555 (owner: 1020after4) [20:17:59] 10Operations, 10ops-eqiad: WMF7426 fails to accept racadm powercycle commands - https://phabricator.wikimedia.org/T215338 (10Cmjohnson) 05Open→03Resolved @RobH updated f/w and bios....all is well. resolving root@wmf7426.mgmt.eqiad.wmnet's password: /admin1-> racadm serveraction powercycle Server power ope... 
[20:20:28] 10Operations, 10Gerrit, 10Icinga, 10monitoring, and 2 others: Install "healthcheck" plugin on gerrit - https://phabricator.wikimedia.org/T214326 (10thcipriani) 05Open→03Resolved a:03thcipriani ` (/^ヮ^)/*:・゚✧ curl 'https://gerrit.wikimedia.org/r/config/server/healthcheck~status' )]}' { "elapsed": 30... [20:25:19] 10Operations, 10Gerrit, 10Release-Engineering-Team: gerrit: Add a icinga check that uses the healthcheck endpoint - https://phabricator.wikimedia.org/T215457 (10Paladox) [20:25:39] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.33.0-wmf.16 refs T206670 [20:27:41] twentyafterfour@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [20:27:42] T206670: 1.33.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T206670 [20:28:45] (03CR) 10jenkins-bot: group0 wikis to 1.33.0-wmf.16 refs T206670 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488555 (owner: 1020after4) [20:30:21] 10Operations, 10Gerrit, 10Icinga, 10Release-Engineering-Team, and 2 others: gerrit: Add a icinga check that uses the healthcheck endpoint - https://phabricator.wikimedia.org/T215457 (10Dzahn) [20:30:33] 10Operations, 10Gerrit, 10Icinga, 10Release-Engineering-Team, and 2 others: gerrit: Add a icinga check that uses the healthcheck endpoint - https://phabricator.wikimedia.org/T215457 (10Dzahn) p:05Triage→03Normal a:03Dzahn [20:35:37] 10Operations, 10Gerrit, 10Icinga, 10Release-Engineering-Team, and 2 others: gerrit: Add a icinga check that uses the healthcheck endpoint - https://phabricator.wikimedia.org/T215457 (10Dzahn) So you said checking for status 200 is enough here? We don't need to bother looking for the string "Passed" or some... 
[20:35:38] !log 1.33.0-wmf.16 has a significantly higher rate of "entire web request took longer than 60 seconds and timed out" [20:35:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:49] (03PS2) 10Dzahn: admins: add Eric Gardner to ldap_only admins [puppet] - 10https://gerrit.wikimedia.org/r/488537 (https://phabricator.wikimedia.org/T214654) [20:37:30] 10Operations, 10Gerrit, 10Icinga, 10Release-Engineering-Team, and 2 others: gerrit: Add a icinga check that uses the healthcheck endpoint - https://phabricator.wikimedia.org/T215457 (10Paladox) Yup, if any of the checks fail it will return 500. See https://gerrit.googlesource.com/plugins/healthcheck/#how-... [20:37:53] (03CR) 10Dzahn: [C: 03+2] "new staff per [foundation-official] Staff and Contractors Digest: 31 January 2019 Edition" [puppet] - 10https://gerrit.wikimedia.org/r/488537 (https://phabricator.wikimedia.org/T214654) (owner: 10Dzahn) [20:38:01] so web request timeouts got much worse after rolling group0 to 1.33.0-wmf.16, however, I can't see any obvious culprit, no other corresponding error messages that seem to be a root cause [20:39:05] something just got a lot slower (the timeouts happen in apparently random places so the file:line of the exception isn't very helpful) [20:40:11] grr, I guess it's just hhvm cache invalidation slowness? seems the spike is subsiding. 
[20:40:44] !log LDAP - adding egardner to wmf - welcome Eric Gardner, software engineer in Audiences (T214654) [20:40:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:49] T214654: Add Eric Gardner to `wmf` LDAP group - https://phabricator.wikimedia.org/T214654 [20:41:29] It sucks when every deployment results in transient spikes in error logs because it becomes very difficult to decide whether a rollback is necessary [20:41:51] (03CR) 10Dzahn: "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/488537 (https://phabricator.wikimedia.org/T214654) (owner: 10Dzahn) [20:42:27] * twentyafterfour is tempted to filter (hide) the timeout errors from kibana [20:44:44] (03PS2) 10Dzahn: admins: add Hana Worku to ldap_only admins [puppet] - 10https://gerrit.wikimedia.org/r/488536 (https://phabricator.wikimedia.org/T215352) [20:46:28] (03CR) 10Dzahn: [C: 03+2] "new staff per "[foundation-optional] Please join me in a belated welcome for Hana Worku!"" [puppet] - 10https://gerrit.wikimedia.org/r/488536 (https://phabricator.wikimedia.org/T215352) (owner: 10Dzahn) [20:49:00] !log LDAP - adding h78na to wmf - welcome Hana Worku, developer on the multimedia team (T215352) [20:49:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:02] T215352: Add ha78na to `wmf` LDAP group. - https://phabricator.wikimedia.org/T215352 [20:59:54] 10Operations, 10Proton, 10Security-Team, 10Readers-Web-Backlog (Readers-Web-Kanbanana-Board-2018-19-Q3), 10Reading-Infrastructure-Team-Backlog (Kanban): [2 hrs] Decide on handling system updates for Proton - https://phabricator.wikimedia.org/T213366 (10pmiazga) @Tgr so far we were bumping Puppeteer versi... [21:00:04] cscott, arlolra, subbu, bearND, halfak, and Amir1: It is that lovely time of the day again! You are hereby commanded to deploy Services – Parsoid / Citoid / Mobileapps / ORES / …. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190206T2100). [21:01:04] 10Operations, 10ops-eqiad: rack/setup/install logstash101[012].eqiad.wmnet - https://phabricator.wikimedia.org/T214608 (10herron) Hey @Cmjohnson, sending a friendly ping to see how these builds are going. If there's anything I can do to assist remotely just let me know. [21:01:25] !log arlolra@deploy1001 Started deploy [parsoid/deploy@a4acfa6]: Updating Parsoid to fb67a71 [21:01:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:04] 10Operations, 10Proton, 10Reading-Infrastructure-Team-Backlog, 10Traffic, and 3 others: Document and possibly fine-tune how Proton interacts with Varnish - https://phabricator.wikimedia.org/T213371 (10pmiazga) @Tgr I assume you're still waiting for answers from @ema? Is there anything I can help you with? [21:04:04] !log krinkle@webperf1001 Kill xenon-log (pid 449). It seems its Redis TCP socket to mwlog1001 has been stuck since Dec 13, causing the process to indefinitely hang on listen()/socket.recv() [21:04:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:15] webperf1002 * [21:05:08] !log arlolra@deploy1001 Finished deploy [parsoid/deploy@a4acfa6]: Updating Parsoid to fb67a71 (duration: 03m 43s) [21:05:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:22] 10Operations, 10ops-eqiad: rack/setup/install logstash101[012].eqiad.wmnet - https://phabricator.wikimedia.org/T214608 (10Cmjohnson) hi @herron they are not going just yet. I will get to them next week. [21:09:20] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1073 - https://phabricator.wikimedia.org/T215050 (10Marostegui) @Cmjohnson you can proceed with the one on slot 6. The one on slot #1 finished correctly ` Enclosure Device ID: 32 Slot Number: 1 Drive's position: DiskGroup: 0, Span: 0, Arm: 1 Enclosure pos... 
[21:15:37] PROBLEM - Check systemd state on cp4026 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [21:16:01] PROBLEM - HTTPS Unified RSA on cp4026 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [21:16:09] PROBLEM - HTTPS Unified ECDSA on cp4026 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [21:16:17] PROBLEM - puppet last run on cp4026 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[nginx] [21:45:27] !log LDAP - adding brennen to wmf, releng, ciadmin - Welcome Brennen Bearnes, Software Engineer in Release Engineering (T215365 T214556) [21:45:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:45:35] T214556: Onboarding Brennen - https://phabricator.wikimedia.org/T214556 [21:45:35] T215365: LDAP requests for Brennen Bearnes: wmf, releng, ciadmin - https://phabricator.wikimedia.org/T215365 [21:46:49] everybody got hired at allhands or we just hire that much [21:53:13] we really just hire that much [21:54:35] wow. 
yea, then this is just because we didn't have clinic duty due to allhands [21:56:02] (03PS1) 10Dzahn: admins: add Brennen Bearnes to ldap_only admins [puppet] - 10https://gerrit.wikimedia.org/r/488587 (https://phabricator.wikimedia.org/T215365) [21:56:13] (03PS1) 10EBernhardson: Turn off wbsearchentities ab test in de, fr, es [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488588 (https://phabricator.wikimedia.org/T214515) [21:58:38] (03CR) 10Dzahn: [C: 03+2] "done per "[foundation-official] Staff and Contractors Digest: 31 January 2019 Edition" and welcome thread" [puppet] - 10https://gerrit.wikimedia.org/r/488587 (https://phabricator.wikimedia.org/T215365) (owner: 10Dzahn) [22:06:44] 10Operations, 10Cloud-Services, 10Patch-For-Review: rack/setup/install cloudcontrol2001-dev & cloudvirt200[123]-dev - https://phabricator.wikimedia.org/T214448 (10Papaul) [22:07:04] (03PS19) 10Cwhite: prometheus: upgrade to node-exporter 0.17 [puppet] - 10https://gerrit.wikimedia.org/r/486192 (https://phabricator.wikimedia.org/T213708) [22:12:33] 10Operations, 10SRE-Access-Requests, 10Discovery-Search (Current work), 10Patch-For-Review: Analytics query access for search platform NLP contractor @Julia.glen - https://phabricator.wikimedia.org/T214623 (10Dzahn) 05Open→03Stalled Thanks @Nuria! setting status to stalled to reflect that we should wait. 
[22:13:31] 10Operations, 10SRE-Access-Requests: Requesting access to researchers group for phuedx - https://phabricator.wikimedia.org/T214957 (10Dzahn) [22:13:46] 10Operations, 10SRE-Access-Requests: Requesting access to researchers group for phuedx - https://phabricator.wikimedia.org/T214957 (10Dzahn) p:05Triage→03Normal [22:14:48] 10Operations, 10Research, 10SRE-Access-Requests: Requesting access to researchers group for phuedx - https://phabricator.wikimedia.org/T214957 (10Dzahn) [22:15:11] (03PS1) 10Cwhite: hiera: install node exporter 0.17 in beta [puppet] - 10https://gerrit.wikimedia.org/r/488593 (https://phabricator.wikimedia.org/T213708) [22:15:34] (03PS1) 1020after4: group1 wikis to 1.33.0-wmf.16 refs T206670 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488594 [22:15:37] (03CR) 1020after4: [C: 03+2] group1 wikis to 1.33.0-wmf.16 refs T206670 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488594 (owner: 1020after4) [22:16:47] (03Merged) 10jenkins-bot: group1 wikis to 1.33.0-wmf.16 refs T206670 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488594 (owner: 1020after4) [22:18:02] (03PS1) 10Dzahn: admins: add phuedx to researchers [puppet] - 10https://gerrit.wikimedia.org/r/488595 (https://phabricator.wikimedia.org/T214957) [22:20:23] (03PS2) 10Dzahn: admins: add phuedx to researchers [puppet] - 10https://gerrit.wikimedia.org/r/488595 (https://phabricator.wikimedia.org/T214957) [22:20:57] PROBLEM - puppet last run on fermium is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [22:21:59] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.33.0-wmf.16 refs T206670 [22:22:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:22:22] T206670: 1.33.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T206670 [22:22:53] !log twentyafterfour@deploy1001 Synchronized php: group1 wikis to 1.33.0-wmf.16 refs T206670 (duration: 00m 53s) [22:22:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:24:01] 10Operations, 10SRE-Access-Requests: Requesting access to deployment, contint-admins, and contint-docker for Brennen Bearnes - https://phabricator.wikimedia.org/T215328 (10Dzahn) p:05Triage→03High [22:25:30] (03CR) 10jenkins-bot: group1 wikis to 1.33.0-wmf.16 refs T206670 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/488594 (owner: 1020after4) [22:32:21] (03CR) 10Dzahn: "looks good, just nitpick, agree to what Moritz already said, move to the end of the list please" [puppet] - 10https://gerrit.wikimedia.org/r/487040 (https://phabricator.wikimedia.org/T214922) (owner: 10Mathew.onipe) [22:34:26] (03CR) 10Nuria: [C: 03+1] admins: add phuedx to researchers [puppet] - 10https://gerrit.wikimedia.org/r/488595 (https://phabricator.wikimedia.org/T214957) (owner: 10Dzahn) [22:35:53] 10Operations, 10Proton, 10Reading-Infrastructure-Team-Backlog, 10Traffic, and 3 others: Document and possibly fine-tune how Proton interacts with Varnish - https://phabricator.wikimedia.org/T213371 (10Tgr) Yeah, but I don't think this task should be a blocker (for either handover or production switchover).... 
[22:36:07] 10Operations, 10Proton, 10Reading-Infrastructure-Team-Backlog, 10Traffic, and 3 others: Document and possibly fine-tune how Proton interacts with Varnish - https://phabricator.wikimedia.org/T213371 (10Tgr) [22:43:42] (03PS1) 10Herron: lists:warn if unknown host issues mail from cmd containing our domain [puppet] - 10https://gerrit.wikimedia.org/r/488602 (https://phabricator.wikimedia.org/T215251) [22:47:06] 10Operations, 10Cloud-Services, 10Patch-For-Review: rack/setup/install cloudcontrol2001-dev & cloudvirt200[123]-dev - https://phabricator.wikimedia.org/T214448 (10Papaul) second NIC configuration cloudvirt2001-dev ` Logical Vlan TAG MAC STP Logical Tagging... [22:47:35] 10Operations, 10Cloud-Services, 10Patch-For-Review: rack/setup/install cloudcontrol2001-dev & cloudvirt200[123]-dev - https://phabricator.wikimedia.org/T214448 (10Papaul) [22:50:02] 10Operations, 10Cloud-Services, 10Patch-For-Review: rack/setup/install cloudcontrol2001-dev & cloudvirt200[123]-dev - https://phabricator.wikimedia.org/T214448 (10Papaul) a:05Papaul→03aborrero @aborrero @Andrew all yours . Let me know if you have any questions. [22:51:59] RECOVERY - puppet last run on fermium is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [23:00:22] (03PS1) 10Dzahn: admins: create gpu-testers, add ebernhardson, root on stat1005 [puppet] - 10https://gerrit.wikimedia.org/r/488606 (https://phabricator.wikimedia.org/T215384) [23:03:38] 10Operations, 10Wikimedia-Mailing-lists, 10Patch-For-Review, 10User-herron: Ban recurrent spam to Wikimedia mailing lists (January 2019) - https://phabricator.wikimedia.org/T215251 (10herron) Progress! (I hope...) https://gerrit.wikimedia.org/r/488602 adds an acl to detect unknown/untrusted hosts who are a... 
[23:07:48] (03PS2) 10Herron: lists:warn if unknown host issues mail from cmd containing our domain [puppet] - 10https://gerrit.wikimedia.org/r/488602 (https://phabricator.wikimedia.org/T215251) [23:20:43] PROBLEM - Check systemd state on ms-be2013 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [23:36:32] /14/8 [23:47:52] (03CR) 10Alex Monk: certcentral: Implement staging time (032 comments) [software/certcentral] - 10https://gerrit.wikimedia.org/r/485594 (https://phabricator.wikimedia.org/T213737) (owner: 10Vgutierrez) [23:56:15] RECOVERY - Check systemd state on ms-be2013 is OK: OK - running: The system is fully operational [23:58:37] !log restarting icinga on icinga1001 to pick up new check command ? [23:58:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log