[00:58:27] (CR) GTirloni: [C: +1] "Change looks good to be but I have close to zero business knowledge of the impact, sorry :) I know we do data sanitization and exposing ip" [puppet] - https://gerrit.wikimedia.org/r/489576 (https://phabricator.wikimedia.org/T209819) (owner: BryanDavis)
[01:00:35] (CR) GTirloni: [C: +1] "I like this approach. Would it make sense to add a comment to lines that have the WMF-specific changes? It might be useful in the future w" [software/tools-webservice] - https://gerrit.wikimedia.org/r/489409 (https://phabricator.wikimedia.org/T178601) (owner: BryanDavis)
[01:02:20] (CR) GTirloni: [C: +1] "Not an ideal situation we find ourselves in but it's a reality. I'd be okay with this stopgap solution for now and we're updating Kubernet" [puppet] - https://gerrit.wikimedia.org/r/489291 (https://phabricator.wikimedia.org/T215586) (owner: Bstorm)
[01:02:40] (CR) GTirloni: [C: +1] Refactor and simplify python package [software/tools-manifest] - https://gerrit.wikimedia.org/r/486417 (https://phabricator.wikimedia.org/T107878) (owner: BryanDavis)
[02:11:36] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 16843952 and 0 seconds
[02:19:11] (PS1) Smalyshev: Enable WikibaseCirrusSearch on testwikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/489598 (https://phabricator.wikimedia.org/T215684)
[02:19:49] RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 7912 and 20 seconds
[02:44:17] Operations, ExternalGuidance, Traffic, Patch-For-Review: Deliver mobile-based version for automatic translations - https://phabricator.wikimedia.org/T212197 (Arrbee) >>! In T212197#4939326, @dr0ptp4kt wrote: > > @Arrbee would you please confirm the expected ship date for ExternalGuidance being...
[03:00:58] !log kartik@deploy1001 Started deploy [cxserver/deploy@ee4a15a]: Update cxserver to 8928852 (T213256)
[03:01:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:01:02] T213256: CX2: Outputs unnecessary self-closing nowiki and div tags inside ref tags - https://phabricator.wikimedia.org/T213256
[03:05:06] !log kartik@deploy1001 Finished deploy [cxserver/deploy@ee4a15a]: Update cxserver to 8928852 (T213256) (duration: 04m 08s)
[03:05:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:31:42] (CR) Andrew Bogott: "@moritz, I'm 95% confident that this doesn't present any additional security risks but please let me know if there's some reason to fear t" [puppet] - https://gerrit.wikimedia.org/r/489299 (https://phabricator.wikimedia.org/T215211) (owner: Andrew Bogott)
[03:34:54] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 18521872
[03:36:12] RECOVERY - Postgres Replication Lag on maps2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 6296
[04:10:25] !log tstarling@deploy1001 sync-file aborted: test-only undeployed change (duration: 00m 12s)
[04:10:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:18:00] !log tstarling@deploy1001 Synchronized php-1.33.0-wmf.16/extensions/NavigationTiming/tests/ext.navigationTiming.test.js: test-only undeployed change (duration: 00m 51s)
[04:18:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:19:02] !log tstarling@deploy1001 Synchronized php-1.33.0-wmf.16/extensions/WikimediaEvents/tests/phpunit/PageViewsTest.php: test-only undeployed change (duration: 00m 46s)
[04:19:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:19:55] !log tstarling@deploy1001 Synchronized php-1.33.0-wmf.16/extensions/AbuseFilter/maintenance/normalizeThrottleParameters.php: maintenance script update for new dry run (duration: 00m 47s)
[04:19:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:23:12] !log on mwmaint1002: running normalizeThrottleParameters.php --dry-run on all wikis (T209565)
[04:23:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:23:15] T209565: Dry run for normalizeThrottleParameters.php - https://phabricator.wikimedia.org/T209565
[04:25:06] (CR) BryanDavis: "> I like this approach. Would it make sense to add a comment to lines" [software/tools-webservice] - https://gerrit.wikimedia.org/r/489409 (https://phabricator.wikimedia.org/T178601) (owner: BryanDavis)
[05:22:53] Operations, CirrusSearch, serviceops, Discovery-Search (Current work), Patch-For-Review: Find an alternative to HHVM curl connection pooling for PHP 7 - https://phabricator.wikimedia.org/T210717 (Joe) @debt the work is not done - I still have to merge a change to add the proxy to the deployme...
[05:44:18] (CR) Reedy: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/489598 (https://phabricator.wikimedia.org/T215684) (owner: Smalyshev)
[05:56:51] (PS1) Giuseppe Lavagetto: tmpreaper: do not break systemd units using a PrivateTmp directory [puppet] - https://gerrit.wikimedia.org/r/489611
[06:06:51] (CR) Marostegui: [C: +2] dbstore.my.cnf: Remove the second skip-slave-start [puppet] - https://gerrit.wikimedia.org/r/489450 (https://phabricator.wikimedia.org/T213670) (owner: Marostegui)
[06:08:01] Operations, Gerrit, Icinga, monitoring, and 3 others: gerrit: Add a icinga check that uses the healthcheck endpoint - https://phabricator.wikimedia.org/T215457 (greg)
[06:08:53] (PS1) Tulsi Bhagat: Create 'extendedconfirmed' user group for viwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/489612 (https://phabricator.wikimedia.org/T215493)
[06:09:40] (CR) jerkins-bot: [V: -1] Create 'extendedconfirmed' user group for viwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/489612 (https://phabricator.wikimedia.org/T215493) (owner: Tulsi Bhagat)
[06:10:42] (PS1) Marostegui: db-eqiad.php: Depool db1079 [mediawiki-config] - https://gerrit.wikimedia.org/r/489613 (https://phabricator.wikimedia.org/T210713)
[06:12:29] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1079 [mediawiki-config] - https://gerrit.wikimedia.org/r/489613 (https://phabricator.wikimedia.org/T210713) (owner: Marostegui)
[06:13:30] (Merged) jenkins-bot: db-eqiad.php: Depool db1079 [mediawiki-config] - https://gerrit.wikimedia.org/r/489613 (https://phabricator.wikimedia.org/T210713) (owner: Marostegui)
[06:14:00] !log marostegui@deploy1001 sync-file aborted: Depool db0179 (duration: 00m 01s)
[06:14:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:14:51] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1079 (duration: 00m 48s)
[06:14:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:18:34] Operations, ops-eqiad, monitoring: icinga1001 crashed - https://phabricator.wikimedia.org/T214760 (Marostegui) Thanks for clarifying that @volans! Then probably failing over to another host is a good idea so we can debug icinga1001 without having service interruptions Thanks!
[06:22:59] (CR) Tulsi Bhagat: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/489612 (https://phabricator.wikimedia.org/T215493) (owner: Tulsi Bhagat)
[06:23:37] (CR) jenkins-bot: db-eqiad.php: Depool db1079 [mediawiki-config] - https://gerrit.wikimedia.org/r/489613 (https://phabricator.wikimedia.org/T210713) (owner: Marostegui)
[06:23:40] (CR) jerkins-bot: [V: -1] Create 'extendedconfirmed' user group for viwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/489612 (https://phabricator.wikimedia.org/T215493) (owner: Tulsi Bhagat)
[06:37:43] (PS1) Tulsi Bhagat: Add https://polona.pl/ to $wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/489614 (https://phabricator.wikimedia.org/T215501)
[06:47:48] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1079" [mediawiki-config] - https://gerrit.wikimedia.org/r/489615
[06:48:56] (CR) Marostegui: [C: +2] Revert "db-eqiad.php: Depool db1079" [mediawiki-config] - https://gerrit.wikimedia.org/r/489615 (owner: Marostegui)
[06:49:59] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1079" [mediawiki-config] - https://gerrit.wikimedia.org/r/489615 (owner: Marostegui)
[06:51:29] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1079 (duration: 00m 48s)
[06:51:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:52:32] (PS1) Marostegui: db-eqiad.php: Depool db1100 - mysql ugrade [mediawiki-config] - https://gerrit.wikimedia.org/r/489616
[06:56:34] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1079" [mediawiki-config] - https://gerrit.wikimedia.org/r/489615 (owner: Marostegui)
[06:56:57] Operations, Icinga, monitoring: Icinga passive checks go awal and downtime stops working - https://phabricator.wikimedia.org/T196336 (Marostegui) All the passive checks went awol just now. I tested a downtime to db1100 and it didn't work (either using icinga-downtime or the icinga web ui) While taili...
[07:00:27] !log Restart icinga on icinga1001 - checks went awol
[07:00:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:03:49] Operations, Icinga, monitoring: Icinga passive checks go awal and downtime stops working - https://phabricator.wikimedia.org/T196336 (Marostegui) I restarted icinga and they are recovering and downtimes are working again
[07:04:23] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1100 - mysql ugrade [mediawiki-config] - https://gerrit.wikimedia.org/r/489616 (owner: Marostegui)
[07:05:26] (Merged) jenkins-bot: db-eqiad.php: Depool db1100 - mysql ugrade [mediawiki-config] - https://gerrit.wikimedia.org/r/489616 (owner: Marostegui)
[07:06:29] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1100 for mysql upgrade (duration: 00m 47s)
[07:06:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:06:42] !log Upgrade MySQL on db1100
[07:06:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:08:02] (CR) jenkins-bot: db-eqiad.php: Depool db1100 - mysql ugrade [mediawiki-config] - https://gerrit.wikimedia.org/r/489616 (owner: Marostegui)
[07:08:43] (PS3) Elukey: Add analytics dbstore SRV records [dns] - https://gerrit.wikimedia.org/r/489170 (https://phabricator.wikimedia.org/T212386)
[07:10:06] (CR) Elukey: [C: +2] Add analytics dbstore SRV records [dns] - https://gerrit.wikimedia.org/r/489170 (https://phabricator.wikimedia.org/T212386) (owner: Elukey)
[07:12:59] (CR) Muehlenhoff: [C: +1] "Looks good to me" [puppet] - https://gerrit.wikimedia.org/r/489611 (owner: Giuseppe Lavagetto)
[07:14:00] (PS1) Marostegui: db-eqiad.php: Slowly repool db1100 [mediawiki-config] - https://gerrit.wikimedia.org/r/489619
[07:15:45] (CR) Marostegui: [C: +2] db-eqiad.php: Slowly repool db1100 [mediawiki-config] - https://gerrit.wikimedia.org/r/489619 (owner: Marostegui)
[07:16:55] (Merged) jenkins-bot: db-eqiad.php: Slowly repool db1100 [mediawiki-config] - https://gerrit.wikimedia.org/r/489619 (owner: Marostegui)
[07:17:57] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1100 with low weight (duration: 00m 46s)
[07:17:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:48] !log Stop all mysql instances on dbstore1004 for a reboot
[07:18:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:23] (CR) jenkins-bot: db-eqiad.php: Slowly repool db1100 [mediawiki-config] - https://gerrit.wikimedia.org/r/489619 (owner: Marostegui)
[07:23:19] (PS1) Marostegui: db-eqiad.php: Increase traffic for db1100 [mediawiki-config] - https://gerrit.wikimedia.org/r/489621
[07:25:35] (CR) Marostegui: [C: +2] db-eqiad.php: Increase traffic for db1100 [mediawiki-config] - https://gerrit.wikimedia.org/r/489621 (owner: Marostegui)
[07:26:39] (Merged) jenkins-bot: db-eqiad.php: Increase traffic for db1100 [mediawiki-config] - https://gerrit.wikimedia.org/r/489621 (owner: Marostegui)
[07:27:40] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Give api traffic to db1100 (duration: 00m 46s)
[07:27:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:12] (PS1) Marostegui: dbstore1004: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/489622 (https://phabricator.wikimedia.org/T210478)
[07:30:48] (CR) jenkins-bot: db-eqiad.php: Increase traffic for db1100 [mediawiki-config] - https://gerrit.wikimedia.org/r/489621 (owner: Marostegui)
[07:31:06] Operations, DNS, Domains, Traffic, HTTPS: Merge Wikipedia subdomains into one, to discourage censorship - https://phabricator.wikimedia.org/T215071 (Tgr) >>! In T215071#4941085, @Krenair wrote: > How exactly would certs from LetsEncrypt be a downgrade in security? I'm not an HTTPS expert but...
[07:36:11] (PS5) Mathew.onipe: elasticsearch_cluster: fix issues from test result [software/spicerack] - https://gerrit.wikimedia.org/r/486858 (https://phabricator.wikimedia.org/T207920)
[07:37:44] (PS1) Marostegui: db-eqiad.php: Fully repool db1100 [mediawiki-config] - https://gerrit.wikimedia.org/r/489624
[07:39:22] Operations, DNS, Domains, Traffic, HTTPS: Merge Wikipedia subdomains into one, to discourage censorship - https://phabricator.wikimedia.org/T215071 (Reedy) >>! In T215071#4942405, @Tgr wrote: >>>! In T215071#4941085, @Krenair wrote: >> How exactly would certs from LetsEncrypt be a downgrade i...
[07:39:55] !log Deploy schema change on s7 primary master (db1062) - T210713
[07:39:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:39:58] T210713: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713
[07:41:21] (PS1) Muehlenhoff: Drop requires_os checks for trusty and similar consitency checks for systemd [puppet] - https://gerrit.wikimedia.org/r/489625
[07:41:32] (CR) Marostegui: [C: +2] db-eqiad.php: Fully repool db1100 [mediawiki-config] - https://gerrit.wikimedia.org/r/489624 (owner: Marostegui)
[07:42:20] (CR) jerkins-bot: [V: -1] Drop requires_os checks for trusty and similar consitency checks for systemd [puppet] - https://gerrit.wikimedia.org/r/489625 (owner: Muehlenhoff)
[07:42:38] (Merged) jenkins-bot: db-eqiad.php: Fully repool db1100 [mediawiki-config] - https://gerrit.wikimedia.org/r/489624 (owner: Marostegui)
[07:42:50] (CR) jenkins-bot: db-eqiad.php: Fully repool db1100 [mediawiki-config] - https://gerrit.wikimedia.org/r/489624 (owner: Marostegui)
[07:43:46] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Fully repool db1100 (duration: 00m 46s)
[07:43:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:46:38] (CR) Mathew.onipe: elasticsearch_cluster: fix issues from test result (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/486858 (https://phabricator.wikimedia.org/T207920) (owner: Mathew.onipe)
[07:46:54] (PS2) Muehlenhoff: Drop requires_os checks for trusty [puppet] - https://gerrit.wikimedia.org/r/489625
[07:46:58] (PS1) KartikMistry: WIP: Add ExternalGuidance extension [mediawiki-config] - https://gerrit.wikimedia.org/r/489627 (https://phabricator.wikimedia.org/T213076)
[07:48:58] (PS1) Mathew.onipe: add temp lvs enabled option [cookbooks] - https://gerrit.wikimedia.org/r/489628 (https://phabricator.wikimedia.org/T207920)
[07:55:57] (PS3) Muehlenhoff: contint::packages::php: Remove support for trusty [puppet] - https://gerrit.wikimedia.org/r/488005
[07:58:36] (CR) Muehlenhoff: [C: +2] contint::packages::php: Remove support for trusty [puppet] - https://gerrit.wikimedia.org/r/488005 (owner: Muehlenhoff)
[08:05:13] Operations, DNS, Domains, Traffic, HTTPS: Merge Wikipedia subdomains into one, to discourage censorship - https://phabricator.wikimedia.org/T215071 (Krenair) >>! In T215071#4942408, @Reedy wrote: >>>! In T215071#4942405, @Tgr wrote: >>>>! In T215071#4941085, @Krenair wrote: >>> How exactly wo...
[08:07:57] !log Deploy schema change on s8 codfw master (db2045) - this will generate lag on codfw T210713
[08:08:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:08:00] T210713: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713
[08:08:20] (PS1) Muehlenhoff: postgresql: Remove support for trusty [puppet] - https://gerrit.wikimedia.org/r/489631
[08:08:55] (CR) Marostegui: [C: +2] dbstore1004: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/489622 (https://phabricator.wikimedia.org/T210478) (owner: Marostegui)
[08:09:02] (PS2) Marostegui: dbstore1004: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/489622 (https://phabricator.wikimedia.org/T210478)
[08:14:06] Operations, ops-eqiad: mw1299 is down (jobrunner-canary, now up but depooled) - https://phabricator.wikimedia.org/T215569 (MoritzMuehlenhoff) a:RobH→Cmjohnson
[08:17:16] !log removed cloudcontrol2001-dev.codfw.wmnet from debmonitor (actual hostname in use is cloudcontrol2001-dev.wikimedia.org)
[08:17:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:59] (PS5) Elukey: superset: add ability for superset to connect to new staging DB [puppet] - https://gerrit.wikimedia.org/r/489338 (https://phabricator.wikimedia.org/T215680) (owner: Nuria)
[08:29:39] (CR) Elukey: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/14590/analytics-tool1003.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/489338 (https://phabricator.wikimedia.org/T215680) (owner: Nuria)
[08:29:46] (PS6) Elukey: superset: add ability for superset to connect to new staging DB [puppet] - https://gerrit.wikimedia.org/r/489338 (https://phabricator.wikimedia.org/T215680) (owner: Nuria)
[08:29:51] (CR) Elukey: [V: +2 C: +2] superset: add ability for superset to connect to new staging DB [puppet] - https://gerrit.wikimedia.org/r/489338 (https://phabricator.wikimedia.org/T215680) (owner: Nuria)
[08:35:49] (PS1) Marostegui: dbstore analytics: Move staging database [puppet] - https://gerrit.wikimedia.org/r/489633 (https://phabricator.wikimedia.org/T210478)
[08:36:21] (CR) jerkins-bot: [V: -1] dbstore analytics: Move staging database [puppet] - https://gerrit.wikimedia.org/r/489633 (https://phabricator.wikimedia.org/T210478) (owner: Marostegui)
[08:37:14] (PS2) Marostegui: dbstore analytics: Move staging database [puppet] - https://gerrit.wikimedia.org/r/489633 (https://phabricator.wikimedia.org/T210478)
[08:37:44] (CR) jerkins-bot: [V: -1] dbstore analytics: Move staging database [puppet] - https://gerrit.wikimedia.org/r/489633 (https://phabricator.wikimedia.org/T210478) (owner: Marostegui)
[08:38:15] (PS3) Marostegui: dbstore analytics: Move staging database [puppet] - https://gerrit.wikimedia.org/r/489633 (https://phabricator.wikimedia.org/T210478)
[08:51:35] (PS1) Muehlenhoff: Only install mcelog on jessie and stretch [puppet] - https://gerrit.wikimedia.org/r/489635 (https://phabricator.wikimedia.org/T205396)
[08:53:25] (CR) Elukey: [C: +1] "Fine for me, I'll update DNS accordingly afterwards :)" [puppet] - https://gerrit.wikimedia.org/r/489633 (https://phabricator.wikimedia.org/T210478) (owner: Marostegui)
[08:54:57] (CR) Elukey: [C: +1] "Everything looks good, but I can +2 (without checking) only the turnilo/superset parts :)" [puppet] - https://gerrit.wikimedia.org/r/489625 (owner: Muehlenhoff)
[09:08:11] (CR) Marostegui: [C: +2] dbstore analytics: Move staging database [puppet] - https://gerrit.wikimedia.org/r/489633 (https://phabricator.wikimedia.org/T210478) (owner: Marostegui)
[09:11:17] (PS1) Marostegui: wmnet: Move staging db from dbstore1003 to dbstore1005 [dns] - https://gerrit.wikimedia.org/r/489636 (https://phabricator.wikimedia.org/T210478)
[09:11:56] !log Stop all mysql instances on dbstore1003 for reboot
[09:11:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:18:00] (CR) Gehel: [C: -1] "see comments inline" (4 comments) [cookbooks] - https://gerrit.wikimedia.org/r/489628 (https://phabricator.wikimedia.org/T207920) (owner: Mathew.onipe)
[09:22:19] (PS1) Muehlenhoff: Install facter 2.4.6 on buster in early d-i stage [puppet] - https://gerrit.wikimedia.org/r/489638 (https://phabricator.wikimedia.org/T213546)
[09:22:23] (PS2) Mathew.onipe: add temporary lvs enabled option [cookbooks] - https://gerrit.wikimedia.org/r/489628 (https://phabricator.wikimedia.org/T207920)
[09:23:51] (CR) Mathew.onipe: add temporary lvs enabled option (4 comments) [cookbooks] - https://gerrit.wikimedia.org/r/489628 (https://phabricator.wikimedia.org/T207920) (owner: Mathew.onipe)
[09:24:33] Operations, Analytics, Analytics-Cluster, Analytics-Kanban, and 2 others: Cleanup or remove mysql puppet module; repurpose mariadb module to cover misc use cases - https://phabricator.wikimedia.org/T162070 (jcrespo) > Second, mariadb::packages_wmf and mariadb::packages should probably be merged i...
[09:25:54] (PS3) Mathew.onipe: add temporary lvs enabled option [cookbooks] - https://gerrit.wikimedia.org/r/489628 (https://phabricator.wikimedia.org/T207920)
[09:37:22] (PS1) Muehlenhoff: maps/osm: Stop supporting trusty [puppet] - https://gerrit.wikimedia.org/r/489641
[09:37:31] (PS2) Giuseppe Lavagetto: tmpreaper: do not break systemd units using a PrivateTmp directory [puppet] - https://gerrit.wikimedia.org/r/489611
[09:38:50] !log Stop all mysql instances on dbstore1005 for reboot
[09:38:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:43:38] Operations, ExternalGuidance, Traffic, Patch-For-Review: Deliver mobile-based version for automatic translations - https://phabricator.wikimedia.org/T212197 (Pginer-WMF) >>! In T212197#4941262, @dr0ptp4kt wrote: > I'll need to check with @Pginer-WMF about the mental model and nearer term (next si...
[09:45:20] (CR) Elukey: [C: +2] wmnet: Move staging db from dbstore1003 to dbstore1005 [dns] - https://gerrit.wikimedia.org/r/489636 (https://phabricator.wikimedia.org/T210478) (owner: Marostegui)
[09:54:32] (PS1) Marostegui: mariadb: Change staging DB to 3350 [puppet] - https://gerrit.wikimedia.org/r/489644 (https://phabricator.wikimedia.org/T210478)
[09:54:54] (PS1) Marostegui: wmnet: Change staging db port [dns] - https://gerrit.wikimedia.org/r/489645 (https://phabricator.wikimedia.org/T210478)
[09:55:19] elukey: I need to change the port again ^
[09:55:43] There is a conflict when sharing a host with x1, that we (I) didn't think about :)
[09:55:53] ah snap
[09:55:54] okok
[09:55:57] +1
[09:56:01] take a look at the puppet change
[09:56:06] as it touches one of your files too
[09:56:23] ah I was about to change the superset thing!
[09:56:25] awesome
[09:56:26] (CR) Marostegui: [C: +2] wmnet: Change staging db port [dns] - https://gerrit.wikimedia.org/r/489645 (https://phabricator.wikimedia.org/T210478) (owner: Marostegui)
[09:56:44] (CR) Elukey: [C: +1] mariadb: Change staging DB to 3350 [puppet] - https://gerrit.wikimedia.org/r/489644 (https://phabricator.wikimedia.org/T210478) (owner: Marostegui)
[09:57:12] (CR) Marostegui: [C: +2] mariadb: Change staging DB to 3350 [puppet] - https://gerrit.wikimedia.org/r/489644 (https://phabricator.wikimedia.org/T210478) (owner: Marostegui)
[10:01:18] (Abandoned) Mathew.onipe: icinga: remove check_elasticsearch_shard command [puppet] - https://gerrit.wikimedia.org/r/488511 (owner: Mathew.onipe)
[10:08:05] (CR) Fsero: [C: -1] Helm chart for eventgate-analytics deployment (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/483035 (https://phabricator.wikimedia.org/T211247) (owner: Ottomata)
[10:09:47] (PS2) Jcrespo: mariadb: Enable notifications for db1118 [puppet] - https://gerrit.wikimedia.org/r/489280 (https://phabricator.wikimedia.org/T214720)
[10:09:59] (CR) DCausse: [C: +1] add temporary lvs enabled option [cookbooks] - https://gerrit.wikimedia.org/r/489628 (https://phabricator.wikimedia.org/T207920) (owner: Mathew.onipe)
[10:11:09] (CR) Jcrespo: [C: +2] mariadb: Enable notifications for db1118 [puppet] - https://gerrit.wikimedia.org/r/489280 (https://phabricator.wikimedia.org/T214720) (owner: Jcrespo)
[10:12:04] (PS3) Giuseppe Lavagetto: tmpreaper: do not break systemd units using a PrivateTmp directory [puppet] - https://gerrit.wikimedia.org/r/489611
[10:12:21] (CR) Giuseppe Lavagetto: [C: +2] tmpreaper: do not break systemd units using a PrivateTmp directory [puppet] - https://gerrit.wikimedia.org/r/489611 (owner: Giuseppe Lavagetto)
[10:12:30] (CR) Mathew.onipe: [C: +1] maps/osm: Stop supporting trusty [puppet] - https://gerrit.wikimedia.org/r/489641 (owner: Muehlenhoff)
[10:14:11] (PS1) Jcrespo: install_server: Remove db1118 from the list of automatic reimage hosts [puppet] - https://gerrit.wikimedia.org/r/489647 (https://phabricator.wikimedia.org/T214720)
[10:16:19] <_joe_> what's up with CI?
[10:17:23] Operations, ops-eqiad: rack/setup/install logstash101[012].eqiad.wmnet - https://phabricator.wikimedia.org/T214608 (fgiunchedi) Thanks @Cmjohnson ! Please treat this as priority this week since we're running short on disk space on existing logstash eqiad hosts.
[10:17:26] !log restart db1114
[10:17:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:17:34] !log mvolz@deploy1001 scap-helm zotero upgrade staging -f zotero-values-staging.yaml --version=0.0.1 stable/zotero [namespace: zotero, clusters: staging]
[10:17:35] !log mvolz@deploy1001 scap-helm zotero cluster staging completed
[10:17:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:17:35] !log mvolz@deploy1001 scap-helm zotero finished
[10:17:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:17:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:17:42] _joe_: what do you see wrong?
[10:18:11] is it slow or something else?
[10:18:47] (PS2) Jcrespo: install_server: Remove db1118 from the list of automatic reimage hosts [puppet] - https://gerrit.wikimedia.org/r/489647 (https://phabricator.wikimedia.org/T214720)
[10:19:12] !log Add dbstore1005:3350 to tendril and zarcillo - T210478
[10:19:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:19:15] T210478: Migrate dbstore1002 to a multi instance setup on dbstore100[3-5] - https://phabricator.wikimedia.org/T210478
[10:19:39] (CR) Jcrespo: "Manuel: Please give it a careful look" [mediawiki-config] - https://gerrit.wikimedia.org/r/489281 (https://phabricator.wikimedia.org/T214720) (owner: Jcrespo)
[10:22:17] (PS8) Giuseppe Lavagetto: role::beta: introduce docker_services [puppet] - https://gerrit.wikimedia.org/r/478637
[10:24:45] (CR) Marostegui: [C: -1] "The most important...IPs are correct." (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/489281 (https://phabricator.wikimedia.org/T214720) (owner: Jcrespo)
[10:24:51] !log mvolz@deploy1001 scap-helm zotero upgrade production -f zotero-values-eqiad.yaml stable/zotero [namespace: zotero, clusters: eqiad]
[10:24:52] !log mvolz@deploy1001 scap-helm zotero cluster eqiad completed
[10:24:52] !log mvolz@deploy1001 scap-helm zotero finished
[10:24:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:24:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:24:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:26:18] (PS3) Elukey: role::analytics_test_cluster::coordinator: add basic camus support [puppet] - https://gerrit.wikimedia.org/r/489243 (https://phabricator.wikimedia.org/T212259)
[10:27:47] !log mvolz@deploy1001 scap-helm zotero upgrade production -f zotero-values-codfw.yaml stable/zotero [namespace: zotero, clusters: codfw]
[10:27:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:27:48] !log mvolz@deploy1001 scap-helm zotero cluster codfw completed
[10:27:48] !log mvolz@deploy1001 scap-helm zotero finished
[10:27:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:27:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:29:11] (CR) Jcrespo: mariadb: Introduce and pool db1118 with low weight (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/489281 (https://phabricator.wikimedia.org/T214720) (owner: Jcrespo)
[10:29:55] (PS2) Jcrespo: mariadb: Introduce and pool db1118 with low weight [mediawiki-config] - https://gerrit.wikimedia.org/r/489281 (https://phabricator.wikimedia.org/T214720)
[10:30:05] jan_drewniak: Dear deployers, time to do the Wikimedia Portals Update deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190211T1030).
[10:30:19] (PS1) Muehlenhoff: Remove access for ISI researchers [puppet] - https://gerrit.wikimedia.org/r/489648
[10:30:21] (PS1) Jdrewniak: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/489649 (https://phabricator.wikimedia.org/T128546)
[10:32:05] (CR) Jcrespo: [C: +2] install_server: Remove db1118 from the list of automatic reimage hosts [puppet] - https://gerrit.wikimedia.org/r/489647 (https://phabricator.wikimedia.org/T214720) (owner: Jcrespo)
[10:35:01] (CR) Marostegui: [C: +1] "Minor typo on the commit message "remoce" feel free to ignore" [mediawiki-config] - https://gerrit.wikimedia.org/r/489281 (https://phabricator.wikimedia.org/T214720) (owner: Jcrespo)
[10:36:25] (PS3) Jcrespo: mariadb: Introduce and pool db1118 with low weight [mediawiki-config] - https://gerrit.wikimedia.org/r/489281 (https://phabricator.wikimedia.org/T214720)
[10:39:33] (CR) Giuseppe Lavagetto: [C: +2] role::beta: introduce docker_services [puppet] - https://gerrit.wikimedia.org/r/478637 (owner: Giuseppe Lavagetto)
[10:39:55] (PS9) Giuseppe Lavagetto: role::beta: introduce docker_services [puppet] - https://gerrit.wikimedia.org/r/478637
[10:40:04] (CR) Gehel: [C: +2] add temporary lvs enabled option [cookbooks] - https://gerrit.wikimedia.org/r/489628 (https://phabricator.wikimedia.org/T207920) (owner: Mathew.onipe)
[10:40:13] (CR) Giuseppe Lavagetto: [V: +2 C: +2] role::beta: introduce docker_services [puppet] - https://gerrit.wikimedia.org/r/478637 (owner: Giuseppe Lavagetto)
[10:40:20] (CR) Jdrewniak: [C: +2] Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/489649 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[10:41:05] !log upgrading mariadb client on cumin* hosts
[10:41:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:41:34] (Merged) jenkins-bot: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/489649 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[10:42:16] (CR) Gehel: [C: +2] elasticsearch_cluster: fix issues from test result [software/spicerack] - https://gerrit.wikimedia.org/r/486858 (https://phabricator.wikimedia.org/T207920) (owner: Mathew.onipe)
[10:43:31] (CR) jenkins-bot: elasticsearch_cluster: fix issues from test result [software/spicerack] - https://gerrit.wikimedia.org/r/486858 (https://phabricator.wikimedia.org/T207920) (owner: Mathew.onipe)
[10:44:33] (CR) jenkins-bot: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/489649 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[10:45:16] Operations, Wikimedia-General-or-Unknown, serviceops, PHP 7.2 support, User-jijiki: mwscript dies on mwmaint with PHP=php7.2 due to php-redis missing - https://phabricator.wikimedia.org/T215376 (Joe) Open→Resolved
[10:45:22] Operations, Core Platform Team (PHP7 (TEC4)), Core Platform Team Kanban (Doing), HHVM, and 3 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (Joe)
[10:45:39] (CR) Gehel: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/489641 (owner: Muehlenhoff)
[10:46:09] (CR) Gehel: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/489252 (owner: Muehlenhoff)
[10:46:32] !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:489649| Bumping portals to master (T128546)]] (duration: 00m 48s)
[10:46:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:46:34] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546
[10:47:19] !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:489649| Bumping portals to master (T128546)]] (duration: 00m 46s)
[10:47:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:49:45] Operations, cloud-services-team, monitoring, Patch-For-Review, User-fgiunchedi: Upgrade Prometheus to 2.7 in deployment-prep and tools - https://phabricator.wikimedia.org/T215272 (fgiunchedi)
[10:50:50] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to production for dsharpe - https://phabricator.wikimedia.org/T214130 (Joe) I will assume you can successfully access and just resolve the ticket. Please reopen it if any issue happens.
[10:51:48] (CR) Jcrespo: [C: +1] Don't install pxz on buster [puppet] - https://gerrit.wikimedia.org/r/489252 (owner: Muehlenhoff)
[10:54:00] (CR) Jbond: [C: +2] "small nitpick otherwise looks good" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/489237 (owner: Muehlenhoff)
[10:54:04] (CR) Elukey: [C: +1] Remove access for ISI researchers [puppet] - https://gerrit.wikimedia.org/r/489648 (owner: Muehlenhoff)
[10:54:20] Operations, monitoring, Patch-For-Review: Serve >= 50% of production Prometheus systems with Prometheus v2 - https://phabricator.wikimedia.org/T187987 (fgiunchedi)
[10:54:23] Operations, cloud-services-team, monitoring, Patch-For-Review, User-fgiunchedi: Upgrade Prometheus to 2.7 in deployment-prep and tools - https://phabricator.wikimedia.org/T215272 (fgiunchedi) Open→Resolved a:fgiunchedi Tools and deployment-prep are running Prometheus 2.7.1 rebuilt...
[10:54:45] (03PS1) 10Jcrespo: mariadb-package: Upgrade to 10.1.38, add mariabackup to path [software] - 10https://gerrit.wikimedia.org/r/489657 (https://phabricator.wikimedia.org/T210292) [10:55:29] (03CR) 10jerkins-bot: [V: 04-1] mariadb-package: Upgrade to 10.1.38, add mariabackup to path [software] - 10https://gerrit.wikimedia.org/r/489657 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo) [10:56:36] (03CR) 10Jcrespo: [V: 03+1] mariadb-package: Upgrade to 10.1.38, add mariabackup to path [software] - 10https://gerrit.wikimedia.org/r/489657 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo) [10:57:06] (03CR) 10Jbond: [C: 03+2] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/489635 (https://phabricator.wikimedia.org/T205396) (owner: 10Muehlenhoff) [11:00:46] (03CR) 10Muehlenhoff: Only enable backports up to stretch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/489237 (owner: 10Muehlenhoff) [11:04:17] (03CR) 10Jbond: [C: 03+1] "lgtm, could also add the following but its probably overkill" [puppet] - 10https://gerrit.wikimedia.org/r/489638 (https://phabricator.wikimedia.org/T213546) (owner: 10Muehlenhoff) [11:09:29] (03CR) 10Arturo Borrero Gonzalez: "I would prefer if you don't include the d/changelog update in this same commit. 
That makes easier to work with this patch itself, like che" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/489409 (https://phabricator.wikimedia.org/T178601) (owner: 10BryanDavis) [11:11:02] (03PS5) 10Mathew.onipe: admin: create new system groups for cloudelastic nodes [puppet] - 10https://gerrit.wikimedia.org/r/487040 (https://phabricator.wikimedia.org/T214922) [11:15:15] (03CR) 10Filippo Giunchedi: "LGTM, modulo John's comments" [puppet] - 10https://gerrit.wikimedia.org/r/489237 (owner: 10Muehlenhoff) [11:15:41] (03PS6) 10Gehel: admin: create new system groups for cloudelastic nodes [puppet] - 10https://gerrit.wikimedia.org/r/487040 (https://phabricator.wikimedia.org/T214922) (owner: 10Mathew.onipe) [11:16:34] (03CR) 10Filippo Giunchedi: [C: 03+1] Drop requires_os checks for trusty [puppet] - 10https://gerrit.wikimedia.org/r/489625 (owner: 10Muehlenhoff) [11:19:15] (03PS3) 10Mathew.onipe: icinga: enable check for psi and omega clusters [puppet] - 10https://gerrit.wikimedia.org/r/489154 (https://phabricator.wikimedia.org/T212850) [11:31:47] (03CR) 10Alexandros Kosiaris: [C: 04-1] Helm chart for eventgate-analytics deployment (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/483035 (https://phabricator.wikimedia.org/T211247) (owner: 10Ottomata) [11:31:55] (03PS2) 10KartikMistry: WIP: Add ExternalGuidance extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489627 (https://phabricator.wikimedia.org/T213076) [11:33:42] (03PS1) 10Elukey: Add common statistics repositories to stat/notebook hosts [puppet] - 10https://gerrit.wikimedia.org/r/489660 (https://phabricator.wikimedia.org/T212386) [11:34:07] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] "I would create a separate profile with this config. 
Something explicit, like: profile::wmcs::root_tty and then include it in whatever role" [puppet] - 10https://gerrit.wikimedia.org/r/489299 (https://phabricator.wikimedia.org/T215211) (owner: 10Andrew Bogott) [11:34:46] (03CR) 10Filippo Giunchedi: [C: 04-1] "See inline, LGTM overall" (033 comments) [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/489022 (https://phabricator.wikimedia.org/T211661) (owner: 10Gilles) [11:40:20] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/14592/" [puppet] - 10https://gerrit.wikimedia.org/r/489660 (https://phabricator.wikimedia.org/T212386) (owner: 10Elukey) [11:50:38] (03PS2) 10Muehlenhoff: Only enable backports up to stretch [puppet] - 10https://gerrit.wikimedia.org/r/489237 [11:50:40] (03PS2) 10Muehlenhoff: Remove access for ISI researchers [puppet] - 10https://gerrit.wikimedia.org/r/489648 [11:56:13] (03CR) 10Muehlenhoff: [C: 03+2] Remove access for ISI researchers [puppet] - 10https://gerrit.wikimedia.org/r/489648 (owner: 10Muehlenhoff) [11:57:47] (03CR) 10Joal: [C: 03+1] "Thanks elukey :)" [puppet] - 10https://gerrit.wikimedia.org/r/489660 (https://phabricator.wikimedia.org/T212386) (owner: 10Elukey) [12:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Your horoscope predicts another unfortunate European Mid-day SWAT(Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190211T1200). [12:00:04] Urbanecm: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:03:15] Urbanecm: around for swat? [12:07:14] zeljkof, sorry, I'm late [12:08:41] Urbanecm: no problemo, but well, there's nothing for you to do anyway, right? 
[12:08:48] I'll just deploy the patches and let you knwo [12:08:50] know [12:08:53] ok, good [12:08:59] (03CR) 10Fsero: Helm chart for eventgate-analytics deployment (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/483035 (https://phabricator.wikimedia.org/T211247) (owner: 10Ottomata) [12:09:06] * Urbanecm always keep forgotting what he scheduled for SWAT :D [12:09:10] 10Operations, 10ExternalGuidance, 10Traffic, 10Patch-For-Review: Deliver mobile-based version for automatic translations - https://phabricator.wikimedia.org/T212197 (10dr0ptp4kt) Thanks @Arrbee. Thanks @Pginer-WMF. @santhosh and @Gilles the footer list containing the "Desktop" link and other list items pl... [12:09:47] (03CR) 10Zfilipin: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489398 (owner: 10Urbanecm) [12:09:51] PROBLEM - puppet last run on stat1007 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[enforce-users-groups-cleanup] [12:11:03] (03Merged) 10jenkins-bot: Clean expired throttle rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489398 (owner: 10Urbanecm) [12:13:50] 10Operations, 10ExternalGuidance, 10Traffic, 10Patch-For-Review: Deliver mobile-based version for automatic translations - https://phabricator.wikimedia.org/T212197 (10dr0ptp4kt) Heads up @chelsyx: for simplewiki access via the Google Translate proxy the traffic pattern is now mobile web based even for des... 
[12:14:18] (03CR) 10Jcrespo: [C: 03+2] mariadb: Introduce and pool db1118 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489281 (https://phabricator.wikimedia.org/T214720) (owner: 10Jcrespo) [12:14:55] (03PS4) 10Jcrespo: mariadb: Introduce and pool db1118 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489281 (https://phabricator.wikimedia.org/T214720) [12:14:57] !log zfilipin@deploy1001 Synchronized wmf-config/throttle.php: SWAT: [[gerrit:489398|Clean expired throttle rules]] (duration: 00m 48s) [12:14:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:04] (03CR) 10Zfilipin: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489399 (https://phabricator.wikimedia.org/T215618) (owner: 10Urbanecm) [12:15:16] (03CR) 10jerkins-bot: [V: 04-1] New throttle rule for Senior Citizens Write Wikipedia course [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489399 (https://phabricator.wikimedia.org/T215618) (owner: 10Urbanecm) [12:15:19] Urbanecm: the first patch deployed [12:15:22] thanks [12:15:26] (03CR) 10Zfilipin: New throttle rule for Senior Citizens Write Wikipedia course [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489399 (https://phabricator.wikimedia.org/T215618) (owner: 10Urbanecm) [12:15:28] (03CR) 10Jcrespo: [C: 04-1] "Not yet" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489281 (https://phabricator.wikimedia.org/T214720) (owner: 10Jcrespo) [12:15:40] (03PS3) 10Zfilipin: New throttle rule for Senior Citizens Write Wikipedia course [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489399 (https://phabricator.wikimedia.org/T215618) (owner: 10Urbanecm) [12:15:47] (03CR) 10jenkins-bot: Clean expired throttle rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489398 (owner: 10Urbanecm) [12:15:49] PROBLEM - SSH on labservices1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:15:56] (03CR) 10Zfilipin: [C: 03+2] "SWAT" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/489399 (https://phabricator.wikimedia.org/T215618) (owner: 10Urbanecm) [12:17:02] (03Merged) 10jenkins-bot: New throttle rule for Senior Citizens Write Wikipedia course [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489399 (https://phabricator.wikimedia.org/T215618) (owner: 10Urbanecm) [12:18:22] 10Operations, 10Release-Engineering-Team, 10Category, 10Core Platform Team Backlog (Watching / External), and 2 others: FY2017/18 Program 6: Streamlined Service delivery - https://phabricator.wikimedia.org/T170453 (10akosiaris) [12:18:24] 10Operations, 10Goal, 10Kubernetes: Operations Q1 goal: Streamlined Service Delivery - https://phabricator.wikimedia.org/T170108 (10akosiaris) 05Open→03Resolved a:03akosiaris Seems like we forgot to close this one [12:18:27] 10Operations, 10Goal, 10Kubernetes: Operations Q1 goal: Streamlined Service Delivery - https://phabricator.wikimedia.org/T170108 (10akosiaris) [12:18:35] !log zfilipin@deploy1001 Synchronized wmf-config/throttle.php: SWAT: [[gerrit:489399|New throttle rule for Senior Citizens Write Wikipedia course (T215618)]] (duration: 00m 48s) [12:18:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:39] T215618: Throttle rule for 2019-02-13 - Senior Citizens Write Wikipedia course - https://phabricator.wikimedia.org/T215618 [12:19:04] 10Operations, 10Certcentral, 10Traffic: certcentral fails to renew certificates - https://phabricator.wikimedia.org/T215783 (10Vgutierrez) [12:19:07] Urbanecm: the second patch deployed, thanks for deploying with #releng ;) [12:19:10] yw [12:19:14] !log EU SWAT finished [12:19:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:23] 10Operations, 10Certcentral, 10Traffic: certcentral fails to renew certificates - https://phabricator.wikimedia.org/T215783 (10Vgutierrez) p:05Triage→03High [12:20:10] RECOVERY - puppet last run on stat1007 is OK: OK: Puppet is currently 
enabled, last run 3 minutes ago with 0 failures [12:20:10] PROBLEM - rsyslog TLS listener on port 6514 on wezen is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer [12:20:46] that's me ^ [12:20:54] RECOVERY - SSH on labservices1001 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.11 (protocol 2.0) [12:21:03] !log bounce rsyslogd on lithium / wezen, syslog tls listener stuck [12:21:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:32] RECOVERY - rsyslog TLS listener on port 6514 on wezen is OK: SSL OK - Certificate wezen.codfw.wmnet valid until 2021-08-21 20:09:05 +0000 (expires in 922 days) [12:24:16] (03CR) 10Filippo Giunchedi: [C: 03+1] Only enable backports up to stretch [puppet] - 10https://gerrit.wikimedia.org/r/489237 (owner: 10Muehlenhoff) [12:24:49] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/489237 (owner: 10Muehlenhoff) [12:25:32] (03PS1) 10Vgutierrez: certcentral: Fix validation_dns_servers key name [puppet] - 10https://gerrit.wikimedia.org/r/489667 (https://phabricator.wikimedia.org/T215783) [12:27:11] (03CR) 10jenkins-bot: New throttle rule for Senior Citizens Write Wikipedia course [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489399 (https://phabricator.wikimedia.org/T215618) (owner: 10Urbanecm) [12:28:18] (03CR) 10Vgutierrez: [C: 03+2] "pcc looks happy and shows the expected change in the certcentral config file: https://puppet-compiler.wmflabs.org/compiler1002/14594/" [puppet] - 10https://gerrit.wikimedia.org/r/489667 (https://phabricator.wikimedia.org/T215783) (owner: 10Vgutierrez) [12:34:31] (03PS1) 10Jcrespo: mariadb: Depool db1106 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489669 [12:42:28] (03CR) 10MarcoAurelio: Create 'extendedconfirmed' user group for viwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489612 (https://phabricator.wikimedia.org/T215493) (owner: 10Tulsi Bhagat) 
[12:43:21] (03CR) 10MarcoAurelio: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486221 (https://phabricator.wikimedia.org/T214553) (owner: 10Ammarpad) [12:43:42] (03PS3) 10Muehlenhoff: Only enable backports up to stretch [puppet] - 10https://gerrit.wikimedia.org/r/489237 [12:46:02] (03CR) 10Muehlenhoff: [C: 03+2] Only enable backports up to stretch [puppet] - 10https://gerrit.wikimedia.org/r/489237 (owner: 10Muehlenhoff) [12:46:10] 10Operations, 10Certcentral, 10Traffic, 10Patch-For-Review: certcentral fails to renew certificates - https://phabricator.wikimedia.org/T215783 (10Vgutierrez) 05Open→03Resolved As soon as https://gerrit.wikimedia.org/r/489164 had been merged, the certificates has been renewed as expected: ` Feb 11 12:... [12:55:55] 10Operations, 10monitoring: Expose linux kernel firewall and connections statistics - https://phabricator.wikimedia.org/T215277 (10jbond) 05Open→03Stalled [12:58:01] 10Operations, 10Advanced Mobile Contributions, 10Traffic, 10User-Joe: AMC – Opt-in for logged out users - https://phabricator.wikimedia.org/T215624 (10phuedx) @Joe: Yes. That's correct. Do note that this is the current behaviour for the beta mode of the mobile site (visit https://m.mediawiki.org/wiki/Speci... [13:06:36] (03PS2) 10Tulsi Bhagat: Create 'extendedconfirmed' user group for viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489612 (https://phabricator.wikimedia.org/T215493) [13:07:21] (03CR) 10jerkins-bot: [V: 04-1] Create 'extendedconfirmed' user group for viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489612 (https://phabricator.wikimedia.org/T215493) (owner: 10Tulsi Bhagat) [13:11:23] (03PS3) 10Tulsi Bhagat: Create 'extendedconfirmed' user group for viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489612 (https://phabricator.wikimedia.org/T215493) [13:14:29] (03CR) 10Tulsi Bhagat: "@MarcoAurelio: Thank you so much! 
;)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489612 (https://phabricator.wikimedia.org/T215493) (owner: 10Tulsi Bhagat) [13:33:31] (03PS2) 10Muehlenhoff: Don't install pxz on buster [puppet] - 10https://gerrit.wikimedia.org/r/489252 [13:34:41] (03CR) 10Muehlenhoff: [C: 03+2] Don't install pxz on buster [puppet] - 10https://gerrit.wikimedia.org/r/489252 (owner: 10Muehlenhoff) [13:37:36] 10Operations, 10Cloud-VPS, 10Toolforge, 10Traffic, 10Patch-For-Review: Wikimedia varnish rules no longer exempt all Cloud VPS/Toolforge IPs from rate limits (HTTP 429 response) - https://phabricator.wikimedia.org/T213475 (10akosiaris) >>! In T213475#4941129, @Cyberpower678 wrote: > Question, when will th... [13:51:09] (03PS1) 10Jbond: Offboard user Edward Galvez (chedasaurus) [puppet] - 10https://gerrit.wikimedia.org/r/489682 (https://phabricator.wikimedia.org/T215792) [13:51:11] (03PS1) 10GTirloni: wiki replicas: depool labsdb1010 for changes [puppet] - 10https://gerrit.wikimedia.org/r/489683 (https://phabricator.wikimedia.org/T212308) [13:51:36] (03PS2) 10Jbond: Offboard user Edward Galvez (chedasaurus) [puppet] - 10https://gerrit.wikimedia.org/r/489682 (https://phabricator.wikimedia.org/T215792) [13:52:53] (03CR) 10jerkins-bot: [V: 04-1] Offboard user Edward Galvez (chedasaurus) [puppet] - 10https://gerrit.wikimedia.org/r/489682 (https://phabricator.wikimedia.org/T215792) (owner: 10Jbond) [13:54:05] (03CR) 10Muehlenhoff: Offboard user Edward Galvez (chedasaurus) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/489682 (https://phabricator.wikimedia.org/T215792) (owner: 10Jbond) [13:54:21] (03PS2) 10Muehlenhoff: Only install mcelog on jessie and stretch [puppet] - 10https://gerrit.wikimedia.org/r/489635 (https://phabricator.wikimedia.org/T205396) [13:59:39] (03PS3) 10Jbond: Offboard user Edward Galvez (chedasaurus) [puppet] - 10https://gerrit.wikimedia.org/r/489682 (https://phabricator.wikimedia.org/T215792) [14:00:26] (03CR) 10jerkins-bot: [V: 04-1] Offboard 
user Edward Galvez (chedasaurus) [puppet] - 10https://gerrit.wikimedia.org/r/489682 (https://phabricator.wikimedia.org/T215792) (owner: 10Jbond) [14:01:26] (03Abandoned) 10Jbond: Offboard user Edward Galvez (chedasaurus) [puppet] - 10https://gerrit.wikimedia.org/r/489682 (https://phabricator.wikimedia.org/T215792) (owner: 10Jbond) [14:01:28] (03CR) 10Jbond: Offboard user Edward Galvez (chedasaurus) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/489682 (https://phabricator.wikimedia.org/T215792) (owner: 10Jbond) [14:06:06] (03CR) 10Marostegui: [C: 03+1] wiki replicas: depool labsdb1010 for changes [puppet] - 10https://gerrit.wikimedia.org/r/489683 (https://phabricator.wikimedia.org/T212308) (owner: 10GTirloni) [14:06:09] (03PS1) 10Jbond: Offboard user Edward Galvez (chedasaurus) [puppet] - 10https://gerrit.wikimedia.org/r/489687 (https://phabricator.wikimedia.org/T215792) [14:06:45] (03CR) 10jerkins-bot: [V: 04-1] Offboard user Edward Galvez (chedasaurus) [puppet] - 10https://gerrit.wikimedia.org/r/489687 (https://phabricator.wikimedia.org/T215792) (owner: 10Jbond) [14:07:18] (03PS2) 10GTirloni: wiki replicas: depool labsdb1010 for changes [puppet] - 10https://gerrit.wikimedia.org/r/489683 (https://phabricator.wikimedia.org/T212308) [14:07:35] (03PS2) 10Jbond: Offboard user Edward Galvez (chedasaurus) [puppet] - 10https://gerrit.wikimedia.org/r/489687 (https://phabricator.wikimedia.org/T215792) [14:08:01] (03CR) 10GTirloni: [C: 03+2] wiki replicas: depool labsdb1010 for changes [puppet] - 10https://gerrit.wikimedia.org/r/489683 (https://phabricator.wikimedia.org/T212308) (owner: 10GTirloni) [14:08:26] !log Deploy schema change on db1116:3318 - T210713 [14:08:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:29] T210713: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713 [14:09:08] (03PS3) 10Jbond: Offboard user Edward Galvez (chedasaurus) [puppet] - 
10https://gerrit.wikimedia.org/r/489687 (https://phabricator.wikimedia.org/T215792) [14:09:58] !log Reload haproxy on dbproxy1010 to depool labsdb1010 - T212308 [14:10:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:01] T212308: Rerun maintain-views for all tables to drop valid_tag and tag_summary tables - https://phabricator.wikimedia.org/T212308 [14:10:45] (03CR) 10Jbond: [C: 03+2] Offboard user Edward Galvez (chedasaurus) [puppet] - 10https://gerrit.wikimedia.org/r/489687 (https://phabricator.wikimedia.org/T215792) (owner: 10Jbond) [14:16:49] !log depool and take a snapshot of prometheus data for all instances on prometheus2003 - T187987 [14:16:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:52] T187987: Serve >= 50% of production Prometheus systems with Prometheus v2 - https://phabricator.wikimedia.org/T187987 [14:21:09] !log Remove staging from dbstore1003 - T210478 [14:21:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:11] T210478: Migrate dbstore1002 to a multi instance setup on dbstore100[3-5] - https://phabricator.wikimedia.org/T210478 [14:21:48] 10Operations, 10monitoring, 10Goal, 10Patch-For-Review: Upgrade production prometheus-node-exporter to >= 0.16 - https://phabricator.wikimedia.org/T213708 (10MoritzMuehlenhoff) We currently pin prometheus-node-exporter to 0.17.0+ds-2 on the selected hosts and for buster, but yesterday 0.17.0+ds-3 migrated... [14:23:32] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:23:59] (03CR) 10Andrew Bogott: "> profile::wmcs::root_tty" [puppet] - 10https://gerrit.wikimedia.org/r/489299 (https://phabricator.wikimedia.org/T215211) (owner: 10Andrew Bogott) [14:25:40] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:26:29] (03CR) 10Muehlenhoff: "Ack, we don't need the pin in d-i, once puppet runs the first time the correct pinning is recreated from puppet anyway." [puppet] - 10https://gerrit.wikimedia.org/r/489638 (https://phabricator.wikimedia.org/T213546) (owner: 10Muehlenhoff) [14:26:49] (03PS2) 10Muehlenhoff: Install facter 2.4.6 on buster in early d-i stage [puppet] - 10https://gerrit.wikimedia.org/r/489638 (https://phabricator.wikimedia.org/T213546) [14:28:47] 10Operations, 10monitoring, 10Patch-For-Review: Serve >= 50% of production Prometheus systems with Prometheus v2 - https://phabricator.wikimedia.org/T187987 (10fgiunchedi) [14:30:18] (03PS6) 10Andrew Bogott: Cloud vms: enable a default tty [puppet] - 10https://gerrit.wikimedia.org/r/489299 (https://phabricator.wikimedia.org/T215211) [14:30:40] PROBLEM - Request latencies on acrab is CRITICAL: instance=10.192.16.26:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:31:06] (03CR) 10jerkins-bot: [V: 04-1] Cloud vms: enable a default tty [puppet] - 10https://gerrit.wikimedia.org/r/489299 (https://phabricator.wikimedia.org/T215211) (owner: 10Andrew Bogott) [14:31:28] (03CR) 10Muehlenhoff: [C: 03+2] Install facter 2.4.6 on buster in early d-i stage [puppet] - 10https://gerrit.wikimedia.org/r/489638 (https://phabricator.wikimedia.org/T213546) (owner: 10Muehlenhoff) [14:31:55] (03CR) 10GTirloni: "First attempt(s) at converting admin cron jobs to systemd timers. I would like your input before I go on changing more jobs." [puppet] - 10https://gerrit.wikimedia.org/r/489394 (https://phabricator.wikimedia.org/T210818) (owner: 10GTirloni) [14:32:16] (03CR) 10GTirloni: "First attempt(s) at converting admin cron jobs to systemd timers. I would like your input before I go on changing more jobs." 
[puppet] - 10https://gerrit.wikimedia.org/r/489393 (https://phabricator.wikimedia.org/T210818) (owner: 10GTirloni) [14:32:28] jouncebot: now [14:32:28] No deployments scheduled for the next 3 hour(s) and 27 minute(s) [14:32:38] Coolio!, I'm going to merge some beta config changes [14:32:43] (03PS7) 10Andrew Bogott: Cloud vms: enable a default tty [puppet] - 10https://gerrit.wikimedia.org/r/489299 (https://phabricator.wikimedia.org/T215211) [14:32:45] (once i prepare them) [14:33:02] RECOVERY - Request latencies on acrab is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:34:22] (03PS8) 10Andrew Bogott: Cloud vms: enable a default tty [puppet] - 10https://gerrit.wikimedia.org/r/489299 (https://phabricator.wikimedia.org/T215211) [14:43:16] (03CR) 10Addshore: "This is already essentially deployed? :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480550 (https://phabricator.wikimedia.org/T201838) (owner: 10WMDE-leszek) [14:43:18] (03CR) 10Addshore: "This is already essentially deployed? :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480551 (https://phabricator.wikimedia.org/T201838) (owner: 10WMDE-leszek) [14:45:21] (03CR) 10Ottomata: "you got it! Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/489338 (https://phabricator.wikimedia.org/T215680) (owner: 10Nuria) [14:47:26] !log installing curl security updates on trusty [14:47:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:36] (03CR) 10Ottomata: "nice" [puppet] - 10https://gerrit.wikimedia.org/r/489243 (https://phabricator.wikimedia.org/T212259) (owner: 10Elukey) [14:47:39] (03CR) 10Ottomata: [C: 03+1] role::analytics_test_cluster::coordinator: add basic camus support [puppet] - 10https://gerrit.wikimedia.org/r/489243 (https://phabricator.wikimedia.org/T212259) (owner: 10Elukey) [14:48:06] (03CR) 10Jcrespo: [C: 03+2] mariadb: Introduce and pool db1118 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489281 (https://phabricator.wikimedia.org/T214720) (owner: 10Jcrespo) [14:48:13] (03PS1) 10Addshore: Wikibase.php, add conditional setting of useEntitySourceBasedFederation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489698 (https://phabricator.wikimedia.org/T214557) [14:48:25] !log mbsantos@deploy1001 Started deploy [kartotherian/deploy@173adbe] (stretch): Updating maps2004 kartotherian for the stretch migration work [14:48:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:43] PROBLEM - High CPU load on API appserver on mw1348 is CRITICAL: CRITICAL - load average: 86.53, 36.12, 22.31 [14:48:46] !log mbsantos@deploy1001 Finished deploy [kartotherian/deploy@173adbe] (stretch): Updating maps2004 kartotherian for the stretch migration work (duration: 00m 21s) [14:48:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:12] (03PS1) 10Addshore: BETA: wmgUseEntitySourceBasedFederation true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489701 (https://phabricator.wikimedia.org/T214557) [14:49:17] (03Merged) 10jenkins-bot: mariadb: Introduce and pool db1118 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489281 
(https://phabricator.wikimedia.org/T214720) (owner: 10Jcrespo) [14:49:42] (03CR) 10jenkins-bot: mariadb: Introduce and pool db1118 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489281 (https://phabricator.wikimedia.org/T214720) (owner: 10Jcrespo) [14:49:43] RECOVERY - High CPU load on API appserver on mw1348 is OK: OK - load average: 34.40, 30.30, 21.17 [14:50:46] !log mbsantos@deploy1001 Started deploy [tilerator/deploy@d546183] (stretch): Updating maps2004 tilerator for the stretch migration work [14:50:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:25] !log mbsantos@deploy1001 Finished deploy [tilerator/deploy@d546183] (stretch): Updating maps2004 tilerator for the stretch migration work (duration: 00m 39s) [14:51:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:35] (03CR) 10Addshore: [C: 03+2] Wikibase.php, add conditional setting of useEntitySourceBasedFederation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489698 (https://phabricator.wikimedia.org/T214557) (owner: 10Addshore) [14:51:43] (03PS2) 10Addshore: Wikibase.php, add conditional setting of useEntitySourceBasedFederation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489698 (https://phabricator.wikimedia.org/T214557) [14:51:46] (03CR) 10Addshore: [C: 03+2] Wikibase.php, add conditional setting of useEntitySourceBasedFederation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489698 (https://phabricator.wikimedia.org/T214557) (owner: 10Addshore) [14:52:53] (03Merged) 10jenkins-bot: Wikibase.php, add conditional setting of useEntitySourceBasedFederation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489698 (https://phabricator.wikimedia.org/T214557) (owner: 10Addshore) [14:53:06] (03PS1) 10Marostegui: mariadb: Disable local_infile [puppet] - 10https://gerrit.wikimedia.org/r/489703 [14:53:40] jynus: just noticed that the patch re db1118 just before mine I'l let you continue and do 
mine after [14:53:53] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Pool db1118 for the first time (duration: 00m 47s) [14:53:54] (as i dont see the sync above) [14:53:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:58] aaah, there it is! :) [14:54:19] (03PS2) 10Marostegui: mariadb: Disable local_infile on some roles [puppet] - 10https://gerrit.wikimedia.org/r/489703 [14:55:24] I am reverting [14:55:32] ack [14:55:51] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Revert (duration: 00m 45s) [14:55:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:58] (03PS1) 10Jcrespo: Revert "mariadb: Introduce and pool db1118 with low weight" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489704 [14:56:57] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] Revert "mariadb: Introduce and pool db1118 with low weight" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489704 (owner: 10Jcrespo) [14:57:06] grants issue? [14:57:18] probably [14:57:29] addshore: do you want me to rebase your change? [14:57:34] Yeah, no wikiadmin there from what I can see [14:57:39] wikiuser [14:57:56] jynus: whatever works for you, it will be a noop in prod (just a beta change) [14:58:02] i can sync it once you are done [14:58:30] I have rebased your change, I wanted to do it to leave staging and deploy code in sync [14:58:39] yup, it looks good to me now :) [14:58:39] as I deployed before reverting [14:59:38] jynus: am I okay to go ahead and sync it from your side? 
[15:00:00] (03CR) 10jenkins-bot: Wikibase.php, add conditional setting of useEntitySourceBasedFederation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489698 (https://phabricator.wikimedia.org/T214557) (owner: 10Addshore) [15:00:02] (03CR) 10jenkins-bot: Revert "mariadb: Introduce and pool db1118 with low weight" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489704 (owner: 10Jcrespo) [15:00:23] addshore: not touching deployment eqiad for a while now [15:00:30] ack! [15:00:30] one sec to confirm errors stopped [15:00:33] !log addshore@deploy1001 sync-file aborted: Wikibase.php, add conditional setting of useEntitySourceBasedFederation (duration: 00m 01s) [15:00:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:39] okay :P [15:01:27] the problem is the jobqueue has a lot of tail [15:02:36] (03Abandoned) 10WMDE-leszek: Beta: use the new link formatter to format P1 on wikidata beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480551 (https://phabricator.wikimedia.org/T201838) (owner: 10WMDE-leszek) [15:02:45] (03Abandoned) 10WMDE-leszek: Added setting to adjust the range of PropertyIDs using new link formatter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480550 (https://phabricator.wikimedia.org/T201838) (owner: 10WMDE-leszek) [15:03:01] it has a few seconds of a bad config and it keeps headbutting on it for a long time [15:03:48] (03CR) 10WMDE-leszek: "yay!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489698 (https://phabricator.wikimedia.org/T214557) (owner: 10Addshore) [15:03:54] addshore: it also creates lag on logstash, so you may want to wait anyway [15:04:02] will do! 
:) [15:04:09] 5 minutes of lag right now [15:07:33] for some reason, it wasn't deployed properly [15:07:38] I am trying again [15:07:58] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Revert, second try (duration: 00m 47s) [15:07:58] jynus: there is no wikiuser on db1118 [15:07:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:07] ah right you are still reverting :) [15:08:09] marostegui: no, no the actual error [15:08:15] but the deploy [15:08:18] :) [15:09:15] (03PS1) 10Jcrespo: Revert "Revert "mariadb: Introduce and pool db1118 with low weight"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489711 [15:11:01] 10Operations, 10monitoring, 10Patch-For-Review: Serve >= 50% of production Prometheus systems with Prometheus v2 - https://phabricator.wikimedia.org/T187987 (10fgiunchedi) [15:12:12] (03CR) 10Muehlenhoff: "Looks nice from a first glance, some comments inline. I'll test this in a labs instance more completely tomorrow." (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/486513 (https://phabricator.wikimedia.org/T116011) (owner: 10Jbond) [15:16:19] wow, logstash is still super delayed [15:16:36] yeah :-) [15:17:32] addshore: https://phabricator.wikimedia.org/T215611 [15:17:49] vote on ^that to give more visibility [15:18:20] ottomata: hiiiii - when you are caffeinated lemme know if https://gerrit.wikimedia.org/r/#/c/489660/ is good or no-bueno [15:18:33] (03PS1) 10GTirloni: Revert "wiki replicas: depool labsdb1010 for changes" [puppet] - 10https://gerrit.wikimedia.org/r/489713 (https://phabricator.wikimedia.org/T212308) [15:18:40] this should have been for #analytics --^ [15:18:46] :) [15:19:14] !log add missing grants to db1118 [15:19:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:39] (03CR) 10GTirloni: [C: 03+2] Revert "wiki replicas: depool labsdb1010 for changes" [puppet] - 10https://gerrit.wikimedia.org/r/489713 
(https://phabricator.wikimedia.org/T212308) (owner: 10GTirloni) [15:20:09] 13 mins of lag [15:20:26] well, I'm going to continue, watching on mwlog, as my 2 patches are beta only :) [15:20:40] elukey: interesting.... [15:20:58] !log Repool labsdb1010 - T212308 [15:21:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:02] T212308: Rerun maintain-views for all tables to drop valid_tag and tag_summary tables - https://phabricator.wikimedia.org/T212308 [15:21:04] should that just go in a stat profile? [15:21:22] !log addshore@deploy1001 Synchronized wmf-config/Wikibase.php: Wikibase.php, add conditional setting of useEntitySourceBasedFederation (duration: 00m 47s) [15:21:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:37] (03PS2) 10Addshore: BETA: wmgUseEntitySourceBasedFederation true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489701 (https://phabricator.wikimedia.org/T214557) [15:21:40] (03CR) 10Addshore: [C: 03+2] BETA: wmgUseEntitySourceBasedFederation true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489701 (https://phabricator.wikimedia.org/T214557) (owner: 10Addshore) [15:21:43] i see its kinda like the packages one... [15:21:43] hm [15:22:05] (03PS1) 10GTirloni: wiki replicas: depool labsdb1011 for changes [puppet] - 10https://gerrit.wikimedia.org/r/489714 (https://phabricator.wikimedia.org/T212308) [15:22:37] ottomata: yep exactly, the idea is to have it everywhere for people to use it.. we can have a single profile for packages/repositories/etc.. 
[15:22:38] (Merged) jenkins-bot: BETA: wmgUseEntitySourceBasedFederation true [mediawiki-config] - https://gerrit.wikimedia.org/r/489701 (https://phabricator.wikimedia.org/T214557) (owner: Addshore) [15:22:40] (CR) Marostegui: [C: +1] wiki replicas: depool labsdb1011 for changes [puppet] - https://gerrit.wikimedia.org/r/489714 (https://phabricator.wikimedia.org/T212308) (owner: GTirloni) [15:22:51] (CR) GTirloni: [C: +2] wiki replicas: depool labsdb1011 for changes [puppet] - https://gerrit.wikimedia.org/r/489714 (https://phabricator.wikimedia.org/T212308) (owner: GTirloni) [15:23:05] (CR) jenkins-bot: BETA: wmgUseEntitySourceBasedFederation true [mediawiki-config] - https://gerrit.wikimedia.org/r/489701 (https://phabricator.wikimedia.org/T214557) (owner: Addshore) [15:23:07] elukey: yeah maybe that'd be better...just call it... ::dependencies? [15:23:08] dunno [15:23:23] i don't think i care too much here :) I'll +1 and you can decide [15:23:28] (CR) Ottomata: [C: +1] Add common statistics repositories to stat/notebook hosts [puppet] - https://gerrit.wikimedia.org/r/489660 (https://phabricator.wikimedia.org/T212386) (owner: Elukey) [15:23:44] !log Reload haproxy on dbproxy1010 to depool labsdb1011 - https://phabricator.wikimedia.org/T212308 [15:23:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:06] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: BETA ONLY (duration: 00m 47s) [15:24:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:37] (CR) Ottomata: Helm chart for eventgate-analytics deployment (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/483035 (https://phabricator.wikimedia.org/T211247) (owner: Ottomata) [15:27:15] (PS2) Elukey: Add common statistics repositories to stat/notebook hosts [puppet] - https://gerrit.wikimedia.org/r/489660 (https://phabricator.wikimedia.org/T212386)
[15:27:30] ottomata: going to merge for the moment, but I have a note here to refactor :) [15:27:43] k! [15:27:45] thanks! [15:27:47] no worried, this might be fine [15:28:03] (03CR) 10Ottomata: Helm chart for eventgate-analytics deployment (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/483035 (https://phabricator.wikimedia.org/T211247) (owner: 10Ottomata) [15:28:14] (03CR) 10Elukey: [C: 03+2] Add common statistics repositories to stat/notebook hosts [puppet] - 10https://gerrit.wikimedia.org/r/489660 (https://phabricator.wikimedia.org/T212386) (owner: 10Elukey) [15:30:32] (03PS1) 10Vgutierrez: secret: Add authdns-acmechief dummy keyholder SSH keys [labs/private] - 10https://gerrit.wikimedia.org/r/489715 [15:33:40] 10Operations, 10ops-codfw, 10Patch-For-Review, 10User-fgiunchedi: ms-be2047 spontaneous reboots - https://phabricator.wikimedia.org/T209921 (10fgiunchedi) a:05fgiunchedi→03Papaul Host is in service at full weight, assigning to @Papaul for return of previous hardware [15:34:25] (03PS1) 10Vgutierrez: secret: add dummy LE ACMEv2 private keys for acmechief[12]001 [labs/private] - 10https://gerrit.wikimedia.org/r/489717 [15:36:42] (03PS2) 10Vgutierrez: secret: Add authdns-acmechief dummy keyholder SSH keys [labs/private] - 10https://gerrit.wikimedia.org/r/489715 (https://phabricator.wikimedia.org/T207389) [15:36:52] (03PS2) 10Vgutierrez: secret: add dummy LE ACMEv2 private keys for acmechief[12]001 [labs/private] - 10https://gerrit.wikimedia.org/r/489717 (https://phabricator.wikimedia.org/T207389) [15:37:09] (03CR) 10Vgutierrez: [V: 03+2 C: 03+2] secret: Add authdns-acmechief dummy keyholder SSH keys [labs/private] - 10https://gerrit.wikimedia.org/r/489715 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [15:37:19] (03CR) 10Vgutierrez: [V: 03+2 C: 03+2] secret: add dummy LE ACMEv2 private keys for acmechief[12]001 [labs/private] - 10https://gerrit.wikimedia.org/r/489717 (https://phabricator.wikimedia.org/T207389) (owner: 
10Vgutierrez) [15:37:30] (03PS3) 10Vgutierrez: secret: add dummy LE ACMEv2 private keys for acmechief[12]001 [labs/private] - 10https://gerrit.wikimedia.org/r/489717 (https://phabricator.wikimedia.org/T207389) [15:40:24] (03CR) 10Jcrespo: [C: 03+2] Revert "Revert "mariadb: Introduce and pool db1118 with low weight"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489711 (owner: 10Jcrespo) [15:41:28] (03Merged) 10jenkins-bot: Revert "Revert "mariadb: Introduce and pool db1118 with low weight"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489711 (owner: 10Jcrespo) [15:44:23] PROBLEM - puppet last run on notebook1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/srv/deployment/mediawiki-config] [15:45:41] (03PS9) 10Andrew Bogott: Cloud vms: enable a default tty [puppet] - 10https://gerrit.wikimedia.org/r/489299 (https://phabricator.wikimedia.org/T215211) [15:45:48] (03CR) 10jenkins-bot: Revert "Revert "mariadb: Introduce and pool db1118 with low weight"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489711 (owner: 10Jcrespo) [15:47:00] PROBLEM - puppet last run on stat1006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/srv/deployment/mediawiki-config] [15:48:00] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/489154 (https://phabricator.wikimedia.org/T212850) (owner: 10Mathew.onipe) [15:48:07] (03PS4) 10Gehel: icinga: enable check for psi and omega clusters [puppet] - 10https://gerrit.wikimedia.org/r/489154 (https://phabricator.wikimedia.org/T212850) (owner: 10Mathew.onipe) [15:49:27] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1118 (duration: 00m 48s) [15:49:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:33] (03CR) 10Gehel: [C: 03+2] icinga: enable check for psi and omega clusters [puppet] - 10https://gerrit.wikimedia.org/r/489154 (https://phabricator.wikimedia.org/T212850) (owner: 10Mathew.onipe) [15:49:36] (03PS1) 10Vgutierrez: acme_chief: Create acme_chief module as a duplicate of certcentral [puppet] - 10https://gerrit.wikimedia.org/r/489719 (https://phabricator.wikimedia.org/T207389) [15:49:38] (03PS1) 10Vgutierrez: site: Add acmechief[12]001 as acme-chief servers [puppet] - 10https://gerrit.wikimedia.org/r/489720 (https://phabricator.wikimedia.org/T207389) [15:49:39] 10Operations, 10ops-codfw, 10Patch-For-Review, 10User-fgiunchedi: rack/setup/install new ms-be servers ms-be204[4-9] ,ms-be2050 - https://phabricator.wikimedia.org/T209395 (10Papaul) [15:49:48] 10Operations, 10ops-codfw, 10Patch-For-Review, 10User-fgiunchedi: ms-be2047 spontaneous reboots - https://phabricator.wikimedia.org/T209921 (10Papaul) 05Open→03Resolved Previous hardware has been already returned since last Thursday. (See comment on Feb7) We can resolve this task. 
[15:50:10] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, 10User-fsero: Set up a local redis proxy since docker-registry can only connect to one redis instance for caching - https://phabricator.wikimedia.org/T215809 (10fsero) p:05Triage→03Normal [15:50:34] (03CR) 10jerkins-bot: [V: 04-1] acme_chief: Create acme_chief module as a duplicate of certcentral [puppet] - 10https://gerrit.wikimedia.org/r/489719 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [15:50:37] 10Operations, 10Analytics, 10Discovery, 10Research: Workflow to be able to move data files computed in jobs from analytics cluster to production - https://phabricator.wikimedia.org/T213976 (10Ottomata) > How big is the dataset and how fast is it going to grow? In the hundreds of megabytes I believe. @half... [15:50:43] I see no errors this time [15:50:45] (03PS30) 10Jbond: Create module for managing ulogd [puppet] - 10https://gerrit.wikimedia.org/r/486513 (https://phabricator.wikimedia.org/T116011) [15:51:06] 10Operations, 10monitoring, 10Patch-For-Review: EDAC events not being reported by node-exporter? - https://phabricator.wikimedia.org/T214529 (10fgiunchedi) Thanks for the deep investigation, truly fascinating! WRT what @bblack was saying that the host was displaying errors on its LCD, I'm wondering if alerti... [15:51:18] PROBLEM - puppet last run on notebook1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/srv/deployment/mediawiki-config] [15:51:32] checking --^ [15:53:04] Operations, Prod-Kubernetes, serviceops, Kubernetes, User-fsero: Package envoy 1.9.0 for stretch and use it as redis proxy on docker registry - https://phabricator.wikimedia.org/T215810 (fsero) p:Triage→Normal [15:56:21] (CR) Vgutierrez: [C: +2] install_server: Add DHCP entries for acmechief[12]001 (2 comments) [puppet] - https://gerrit.wikimedia.org/r/489164 (https://phabricator.wikimedia.org/T207389) (owner: Vgutierrez) [15:56:32] (PS2) Vgutierrez: install_server: Add DHCP entries for acmechief[12]001 [puppet] - https://gerrit.wikimedia.org/r/489164 (https://phabricator.wikimedia.org/T207389) [15:57:05] Operations, Analytics, Discovery, Research: Workflow to be able to move data files computed in jobs from analytics cluster to production - https://phabricator.wikimedia.org/T213976 (EBernhardson) >>! In T213976#4943916, @Ottomata wrote: >> How big is the dataset and how fast is it going to grow?...
[15:57:37] (03PS1) 10Elukey: profile::analytics::cluster::repositories::statistics: fix mw directory [puppet] - 10https://gerrit.wikimedia.org/r/489723 (https://phabricator.wikimedia.org/T212386) [15:58:53] (03CR) 10Elukey: [C: 03+2] profile::analytics::cluster::repositories::statistics: fix mw directory [puppet] - 10https://gerrit.wikimedia.org/r/489723 (https://phabricator.wikimedia.org/T212386) (owner: 10Elukey) [15:59:00] (03PS2) 10Elukey: profile::analytics::cluster::repositories::statistics: fix mw directory [puppet] - 10https://gerrit.wikimedia.org/r/489723 (https://phabricator.wikimedia.org/T212386) [15:59:02] (03CR) 10Elukey: [V: 03+2 C: 03+2] profile::analytics::cluster::repositories::statistics: fix mw directory [puppet] - 10https://gerrit.wikimedia.org/r/489723 (https://phabricator.wikimedia.org/T212386) (owner: 10Elukey) [15:59:14] 10Operations, 10cloud-services-team (Kanban): reprepro: automate incoming processing - https://phabricator.wikimedia.org/T215812 (10aborrero) [16:02:06] (03CR) 10Jforrester: "(Yay.)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489701 (https://phabricator.wikimedia.org/T214557) (owner: 10Addshore) [16:03:42] (03PS1) 10Elukey: profile::analytics::cluster::repo::statistics: fix directory (again) [puppet] - 10https://gerrit.wikimedia.org/r/489724 (https://phabricator.wikimedia.org/T212386) [16:04:43] PROBLEM - puppet last run on stat1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:05:29] 10Operations, 10monitoring, 10Goal, 10Patch-For-Review: Upgrade production prometheus-node-exporter to >= 0.16 - https://phabricator.wikimedia.org/T213708 (10fgiunchedi) Uploading `-3` internally and changing puppet to install that version sounds good to me! 
[16:07:21] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review, 10User-Marostegui: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Marostegui) @Cmjohnson I can take care of the installations once you've done the RAID and added DNS and pxeboot entries with the MACs :-) [16:07:35] (03CR) 10Elukey: [C: 03+2] profile::analytics::cluster::repo::statistics: fix directory (again) [puppet] - 10https://gerrit.wikimedia.org/r/489724 (https://phabricator.wikimedia.org/T212386) (owner: 10Elukey) [16:07:44] 10Operations, 10Analytics, 10Discovery, 10Research: Workflow to be able to move data files computed in jobs from analytics cluster to production - https://phabricator.wikimedia.org/T213976 (10EBernhardson) Longer term search will potentially want to generate some significantly larger datasets to ship to pr... [16:07:55] (03CR) 10Alexandros Kosiaris: "Reading https://phabricator.wikimedia.org/T209011 I gather that maybe it would be better to add the public IP space and not 172.16.0.0/12?" 
[puppet] - 10https://gerrit.wikimedia.org/r/488516 (https://phabricator.wikimedia.org/T213475) (owner: 10Alexandros Kosiaris) [16:09:51] RECOVERY - puppet last run on notebook1004 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [16:11:48] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] mariadb-package: Upgrade to 10.1.38, add mariabackup to path [software] - 10https://gerrit.wikimedia.org/r/489657 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo) [16:12:18] (03CR) 10Bstorm: [C: 03+1] "It's very annoying that the puppet compiler cannot find the dummy sk for profile::grafana, so we can't use that to check for typos, it see" [puppet] - 10https://gerrit.wikimedia.org/r/489394 (https://phabricator.wikimedia.org/T210818) (owner: 10GTirloni) [16:12:37] RECOVERY - puppet last run on stat1006 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [16:13:08] (03CR) 10Jdlrobson: [C: 03+1] Remove main page special casing from lawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489420 (https://phabricator.wikimedia.org/T215709) (owner: 10Zoranzoki21) [16:13:32] (03CR) 10Jdlrobson: [C: 03+1] "This can be swatted in one of the available swat windows" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489420 (https://phabricator.wikimedia.org/T215709) (owner: 10Zoranzoki21) [16:15:15] RECOVERY - puppet last run on stat1007 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [16:17:03] RECOVERY - puppet last run on notebook1003 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [16:17:41] (03PS31) 10Jbond: Create module for managing ulogd [puppet] - 10https://gerrit.wikimedia.org/r/486513 (https://phabricator.wikimedia.org/T116011) [16:18:14] (03CR) 10jerkins-bot: [V: 04-1] Create module for managing ulogd [puppet] - 10https://gerrit.wikimedia.org/r/486513 (https://phabricator.wikimedia.org/T116011) (owner: 10Jbond) [16:18:38] (03PS10) 10Andrew Bogott: Cloud 
vms: enable a default tty [puppet] - 10https://gerrit.wikimedia.org/r/489299 (https://phabricator.wikimedia.org/T215211) [16:19:34] (03PS2) 10Jcrespo: mariadb: Depool db1106 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489669 [16:20:25] (03CR) 10Marostegui: [C: 03+1] mariadb: Depool db1106 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489669 (owner: 10Jcrespo) [16:21:22] 10Operations, 10Wikimedia-Mailing-lists, 10User-jijiki: Please create docker-sig@ mailing list - https://phabricator.wikimedia.org/T215563 (10jijiki) a:03jijiki [16:21:31] (03CR) 10Jcrespo: [C: 03+2] mariadb: Depool db1106 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489669 (owner: 10Jcrespo) [16:22:38] (03Merged) 10jenkins-bot: mariadb: Depool db1106 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489669 (owner: 10Jcrespo) [16:22:55] 10Operations, 10cloud-services-team (Kanban): reprepro: automate incoming processing - https://phabricator.wikimedia.org/T215812 (10jijiki) p:05Triage→03Normal [16:23:32] (03CR) 10BBlack: [C: 03+1] Move evaluation of wikimedia_trust/nets to puppet [puppet] - 10https://gerrit.wikimedia.org/r/488445 (https://phabricator.wikimedia.org/T213475) (owner: 10Alexandros Kosiaris) [16:24:07] (03CR) 10BBlack: [C: 03+1] varnish: Add new WMCS IP space as trusted [puppet] - 10https://gerrit.wikimedia.org/r/488516 (https://phabricator.wikimedia.org/T213475) (owner: 10Alexandros Kosiaris) [16:24:15] (03CR) 10GTirloni: [C: 03+1] "> Patch Set 2:" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/489409 (https://phabricator.wikimedia.org/T178601) (owner: 10BryanDavis) [16:24:40] 10Operations, 10ops-eqiad: mw1299 is down (jobrunner-canary, now up but depooled) - https://phabricator.wikimedia.org/T215569 (10Cmjohnson) racadm sel Record: 29 Date/Time: 02/02/2019 21:20:29 Source: system Severity: Critical Description: CPU 1 machine check error detected. -------------------... 
[16:25:36] (03PS32) 10Jbond: Create module for managing ulogd [puppet] - 10https://gerrit.wikimedia.org/r/486513 (https://phabricator.wikimedia.org/T116011) [16:25:56] (03CR) 10Andrew Bogott: "Latest version of this works immediately on Stretch, and works after a reboot on Jessie. That's tolerable although it would be nice to kn" [puppet] - 10https://gerrit.wikimedia.org/r/489299 (https://phabricator.wikimedia.org/T215211) (owner: 10Andrew Bogott) [16:26:11] (03CR) 10Alex Monk: [C: 03+2] certcentral: Implement staging time [software/certcentral] - 10https://gerrit.wikimedia.org/r/485594 (https://phabricator.wikimedia.org/T213737) (owner: 10Vgutierrez) [16:26:19] (03CR) 10jerkins-bot: [V: 04-1] Create module for managing ulogd [puppet] - 10https://gerrit.wikimedia.org/r/486513 (https://phabricator.wikimedia.org/T116011) (owner: 10Jbond) [16:27:10] ACKNOWLEDGEMENT - Disk space on labmon1001 is CRITICAL: DISK CRITICAL - free space: /srv 53092 MB (2% inode=93%): Arturo Borrero Gonzalez Working on it. 
[16:27:49] (03CR) 10Alexandros Kosiaris: [C: 04-1] Helm chart for eventgate-analytics deployment (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/483035 (https://phabricator.wikimedia.org/T211247) (owner: 10Ottomata) [16:27:53] (03Merged) 10jenkins-bot: certcentral: Implement staging time [software/certcentral] - 10https://gerrit.wikimedia.org/r/485594 (https://phabricator.wikimedia.org/T213737) (owner: 10Vgutierrez) [16:29:07] 10Operations, 10ops-eqiad, 10cloud-services-team: Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T215718 (10jijiki) [16:29:21] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1106 (duration: 00m 52s) [16:29:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:41] (03CR) 10jenkins-bot: certcentral: Implement staging time [software/certcentral] - 10https://gerrit.wikimedia.org/r/485594 (https://phabricator.wikimedia.org/T213737) (owner: 10Vgutierrez) [16:29:51] 10Operations, 10ops-eqiad: mw1299 is down (jobrunner-canary, now up but depooled) - https://phabricator.wikimedia.org/T215569 (10Cmjohnson) Ticket open for a new CPU You have successfully submitted request SR986247109. [16:30:57] (03CR) 10Alex Monk: Rename certcentral to acme-chief (032 comments) [software/certcentral] - 10https://gerrit.wikimedia.org/r/489150 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [16:31:17] !log Reverse password for globaldev user on dbstore1002 - T200801 [16:31:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:46] (03CR) 10jenkins-bot: mariadb: Depool db1106 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489669 (owner: 10Jcrespo) [16:32:08] (03CR) 10Alex Monk: [C: 04-1] "does not update README.md" [software/certcentral] - 10https://gerrit.wikimedia.org/r/489150 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [16:32:23] arg.. 
nice catch [16:32:37] (PS1) GTirloni: Revert "wiki replicas: depool labsdb1011 for changes" [puppet] - https://gerrit.wikimedia.org/r/489726 (https://phabricator.wikimedia.org/T212308) [16:32:50] ACKNOWLEDGEMENT - puppet last run on labtestcontrol2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues Arturo Borrero Gonzalez T215817 [16:33:47] (PS2) Muehlenhoff: maps/osm: Stop supporting trusty [puppet] - https://gerrit.wikimedia.org/r/489641 [16:35:50] (CR) GTirloni: [C: +2] Revert "wiki replicas: depool labsdb1011 for changes" [puppet] - https://gerrit.wikimedia.org/r/489726 (https://phabricator.wikimedia.org/T212308) (owner: GTirloni) [16:35:53] Operations, Core Platform Team, MediaWiki-Database, Wikimedia-Logstash, and 2 others: MediaWiki errors overloading logstash - https://phabricator.wikimedia.org/T215611 (jijiki) p:Triage→High [16:36:03] Operations, ops-eqiad, ops-eqsin, netops: Deploy cr2-eqsin - https://phabricator.wikimedia.org/T213121 (RobH) Chris shipped this, and I just put in an inbound shipment ticket for EQ Singapore SG#: 1-185487164544 UPS tracking 1Z291X71DG27842078 [16:36:12] Operations, ops-eqiad, ops-eqsin, netops: Deploy cr2-eqsin - https://phabricator.wikimedia.org/T213121 (RobH) [16:36:18] !log Reload haproxy on dbproxy1010 to repool labsdb1011 [16:36:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:24] Operations: confd: Superfluous golang dependency - https://phabricator.wikimedia.org/T215593 (jijiki) p:Triage→Normal [16:36:30] (PS4) Vgutierrez: Rename certcentral to acme-chief [software/certcentral] - https://gerrit.wikimedia.org/r/489150 (https://phabricator.wikimedia.org/T207389) [16:36:41] (PS3) Muehlenhoff: maps/osm: Stop supporting trusty [puppet] - https://gerrit.wikimedia.org/r/489641 [16:37:19] (CR) Vgutierrez: "> Patch Set 3: Code-Review-1" (1 comment)
[software/certcentral] - 10https://gerrit.wikimedia.org/r/489150 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [16:37:26] 10Operations, 10Core Platform Team, 10MediaWiki-Database, 10Wikimedia-Logstash, and 2 others: MediaWiki errors overloading logstash - https://phabricator.wikimedia.org/T215611 (10Anomie) Since the spike was logging from the LoadBalancer layer, pinging @aaron because he knows that code best. For the record... [16:38:24] (03CR) 10Muehlenhoff: [C: 03+2] maps/osm: Stop supporting trusty [puppet] - 10https://gerrit.wikimedia.org/r/489641 (owner: 10Muehlenhoff) [16:38:26] (03CR) 10Bstorm: "> Patch Set 2:" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/489409 (https://phabricator.wikimedia.org/T178601) (owner: 10BryanDavis) [16:38:59] 10Operations, 10ops-eqiad, 10ops-eqsin, 10netops: Deploy cr2-eqsin - https://phabricator.wikimedia.org/T213121 (10Cmjohnson) I also put in an in-bound ticket 1-185487164573 [16:39:17] (03CR) 10Thcipriani: Introduce gr-wikimedia-prettify-ci-comments (031 comment) [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/489483 (owner: 10Paladox) [16:40:04] 10Operations, 10ops-eqiad, 10ops-eqsin, 10netops: Deploy cr2-eqsin - https://phabricator.wikimedia.org/T213121 (10RobH) So we chatted about this in IRc since we both put in tickets. EQ SG3 is confusing to deal with, and it is likely easier to keep both tickets open and just ensure Arzhel knows about both. 
[16:40:48] (03PS33) 10Jbond: Create module for managing ulogd [puppet] - 10https://gerrit.wikimedia.org/r/486513 (https://phabricator.wikimedia.org/T116011) [16:41:15] (03CR) 10Alexandros Kosiaris: [C: 03+2] Move evaluation of wikimedia_trust/nets to puppet [puppet] - 10https://gerrit.wikimedia.org/r/488445 (https://phabricator.wikimedia.org/T213475) (owner: 10Alexandros Kosiaris) [16:41:17] (03PS7) 10Paladox: Introduce gr-wikimedia-prettify-ci-comments [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/489483 [16:41:20] (03PS5) 10Alexandros Kosiaris: Move evaluation of wikimedia_trust/nets to puppet [puppet] - 10https://gerrit.wikimedia.org/r/488445 (https://phabricator.wikimedia.org/T213475) [16:41:29] (03CR) 10jerkins-bot: [V: 04-1] Create module for managing ulogd [puppet] - 10https://gerrit.wikimedia.org/r/486513 (https://phabricator.wikimedia.org/T116011) (owner: 10Jbond) [16:42:03] (03CR) 10Paladox: Introduce gr-wikimedia-prettify-ci-comments (031 comment) [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/489483 (owner: 10Paladox) [16:42:51] (03PS1) 10Kosta Harlan: [WIP] GrowthExperiments: Soft launch of help panel on viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489729 (https://phabricator.wikimedia.org/T215666) [16:44:09] 10Operations, 10ExternalGuidance, 10Traffic, 10Patch-For-Review: Deliver mobile-based version for automatic translations - https://phabricator.wikimedia.org/T212197 (10Gilles) @dr0ptp4kt Any of the extension's modules is fine to do this in JS, they're low priority. It does mean that the link will appear un... [16:44:44] (03CR) 10Bstorm: "Is there anything blocking a merge on this? This seems like an important things to move along." 
[software/tools-webservice] - https://gerrit.wikimedia.org/r/489409 (https://phabricator.wikimedia.org/T178601) (owner: BryanDavis) [16:44:52] (PS34) Jbond: Create module for managing ulogd [puppet] - https://gerrit.wikimedia.org/r/486513 (https://phabricator.wikimedia.org/T116011) [16:46:07] (PS35) Jbond: Create module for managing ulogd [puppet] - https://gerrit.wikimedia.org/r/486513 (https://phabricator.wikimedia.org/T116011) [16:46:30] (CR) Jbond: "Thanks, comments inline" (5 comments) [puppet] - https://gerrit.wikimedia.org/r/486513 (https://phabricator.wikimedia.org/T116011) (owner: Jbond) [16:47:07] (CR) Arturo Borrero Gonzalez: "We have some physical servers running trusty. Those are really close to decom/shutdown/reimage, but anyway Could you please check if this " [puppet] - https://gerrit.wikimedia.org/r/489625 (owner: Muehlenhoff) [16:47:34] Operations, ops-eqiad, ops-eqsin, netops: Deploy cr2-eqsin - https://phabricator.wikimedia.org/T213121 (RobH) So deleting a ticket requires us to open a 'delete request' ticket, seems easier to just keep both open and they'll receive the shipment in on one or the other.
[16:47:45] (03PS1) 10Vgutierrez: Edit Project Config [software/certcentral] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/489731 [16:47:55] (03PS1) 10Muehlenhoff: profile::redis::multidc: Remove trusty support [puppet] - 10https://gerrit.wikimedia.org/r/489732 [16:48:31] (03PS17) 10Ottomata: Helm chart for eventgate-analytics deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/483035 (https://phabricator.wikimedia.org/T211247) [16:48:48] (03Abandoned) 10Vgutierrez: Edit Project Config [software/certcentral] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/489731 (owner: 10Vgutierrez) [16:49:11] (03PS2) 10Gilles: Set expiry headers on thumbnails [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/489022 (https://phabricator.wikimedia.org/T211661) [16:49:32] (03CR) 10Gilles: Set expiry headers on thumbnails (033 comments) [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/489022 (https://phabricator.wikimedia.org/T211661) (owner: 10Gilles) [16:49:34] !log stop, upgrade and restart db1106 [16:49:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:56] PROBLEM - puppet last run on notebook1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[cdh::hadoop::directory /user/spark] [16:50:43] (03CR) 10Muehlenhoff: [C: 03+1] "Seems fine, but needs meeting approval." [puppet] - 10https://gerrit.wikimedia.org/r/487040 (https://phabricator.wikimedia.org/T214922) (owner: 10Mathew.onipe) [16:52:55] (03CR) 10Arturo Borrero Gonzalez: "For the record, there is a standard tool which can be used to generate the d/changelog file from a git repository:" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/489409 (https://phabricator.wikimedia.org/T178601) (owner: 10BryanDavis) [16:52:58] (03CR) 10Muehlenhoff: [C: 03+1] "Seem fine, but let's wait with merging that until stat1005 has been reimaged to buster. 
If acked in the SRE meeting I'll take care of merg" [puppet] - https://gerrit.wikimedia.org/r/488606 (https://phabricator.wikimedia.org/T215384) (owner: Dzahn) [16:55:01] (CR) Alexandros Kosiaris: "One minor thing, rest LGTM" (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/483035 (https://phabricator.wikimedia.org/T211247) (owner: Ottomata) [16:55:43] Upgrading something? [16:55:44] https://upload.wikimedia.org/wikipedia/commons/thumb/6/67/Ruffhead_-_The_Statutes_at_Large_-_vol_9.djvu/page278-2402px-Ruffhead_-_The_Statutes_at_Large_-_vol_9.djvu.jpg [16:55:46] failed [16:55:53] (CR) Ottomata: Helm chart for eventgate-analytics deployment (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/483035 (https://phabricator.wikimedia.org/T211247) (owner: Ottomata) [16:56:41] (CR) Arturo Borrero Gonzalez: [C: +1] "> > profile::wmcs::root_tty" [puppet] - https://gerrit.wikimedia.org/r/489299 (https://phabricator.wikimedia.org/T215211) (owner: Andrew Bogott) [16:56:43] (PS1) Paladox: Update healthcheck [software/gerrit] (wmf/stable-2.15) - https://gerrit.wikimedia.org/r/489734 [16:56:48] Anyone awake? [16:57:16] yeah we're awake [16:57:37] what's the issue, because I can't repro (yet) [16:58:21] too many redirects [16:58:27] works for me as well [16:58:46] Odd [16:58:55] (CR) Arturo Borrero Gonzalez: [C: +1] varnish: Add new WMCS IP space as trusted [puppet] - https://gerrit.wikimedia.org/r/488516 (https://phabricator.wikimedia.org/T213475) (owner: Alexandros Kosiaris) [16:59:10] I see a huge spike of request rate increase on cache_upload only, which would line up with some kind of redirect loop if that's what's happening [16:59:13] "The page isn’t redirecting properly [16:59:13] An error occurred during a connection to upload.wikimedia.org. " [16:59:27] corresponding spike in 301 response codes from us as well [16:59:37] Did I break something :( [16:59:46] XD [16:59:51] what's recently deployed?
[16:59:54] https://phabricator.wikimedia.org/P8066 [17:00:14] PROBLEM - puppet last run on wdqs1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:00:21] wikimedia_nets change perhaps? [17:00:35] i'm not seeing it on all images [17:00:45] So presumably stuff that's still in caches is OK [17:01:25] 10Operations, 10DBA, 10Packaging: db2085 doesn't boot with 4.9.0-8-amd64 - https://phabricator.wikimedia.org/T214840 (10Marostegui) Same thing just happened with db1106 (PowerEdge R630 - same chassis as db2085) @MoritzMuehlenhoff can you help us with the approach you mentioned at T214840#4918369 ? [17:02:09] I'm reverting the recent varnish wikimedia_trust/_nets thing just in case [17:02:15] akosiaris: ^ [17:02:48] PROBLEM - puppet last run on db1073 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:02:50] 10Operations, 10ops-eqiad, 10Thumbor, 10serviceops: thumbor1004 memory errors - https://phabricator.wikimedia.org/T215411 (10RobH) a:03RobH [17:02:52] PROBLEM - puppet last run on ms-be1018 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:02:55] bblack: ok [17:02:57] (03PS1) 10BBlack: Revert "Move evaluation of wikimedia_trust/nets to puppet" [puppet] - 10https://gerrit.wikimedia.org/r/489737 [17:03:10] PROBLEM - Maps edge ulsfo on upload-lb.ulsfo.wikimedia.org is CRITICAL: /v4/marker/pin-m-fuel+ffffff.png (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 301 (expecting: 200): /v4/marker/pin-m-fuel+ffffff@2x.png (scaled pushpin marker with an icon) is CRITICAL: Test scaled pushpin marker with an icon returned the unexpected status 301 (expecting: 200): /v4/marker/pin-m+ffffff.png (Untitled test) is C [17:03:10] itled test returned the unexpected status 301 (expecting: 200): /osm-intl/9/207/163@1.5x.png (default scaled tile) is CRITICAL: Test default scaled tile returned the unexpected status 301 (expecting: 200): /osm-intl/11/828/655.png (get a tile in the middle of the ocean, with overzoom) is CRITICAL: Test get a tile in the middle of the ocean, with overzoom returned the unexpected status 301 (expecting: 200): /_info (Untitled test) [17:03:10] Untitled test returned the unexpected status 301 (expecting: 200): /v4/marker/pin-m+ffffff@2x.png (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 301 (expecting: 200): /img/osm-intl,1,0.0,0.0,100x100@1.5x.png (Small scaled map) is CRITICAL: Test Small scaled map returned the unexpected status 301 (expecting: 200): /osm-intl/info.json (tile service info for osm-intl) is CRITICAL: Test tile service i [17:03:10] returned the unexpected status 301 (expecting: 200) [17:03:11] (03CR) 10BBlack: [V: 03+2 C: 03+2] Revert "Move evaluation of wikimedia_trust/nets to puppet" [puppet] - 10https://gerrit.wikimedia.org/r/489737 (owner: 10BBlack) [17:03:35] <_joe_> oh yeah let's revert right now [17:03:59] I'm pushing it around now [17:04:02] is this localized to just upload ? 
[17:04:07] yes [17:04:40] (03PS18) 10Ottomata: Helm chart for eventgate-analytics deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/483035 (https://phabricator.wikimedia.org/T211247) [17:04:56] Okay So something was broken? [17:05:13] looking back at the patch being reverted, it does seem perhaps related that there's a diff for profile::cache::text in there and not a matching one for upload, even though this affects common VCL shared by both [17:05:15] https://upload.wikimedia.org/wikipedia/commons/3/3f/Prasar_Bharti_Logo.jpg [17:05:16] PROBLEM - Maps edge eqsin on upload-lb.eqsin.wikimedia.org is CRITICAL: /v4/marker/pin-m-fuel+ffffff.png (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 301 (expecting: 200): /v4/marker/pin-m-fuel+ffffff@2x.png (scaled pushpin marker with an icon) is CRITICAL: Test scaled pushpin marker with an icon returned the unexpected status 301 (expecting: 200): /v4/marker/pin-m+ffffff.png (Untitled test) is C [17:05:16] itled test returned the unexpected status 301 (expecting: 200): /osm-intl/9/207/163@1.5x.png (default scaled tile) is CRITICAL: Test default scaled tile returned the unexpected status 301 (expecting: 200): /osm-intl/11/828/655.png (get a tile in the middle of the ocean, with overzoom) is CRITICAL: Test get a tile in the middle of the ocean, with overzoom returned the unexpected status 301 (expecting: 200): /_info (Untitled test) [17:05:16] Untitled test returned the unexpected status 301 (expecting: 200): /v4/marker/pin-m+ffffff@2x.png (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 301 (expecting: 200): /img/osm-intl,1,0.0,0.0,100x100@1.5x.png (Small scaled map) is CRITICAL: Test Small scaled map returned the unexpected status 301 (expecting: 200): /osm-intl/info.json (tile service info for osm-intl) is CRITICAL: Test tile service i [17:05:16] returned the unexpected status 301 (expecting: 200) [17:05:17] Another user reported upload breaking in 
-tech [17:05:30] upload.wikimedia.org redirected you too many times. [17:05:30] Try clearing your cookies. [17:05:31] ERR_TOO_MANY_REDIRECTS [17:05:36] yep I have another report from someone with the same [17:05:37] WTF? ^ [17:05:41] (03CR) 10Bstorm: "I wonder about the monitoring part of this. Will it work?" [puppet] - 10https://gerrit.wikimedia.org/r/489393 (https://phabricator.wikimedia.org/T210818) (owner: 10GTirloni) [17:05:58] yannf: yep reported by a few folks [17:06:03] <_joe_> bblack: let me know when you have deployed the puppet change [17:06:06] my report was desktop user [17:06:13] (03CR) 10Bstorm: [C: 03+2] "yolo!" [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/486417 (https://phabricator.wikimedia.org/T107878) (owner: 10BryanDavis) [17:06:14] 10Operations, 10ops-eqiad, 10cloud-services-team: Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T215718 (10jijiki) p:05Triage→03Normal [17:06:16] I can reproduce on desktop as well [17:06:18] <_joe_> should I think of a way to purge all 301s on upload? [17:06:20] PROBLEM - puppet last run on analytics1055 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:06:32] <_joe_> akosiaris: yes with any non-previously cached url, yep [17:06:44] PROBLEM - puppet last run on mw1310 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:06:58] I'm getting too many redirects on desktop enwiki now too [17:07:06] PROBLEM - puppet last run on restbase1016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:07:06] PROBLEM - puppet last run on mw1293 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:07:10] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_upload site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [17:07:24] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-upload site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [17:07:24] 10Operations, 10DBA, 10Packaging: db2085/db1106 don't boot with 4.9.0-8-amd64 - https://phabricator.wikimedia.org/T214840 (10Marostegui) [17:07:51] ah yeah I see the bug now I think [17:07:59] yeah the revert should fix it [17:08:03] i can help coordinate the issue [17:08:09] writing something on https://etherpad.wikimedia.org/p/2019-02-11-upload-cache-failure [17:08:25] <_joe_> so if the revert fixes it [17:08:36] sorry I had to stare more just to confirm some things, it cost some time [17:08:38] <_joe_> we just need to purge all urls that return a 301 on upload [17:08:44] it's going out to caches now, and I'm pretty sure that patch is the problem [17:08:46] PROBLEM - puppet last run on ms-be1032 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:08:59] <_joe_> bblack: I'm pretty sure as well. [17:09:08] yeah it's missing a profile::cache::upload part [17:09:10] PROBLEM - puppet last run on phab2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:09:18] <_joe_> akosiaris: exactly [17:09:30] <_joe_> bblack: should I look at banning cache entries? 
[17:09:47] not sure yet [17:09:56] <_joe_> the puppetmasters have been overwhelmed [17:09:59] <_joe_> a bit [17:10:24] PROBLEM - Maps edge eqsin on upload-lb.eqsin.wikimedia.org is CRITICAL: /v4/marker/pin-m-fuel+ffffff.png (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 301 (expecting: 200): /v4/marker/pin-m-fuel+ffffff@2x.png (scaled pushpin marker with an icon) is CRITICAL: Test scaled pushpin marker with an icon returned the unexpected status 301 (expecting: 200): /v4/marker/pin-m+ffffff.png (Untitled test) is C [17:10:24] itled test returned the unexpected status 301 (expecting: 200): /osm-intl/9/207/163@1.5x.png (default scaled tile) is CRITICAL: Test default scaled tile returned the unexpected status 301 (expecting: 200): /osm-intl/11/828/655.png (get a tile in the middle of the ocean, with overzoom) is CRITICAL: Test get a tile in the middle of the ocean, with overzoom returned the unexpected status 301 (expecting: 200): /_info (Untitled test) [17:10:24] Untitled test returned the unexpected status 301 (expecting: 200): /v4/marker/pin-m+ffffff@2x.png (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 301 (expecting: 200): /img/osm-intl,1,0.0,0.0,100x100@1.5x.png (Small scaled map) is CRITICAL: Test Small scaled map returned the unexpected status 301 (expecting: 200): /osm-intl/info.json (tile service info for osm-intl) is CRITICAL: Test tile service i [17:10:24] returned the unexpected status 301 (expecting: 200) [17:10:44] PROBLEM - Maps edge ulsfo on upload-lb.ulsfo.wikimedia.org is CRITICAL: /v4/marker/pin-m-fuel+ffffff.png (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 301 (expecting: 200): /v4/marker/pin-m-fuel+ffffff@2x.png (scaled pushpin marker with an icon) is CRITICAL: Test scaled pushpin marker with an icon returned the unexpected status 301 (expecting: 200): /v4/marker/pin-m+ffffff.png (Untitled test) is C [17:10:44] itled test returned the unexpected 
status 301 (expecting: 200): /osm-intl/9/207/163@1.5x.png (default scaled tile) is CRITICAL: Test default scaled tile returned the unexpected status 301 (expecting: 200): /osm-intl/11/828/655.png (get a tile in the middle of the ocean, with overzoom) is CRITICAL: Test get a tile in the middle of the ocean, with overzoom returned the unexpected status 301 (expecting: 200): /_info (Untitled test) [17:10:44] Untitled test returned the unexpected status 301 (expecting: 200): /v4/marker/pin-m+ffffff@2x.png (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 301 (expecting: 200): /img/osm-intl,1,0.0,0.0,100x100@1.5x.png (Small scaled map) is CRITICAL: Test Small scaled map returned the unexpected status 301 (expecting: 200): /osm-intl/info.json (tile service info for osm-intl) is CRITICAL: Test tile service i [17:10:44] returned the unexpected status 301 (expecting: 200) [17:11:18] does anyone have the whole 301 output captured somewhere? [17:11:25] I'm not sure if it even is cached [17:11:42] RECOVERY - Maps edge eqsin on upload-lb.eqsin.wikimedia.org is OK: All endpoints are healthy [17:11:51] (I think it shouldn't be) [17:11:55] bblack: https://phabricator.wikimedia.org/P8067 [17:12:00] 10Operations, 10ops-eqiad, 10Thumbor, 10serviceops: thumbor1004 memory errors - https://phabricator.wikimedia.org/T215411 (10RobH) a:05RobH→03jijiki Ok, so the dimm B1 is reporting bad: ` 7 $> ssh root@thumbor1004.mgmt.eqiad.wmnet root@thumbor1004.mgmt.eqiad.wmnet's password: /admin1-> racadm getsel... [17:12:02] RECOVERY - Maps edge ulsfo on upload-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy [17:12:17] ah some recoveries [17:12:20] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [17:12:26] right, it's the HTTPS redirects firing when they shouldn't need to, and those are internally-generated on the frontend and not cached [17:12:32] so there should be no need for purge/ban [17:12:33] bblack: yeah it's recovering [17:12:34] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [17:12:58] re-running puppet again but slower, just in case [17:12:58] getting 200s now [17:13:01] but in theory, it's all fixed now [17:13:13] sorry :-( [17:13:20] <_joe_> bblack: confirmed, it's ok [17:13:21] uploads are working for me now [17:13:22] I could've caught it in review too, I got fooled [17:13:30] <_joe_> akosiaris: another sticker for you wheee \o/ [17:13:39] so what was happening, in slightly more depth: [17:13:44] I should have run a more complete PCC [17:13:47] I would have seen it there [17:13:50] PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [17:13:56] Can confirm fault cleared for me [17:13:57] 1) The patch was missing a bit for the cache_upload case, which caused the wikimedia_nets/wikimedia_trust ACLs to puppetize as empty sets [17:13:58] :) [17:14:27] 2) The caches only trust the "X-Forwarded-For: https" header based on those ACLs, so it was blanking out that header which it got from our local nginx TLS proxy [17:14:45] 3) Therefore it thought HTTPS traffic was not HTTPS, and emitted a 301 redirect to https:// (same URI) [17:15:12] that's pretty much it [17:15:37] ok, pasting in https://etherpad.wikimedia.org/p/2019-02-11-upload-cache-failure [17:15:45] I'll do the 
postmortem after the meeting [17:15:50] ok, thanks! [17:16:02] and sorry, I should've seen that coming when I reviewed :( [17:16:12] sorry as well :-( [17:17:04] kudos, people. You found and fixed this in no time. All's good. [17:17:15] <_joe_> Elitre: let me correct you [17:17:19] <_joe_> we created it [17:17:23] <_joe_> then we fixed it [17:17:41] <_joe_> it's definitely easier to find an issue when it's self-inflicted :P [17:17:46] 10Operations, 10DBA, 10Packaging: db2085/db1106 don't boot with 4.9.0-8-amd64 - https://phabricator.wikimedia.org/T214840 (10Marostegui) [17:20:18] RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [17:20:45] Yes. and no casualties in between. I call this success. [17:23:42] (03PS2) 10Bstorm: toolforge: Install Tesseract from stretch-backports [puppet] - 10https://gerrit.wikimedia.org/r/489396 (https://phabricator.wikimedia.org/T215693) (owner: 10BryanDavis) [17:25:37] (03CR) 10Bstorm: [C: 03+2] toolforge: Install Tesseract from stretch-backports [puppet] - 10https://gerrit.wikimedia.org/r/489396 (https://phabricator.wikimedia.org/T215693) (owner: 10BryanDavis) [17:29:06] RECOVERY - puppet last run on db1073 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [17:29:12] RECOVERY - puppet last run on ms-be1018 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [17:31:54] RECOVERY - puppet last run on wdqs1003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [17:32:50] RECOVERY - puppet last run on analytics1055 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:33:26] RECOVERY - puppet last run on restbase1016 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:35:10] 
RECOVERY - puppet last run on ms-be1032 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:35:36] RECOVERY - puppet last run on phab2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:35:52] (03CR) 10Alex Monk: [C: 03+2] Rename certcentral to acme-chief [software/certcentral] - 10https://gerrit.wikimedia.org/r/489150 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [17:37:47] (03Merged) 10jenkins-bot: Rename certcentral to acme-chief [software/certcentral] - 10https://gerrit.wikimedia.org/r/489150 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [17:38:26] RECOVERY - puppet last run on mw1310 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [17:38:46] RECOVERY - puppet last run on mw1293 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [17:39:32] (03CR) 10jenkins-bot: Rename certcentral to acme-chief [software/certcentral] - 10https://gerrit.wikimedia.org/r/489150 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [17:46:41] 10Operations, 10DBA, 10Packaging: db2085/db1106 don't boot with 4.9.0-8-amd64 - https://phabricator.wikimedia.org/T214840 (10Marostegui) @paravoid gave us some food for thought: ` stuck at "loading ramdisk" is sometimes an indication of misconfigured serial redirection after boot basically when Linux and the... 
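[editor's note] The three-step root cause given at 17:13:57 to 17:14:45 (empty trust ACLs, forwarded-protocol header blanked, 301 to the same URI over https) can be sketched as a toy model. This is only an illustration in Python, not the actual VCL or puppet code: the function names, the example network 10.0.0.0/8 and the header name X-Forwarded-Proto are assumptions made for the sketch (the log itself refers to the header as "X-Forwarded-For: https").

```python
import ipaddress

# Assumed header name for this sketch; the chat log calls it "X-Forwarded-For: https".
PROTO_HEADER = "X-Forwarded-Proto"


def is_trusted(peer_ip, trusted_nets):
    """Return True if peer_ip falls inside any network of the trust ACL."""
    addr = ipaddress.ip_address(peer_ip)
    return any(addr in ipaddress.ip_network(net) for net in trusted_nets)


def frontend_response(peer_ip, headers, uri, trusted_nets,
                      host="upload.wikimedia.org"):
    """Toy model of the cache frontend's HTTPS-redirect decision.

    The forwarded-protocol header is honoured only when the peer (here the
    local TLS proxy) is in the trust ACL; otherwise it is blanked, the
    request looks like plain HTTP, and a 301 to https:// is emitted.
    """
    proto = headers.get(PROTO_HEADER) if is_trusted(peer_ip, trusted_nets) else None
    if proto != "https":
        return 301, "https://" + host + uri
    return 200, None


# Hypothetical peer address and ACL, chosen only for the illustration.
tls_proxy_ip = "10.2.1.5"
request_headers = {PROTO_HEADER: "https"}

# Healthy case: the ACL contains the proxy's network, so the header is trusted.
healthy = frontend_response(tls_proxy_ip, request_headers, "/a.jpg", ["10.0.0.0/8"])

# Outage case: the ACL was generated as an empty set, so every HTTPS request
# is answered with a redirect to the same https:// URI, which loops.
broken = frontend_response(tls_proxy_ip, request_headers, "/a.jpg", [])
```

In this model the client follows the 301 back to the same https:// URL, arrives through the TLS proxy again, and is redirected again, which matches the ERR_TOO_MANY_REDIRECTS reports. Because the redirect is generated on the frontend for each request rather than stored, reverting the ACL change is enough and no purge or ban is needed, as noted at 17:12:26.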
[17:47:44] (03PS1) 10Cmjohnson: Adding mgmt dns logstash101[0-2] [dns] - 10https://gerrit.wikimedia.org/r/489742 (https://phabricator.wikimedia.org/T214608) [17:48:25] (03CR) 10Cmjohnson: [C: 03+2] Adding mgmt dns logstash101[0-2] [dns] - 10https://gerrit.wikimedia.org/r/489742 (https://phabricator.wikimedia.org/T214608) (owner: 10Cmjohnson) [17:50:27] (03CR) 10Muehlenhoff: "@Arturo: The current code would make the classes fail if run on trusty, so this does not affect any of the remaining trusty WMCS servers i" [puppet] - 10https://gerrit.wikimedia.org/r/489625 (owner: 10Muehlenhoff) [17:54:49] 10Operations, 10Analytics, 10Discovery, 10Research: Workflow to be able to move data files computed in jobs from analytics cluster to production - https://phabricator.wikimedia.org/T213976 (10Halfak) I think our biggest models are around 100MB. I don't expect to have a model larger than 1GB any time soon.... [17:58:54] RECOVERY - Disk space on labmon1001 is OK: DISK OK [18:00:05] gehel and onimisionipe: (Dis)respected human, time to deploy Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190211T1800). Please do the needful. [18:00:30] here here [18:04:12] (03CR) 10Volans: add temporary lvs enabled option (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/489628 (https://phabricator.wikimedia.org/T207920) (owner: 10Mathew.onipe) [18:04:54] 10Operations, 10SRE-Access-Requests: Requesting access to deployment, contint-admins, and contint-docker for Brennen Bearnes - https://phabricator.wikimedia.org/T215328 (10greg) APPROVED! (sorry!) 
[18:16:49] (03CR) 10GTirloni: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/489393 (https://phabricator.wikimedia.org/T210818) (owner: 10GTirloni) [18:17:35] 10Operations, 10Core Platform Team, 10MediaWiki-Database, 10Wikimedia-Logstash, and 2 others: MediaWiki errors overloading logstash - https://phabricator.wikimedia.org/T215611 (10jcrespo) There was a small mention on this on the SRE meeting, while there was not exact decision, in general there is 2 separat... [18:17:54] (03PS1) 10Mathew.onipe: use underscore for optional args [cookbooks] - 10https://gerrit.wikimedia.org/r/489751 [18:19:32] (03PS1) 10BryanDavis: Fix graphite metric naming [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/489752 (https://phabricator.wikimedia.org/T107878) [18:19:57] (03PS1) 10Cwhite: hiera: upgrade prometheus-node-exporter to 0.17 in labs [puppet] - 10https://gerrit.wikimedia.org/r/489753 (https://phabricator.wikimedia.org/T213708) [18:20:50] (03CR) 10BryanDavis: [C: 03+2] Fix graphite metric naming [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/489752 (https://phabricator.wikimedia.org/T107878) (owner: 10BryanDavis) [18:21:16] (03PS1) 10Cwhite: hiera: upgrade prometheus-node-exporter to 0.17 in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/489754 (https://phabricator.wikimedia.org/T213708) [18:21:17] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install logstash101[012].eqiad.wmnet - https://phabricator.wikimedia.org/T214608 (10Cmjohnson) i updated the bios versions on all 3 hosts [18:21:38] 10Operations, 10Analytics, 10Discovery, 10Research: Workflow to be able to move data files computed in jobs from analytics cluster to production - https://phabricator.wikimedia.org/T213976 (10diego) > > We do have one very large asset file at 1.9GB (word2vec embedding). I don't need that to be much bigge... 
[18:21:45] (03Merged) 10jenkins-bot: Fix graphite metric naming [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/489752 (https://phabricator.wikimedia.org/T107878) (owner: 10BryanDavis) [18:21:49] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install logstash101[012].eqiad.wmnet - https://phabricator.wikimedia.org/T214608 (10Cmjohnson) [18:22:11] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install logstash101[012].eqiad.wmnet - https://phabricator.wikimedia.org/T214608 (10Cmjohnson) a:05Cmjohnson→03RobH assigning to @robh to do the installations [18:25:20] (03PS1) 10Cwhite: prometheus: upgrade prometheus-node-exporter to latest patchset [puppet] - 10https://gerrit.wikimedia.org/r/489756 (https://phabricator.wikimedia.org/T213708) [18:31:46] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata for esanders - https://phabricator.wikimedia.org/T215830 (10Esanders) [18:38:18] (03PS4) 10Zppix: Lift Account creation cap for Women Activists edit-a-thon at Simmons University [mediawiki-config] - 10https://gerrit.wikimedia.org/r/487876 (https://phabricator.wikimedia.org/T215069) [18:38:57] (03CR) 10jerkins-bot: [V: 04-1] Lift Account creation cap for Women Activists edit-a-thon at Simmons University [mediawiki-config] - 10https://gerrit.wikimedia.org/r/487876 (https://phabricator.wikimedia.org/T215069) (owner: 10Zppix) [18:39:37] (03Abandoned) 10Zppix: Lift Account creation cap for Women Activists edit-a-thon at Simmons University [mediawiki-config] - 10https://gerrit.wikimedia.org/r/487876 (https://phabricator.wikimedia.org/T215069) (owner: 10Zppix) [18:40:18] (03PS1) 10Zppix: Lift account creation cap for edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489761 [18:40:39] (03PS2) 10Zppix: Lift account creation cap for edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489761 [18:41:46] (03CR) 10jerkins-bot: [V: 04-1] Lift account creation cap for edit-a-thon [mediawiki-config] 
- 10https://gerrit.wikimedia.org/r/489761 (owner: 10Zppix) [18:42:44] (03PS3) 10Zppix: Lift account creation cap for edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489761 [18:46:13] jouncebot: reload [18:48:34] (03PS1) 10BryanDavis: Format log messages before passing to Tool.log [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/489762 [18:49:42] (03CR) 10BryanDavis: [C: 03+2] Format log messages before passing to Tool.log [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/489762 (owner: 10BryanDavis) [18:50:10] !log thumbor1004 rebooted and updated firmware [18:50:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:20] !log thumbor1004 rebooted and updated firmware T215411 [18:50:22] (03Merged) 10jenkins-bot: Format log messages before passing to Tool.log [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/489762 (owner: 10BryanDavis) [18:50:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:23] T215411: thumbor1004 memory errors - https://phabricator.wikimedia.org/T215411 [18:52:06] (03PS1) 10BryanDavis: Update d/changelog version number [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/489764 [18:53:42] (03CR) 10BryanDavis: [C: 03+2] Update d/changelog version number [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/489764 (owner: 10BryanDavis) [18:54:17] (03Merged) 10jenkins-bot: Update d/changelog version number [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/489764 (owner: 10BryanDavis) [18:55:29] (03PS1) 10Mathew.onipe: elasticsearch: unassigned shard icinga check [puppet] - 10https://gerrit.wikimedia.org/r/489765 (https://phabricator.wikimedia.org/T212850) [18:55:58] PROBLEM - Maps - OSM synchronization lag - codfw on icinga1001 is CRITICAL: 1.364e+06 ge 2.592e+05 https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=12&fullscreen&orgId=1 [18:57:46] (03PS1) 10Jcrespo: Revert "mariadb: Depool 
db1106 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489766 [18:58:45] 10Operations, 10ops-eqiad, 10Thumbor, 10serviceops: thumbor1004 memory errors - https://phabricator.wikimedia.org/T215411 (10RobH) 05Open→03Resolved Ok, updated firmware to System BIOS Version = 2.6.0 revision date of 28 Jun 2018 cleared the SEL and if it alerts again, we now have history of troub... [18:59:16] 10Operations, 10ops-eqiad, 10Thumbor, 10serviceops: thumbor1004 memory errors - https://phabricator.wikimedia.org/T215411 (10RobH) @jijiki pinged you in irc as well, can you return this system to service? [19:00:04] Deploy window Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190211T1900) [19:00:04] Zoranzoki21, kostajh, and Zppix: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:19] \o/ [19:00:22] (03CR) 10Jcrespo: [C: 03+2] Revert "mariadb: Depool db1106 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489766 (owner: 10Jcrespo) [19:00:23] here [19:01:27] (03Merged) 10jenkins-bot: Revert "mariadb: Depool db1106 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489766 (owner: 10Jcrespo) [19:02:51] I can do the SWAT [19:03:37] (03CR) 10Dzahn: [C: 03+1] "has been ACKed in SRE meeting, waiting for buster install, letting Moritz do the merge" [puppet] - 10https://gerrit.wikimedia.org/r/488606 (https://phabricator.wikimedia.org/T215384) (owner: 10Dzahn) [19:04:01] (03PS4) 10Catrope: Lift account creation cap for edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489761 (https://phabricator.wikimedia.org/T215069) (owner: 10Zppix) [19:04:09] (03PS5) 10Catrope: Lift account creation cap for edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489761 (https://phabricator.wikimedia.org/T215069) (owner: 
10Zppix) [19:04:15] (03CR) 10Catrope: [C: 03+2] Lift account creation cap for edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489761 (https://phabricator.wikimedia.org/T215069) (owner: 10Zppix) [19:05:20] (03Merged) 10jenkins-bot: Lift account creation cap for edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489761 (https://phabricator.wikimedia.org/T215069) (owner: 10Zppix) [19:06:21] (03CR) 10Dzahn: [C: 03+1] "was in SRE meeting and had no objections" [puppet] - 10https://gerrit.wikimedia.org/r/487040 (https://phabricator.wikimedia.org/T214922) (owner: 10Mathew.onipe) [19:07:43] (03PS1) 10Mathew.onipe: maps: update path for postgis script [puppet] - 10https://gerrit.wikimedia.org/r/489769 (https://phabricator.wikimedia.org/T215521) [19:07:48] 10Operations, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Allow Erik Bernhardson to have root access on stat1005 for GPU testing - https://phabricator.wikimedia.org/T215384 (10Dzahn) Approved in SRE meeting (SRE-2019-02-11#Access_Requests) [19:08:04] 10Operations, 10ops-eqiad, 10Thumbor, 10serviceops: thumbor1004 memory errors - https://phabricator.wikimedia.org/T215411 (10jijiki) @RobH Server has been repooled [19:08:20] !log Repooled thumbor1004 - T215411 [19:08:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:23] T215411: thumbor1004 memory errors - https://phabricator.wikimedia.org/T215411 [19:08:29] !log catrope@deploy1001 Synchronized wmf-config/throttle.php: Lift account creation cap for edit-a-thon (T215069) (duration: 00m 47s) [19:08:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:31] T215069: Lift Account creation cap for Women Activists edit-a-thon at Simmons University 2019 (Feb 14) - https://phabricator.wikimedia.org/T215069 [19:08:38] 10Operations, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Allow Erik Bernhardson to have root access on stat1005 for GPU testing - 
https://phabricator.wikimedia.org/T215384 (10Dzahn) Normally would have merged Gerrit change but see comments from Moritz there, he said we should wait until buster... [19:09:13] (03CR) 10jenkins-bot: Revert "mariadb: Depool db1106 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489766 (owner: 10Jcrespo) [19:09:15] (03CR) 10jenkins-bot: Lift account creation cap for edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489761 (https://phabricator.wikimedia.org/T215069) (owner: 10Zppix) [19:09:46] oh, it is next hour already [19:09:53] I went over time [19:09:57] sorry, will deploy later [19:10:40] ping me when mw deploy is finished [19:10:47] Will do [19:11:39] thanks RoanKattouw [19:11:45] kostajh: Your patch is on mwdebug1002, please test [19:12:11] RoanKattouw: looking [19:14:15] (03PS2) 10Jcrespo: mariadb: Pool db1118 with full weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489282 (https://phabricator.wikimedia.org/T214720) [19:15:23] RoanKattouw: looks good [19:16:54] !log catrope@deploy1001 Synchronized php-1.33.0-wmf.16/extensions/GrowthExperiments/: Help panel search instrumentation (T211166) (duration: 00m 47s) [19:16:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:57] T211166: Help panel: instrumentation for search - https://phabricator.wikimedia.org/T211166 [19:17:02] (03CR) 10Dzahn: [C: 04-1] "Resource type not found: Stdlib::Ipaddress" [puppet] - 10https://gerrit.wikimedia.org/r/489347 (owner: 10Dzahn) [19:20:26] (03CR) 10Catrope: [C: 03+2] Set wgRestrictionLevels for all Serbian projects to all groups (autopatrol, patroller, rollbacker, bot, sysop, bureaucrat) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485903 (https://phabricator.wikimedia.org/T215653) (owner: 10Zoranzoki21) [19:20:35] (03PS10) 10Catrope: Set wgRestrictionLevels for all Serbian projects to all groups (autopatrol, patroller, rollbacker, bot, sysop, bureaucrat) [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/485903 (https://phabricator.wikimedia.org/T215653) (owner: 10Zoranzoki21) [19:20:40] (03CR) 10Catrope: Set wgRestrictionLevels for all Serbian projects to all groups (autopatrol, patroller, rollbacker, bot, sysop, bureaucrat) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485903 (https://phabricator.wikimedia.org/T215653) (owner: 10Zoranzoki21) [19:20:44] (03CR) 10Catrope: [C: 03+2] Set wgRestrictionLevels for all Serbian projects to all groups (autopatrol, patroller, rollbacker, bot, sysop, bureaucrat) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485903 (https://phabricator.wikimedia.org/T215653) (owner: 10Zoranzoki21) [19:21:34] jdlrobson: Could you help me test T215709 ? Zoranzoki21 isn't here right now, and I know how to test his other patch but not that one [19:21:35] T215709: MobileGateway: Please turn off main page special casing for lawiki - https://phabricator.wikimedia.org/T215709 [19:21:49] (03Merged) 10jenkins-bot: Set wgRestrictionLevels for all Serbian projects to all groups (autopatrol, patroller, rollbacker, bot, sysop, bureaucrat) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485903 (https://phabricator.wikimedia.org/T215653) (owner: 10Zoranzoki21) [19:22:56] (03PS3) 10Dzahn: tor: add data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/489347 [19:26:08] (03CR) 10Dzahn: [C: 04-1] "port numbers are still strings" [puppet] - 10https://gerrit.wikimedia.org/r/489347 (owner: 10Dzahn) [19:28:27] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Set wgRestrictionLevels on Serbian projects (T215653) (duration: 00m 46s) [19:28:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:30] T215653: Set wgRestrictionLevels for all Serbian projects to all available groups - https://phabricator.wikimedia.org/T215653 [19:28:53] (03PS4) 10Dzahn: tor: add data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/489347 [19:30:03] 
RoanKattouw: on it [19:30:24] RoanKattouw: what should i be testing on? [19:30:43] mwdebug1002 in a minute, lemme put it there [19:30:52] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/14603/torrelay1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/489347 (owner: 10Dzahn) [19:30:56] (03PS3) 10Catrope: Remove main page special casing from lawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489420 (https://phabricator.wikimedia.org/T215709) (owner: 10Zoranzoki21) [19:31:03] (03PS5) 10Dzahn: tor: add data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/489347 [19:31:06] (03CR) 10Catrope: [C: 03+2] Remove main page special casing from lawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489420 (https://phabricator.wikimedia.org/T215709) (owner: 10Zoranzoki21) [19:32:02] (03Merged) 10jenkins-bot: Remove main page special casing from lawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489420 (https://phabricator.wikimedia.org/T215709) (owner: 10Zoranzoki21) [19:32:19] (03CR) 10jenkins-bot: Set wgRestrictionLevels for all Serbian projects to all groups (autopatrol, patroller, rollbacker, bot, sysop, bureaucrat) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485903 (https://phabricator.wikimedia.org/T215653) (owner: 10Zoranzoki21) [19:32:21] (03CR) 10jenkins-bot: Remove main page special casing from lawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489420 (https://phabricator.wikimedia.org/T215709) (owner: 10Zoranzoki21) [19:32:22] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata for esanders - https://phabricator.wikimedia.org/T215830 (10Neil_P._Quinn_WMF) For the record, he needs this access so he can inspect EventLogging data related to the Editing team's products in Hadoop (one of the most important dat... [19:33:05] jdlrobson: OK it's on mwdebug1002 now, sorry for the delay [19:33:33] RoanKattouw: confirmed it's working! 
[19:33:41] Yay! Deploying [19:34:21] (03CR) 10Dzahn: [C: 03+1] "a care where the Gerrit "assignee" feature is useful" [puppet] - 10https://gerrit.wikimedia.org/r/488606 (https://phabricator.wikimedia.org/T215384) (owner: 10Dzahn) [19:34:50] !log catrope@deploy1001 Synchronized dblists/mobilemainpagelegacy.dblist: Remove main page special casing from lawiki (T215709) (duration: 00m 46s) [19:34:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:53] T215709: MobileGateway: Please turn off main page special casing for lawiki - https://phabricator.wikimedia.org/T215709 [19:35:12] (03PS3) 10Dzahn: gerrit: Remove css for iron-icons [puppet] - 10https://gerrit.wikimedia.org/r/489363 (owner: 10Paladox) [19:35:38] (03CR) 10Dzahn: [C: 03+2] gerrit: Remove css for iron-icons [puppet] - 10https://gerrit.wikimedia.org/r/489363 (owner: 10Paladox) [19:35:57] OK, that's SWAT done [19:36:07] jynus: Back to you [19:36:15] (03CR) 10Bstorm: [C: 03+1] "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/489393 (https://phabricator.wikimedia.org/T210818) (owner: 10GTirloni) [19:36:46] (03PS5) 10Paladox: Gerrit: Update icinga check to use healthcheck endpoint [puppet] - 10https://gerrit.wikimedia.org/r/489457 (https://phabricator.wikimedia.org/T215457) [19:36:55] thanks moritzm [19:36:58] * mutante [19:37:28] (03CR) 10MSantos: [C: 03+1] "LGTM." 
[puppet] - 10https://gerrit.wikimedia.org/r/489769 (https://phabricator.wikimedia.org/T215521) (owner: 10Mathew.onipe) [19:37:35] RoanKattouw: thanks [19:37:54] (03CR) 10Jcrespo: [C: 03+2] mariadb: Pool db1118 with full weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489282 (https://phabricator.wikimedia.org/T214720) (owner: 10Jcrespo) [19:38:15] (03PS3) 10Paladox: gerrit: Increase httpd.threads in gerrit config [puppet] - 10https://gerrit.wikimedia.org/r/489475 [19:38:19] (03PS1) 10Cwhite: admin: add nharateh to ldap-only users [puppet] - 10https://gerrit.wikimedia.org/r/489772 (https://phabricator.wikimedia.org/T215574) [19:38:42] paladox: np! applied on cobalt (prod) now [19:38:47] :) [19:38:58] (03Merged) 10jenkins-bot: mariadb: Pool db1118 with full weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489282 (https://phabricator.wikimedia.org/T214720) (owner: 10Jcrespo) [19:42:38] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1106, db1118 with full weight (duration: 00m 46s) [19:42:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:04] (03PS2) 10Bstorm: wiki replicas: Expose ipblocks_restrictions table [puppet] - 10https://gerrit.wikimedia.org/r/489576 (https://phabricator.wikimedia.org/T209819) (owner: 10BryanDavis) [19:43:37] (03CR) 10jenkins-bot: mariadb: Pool db1118 with full weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489282 (https://phabricator.wikimedia.org/T214720) (owner: 10Jcrespo) [19:51:17] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata for esanders - https://phabricator.wikimedia.org/T215830 (10Esanders) cc @marcella for approval [19:52:54] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata for esanders - https://phabricator.wikimedia.org/T215830 (10marcella) Approved. 
[19:53:28] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata for esanders - https://phabricator.wikimedia.org/T215830 (10Nuria) Approved on my end, SRE team will take care of the code changes needed. [19:55:33] (03CR) 10Dzahn: [V: 03+1 C: 03+1] admin: add nharateh to ldap-only users [puppet] - 10https://gerrit.wikimedia.org/r/489772 (https://phabricator.wikimedia.org/T215574) (owner: 10Cwhite) [19:58:17] (03PS1) 10Cwhite: admin: add Petar Petkovic to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/489775 (https://phabricator.wikimedia.org/T215575) [19:59:00] (03CR) 10Dzahn: "Volans/CDanis: Paladox and Tyler installed the new Gerrit health check plugin. It checks several things internally and would return non-20" [puppet] - 10https://gerrit.wikimedia.org/r/489457 (https://phabricator.wikimedia.org/T215457) (owner: 10Paladox) [20:00:08] (03CR) 10CDanis: "> Volans/CDanis: Paladox and Tyler installed the new Gerrit health" [puppet] - 10https://gerrit.wikimedia.org/r/489457 (https://phabricator.wikimedia.org/T215457) (owner: 10Paladox) [20:01:05] (03CR) 10Bstorm: [C: 03+2] wiki replicas: Expose ipblocks_restrictions table [puppet] - 10https://gerrit.wikimedia.org/r/489576 (https://phabricator.wikimedia.org/T209819) (owner: 10BryanDavis) [20:01:07] (03CR) 10Dzahn: admin: add Petar Petkovic to ldap_only_users (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/489775 (https://phabricator.wikimedia.org/T215575) (owner: 10Cwhite) [20:01:09] (03CR) 10Paladox: "This basically does both in one check. 
So if there's highload the timeout on the health check side will hit leading to a 500 error regardl" [puppet] - 10https://gerrit.wikimedia.org/r/489457 (https://phabricator.wikimedia.org/T215457) (owner: 10Paladox) [20:05:41] (03PS6) 10Paladox: Gerrit: Update icinga check to use healthcheck endpoint [puppet] - 10https://gerrit.wikimedia.org/r/489457 (https://phabricator.wikimedia.org/T215457) [20:05:54] (03PS7) 10Paladox: Gerrit: Update icinga check to use healthcheck endpoint [puppet] - 10https://gerrit.wikimedia.org/r/489457 (https://phabricator.wikimedia.org/T215457) [20:06:33] ottomata & elukey: the search team’s NLP contractor (user juliaglen or maybe Julia.glen) has access to the analytics machines, but doesn’t seem to have access to hue.wikimedia.org. Wikitech docs indicate that asking you two might be the next step. Help? Thanks! [20:06:54] (03PS1) 10Cwhite: admin: add Runa Bhattacharjee to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/489776 (https://phabricator.wikimedia.org/T215576) [20:06:56] (03Abandoned) 10Dzahn: convert simplelamp from apache to httpd module [puppet] - 10https://gerrit.wikimedia.org/r/489326 (https://phabricator.wikimedia.org/T215662) (owner: 10Dzahn) [20:07:08] !log ppchelko@deploy1001 Started deploy [changeprop/deploy@bdb4740]: Update dependencies, minor refactor, safer deduplication, T207329 [20:07:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:10] (03Abandoned) 10Dzahn: convert simplelamp from mysql to mariadb [puppet] - 10https://gerrit.wikimedia.org/r/489328 (https://phabricator.wikimedia.org/T215662) (owner: 10Dzahn) [20:07:11] T207329: Clear watchlist on enwiki only removes 50 items at a time - https://phabricator.wikimedia.org/T207329 [20:07:23] (03CR) 10Volans: [C: 04-1] "I think we could have both for now and then once we're happy with this new one consider if having the other one is just redundant." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/489457 (https://phabricator.wikimedia.org/T215457) (owner: 10Paladox) [20:08:44] !log ppchelko@deploy1001 Finished deploy [changeprop/deploy@bdb4740]: Update dependencies, minor refactor, safer deduplication, T207329 (duration: 01m 37s) [20:08:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:59] jouncebot: next [20:08:59] In 0 hour(s) and 51 minute(s): Services – Parsoid / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190211T2100) [20:09:04] Trey314159: username? [20:09:35] (03CR) 10Dzahn: admin: add Runa Bhattacharjee to ldap_only_users (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/489776 (https://phabricator.wikimedia.org/T215576) (owner: 10Cwhite) [20:09:54] ottomata: juliaglen (analytics login) or Julia.glen (gerrit login), depending on how it's been transformed along the way. [20:10:02] OH sorry you said that [20:10:07] no worries! [20:10:41] Trey314159: done! [20:10:46] have her try now [20:10:48] Cool. Thanks! [20:10:48] (03CR) 10Paladox: Gerrit: Update icinga check to use healthcheck endpoint (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/489457 (https://phabricator.wikimedia.org/T215457) (owner: 10Paladox) [20:11:18] there are some spike of lag on enwiki on codfw, but as long as it is only codfw doesn't seem like a big deal [20:16:06] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). 
[20:17:12] (03PS2) 10Gehel: maps: update path for postgis script [puppet] - 10https://gerrit.wikimedia.org/r/489769 (https://phabricator.wikimedia.org/T215521) (owner: 10Mathew.onipe) [20:18:42] (03CR) 10Gehel: [C: 03+2] maps: update path for postgis script [puppet] - 10https://gerrit.wikimedia.org/r/489769 (https://phabricator.wikimedia.org/T215521) (owner: 10Mathew.onipe) [20:19:40] bstorm_: looks like you have an unmerged change on puppet, looks trivial enough, but can you confirm I can merge it? [20:19:48] Do I? [20:20:00] Hi, which is username of Catrope on IRC? [20:20:08] I need some help for T215653 [20:20:09] T215653: Set wgRestrictionLevels for all Serbian projects to all available groups - https://phabricator.wikimedia.org/T215653 [20:20:14] Yes please...It aborted my merge :) [20:20:24] bstorm_: https://gerrit.wikimedia.org/r/c/operations/puppet/+/489576 [20:20:27] For reasons...probably because I waited too long? [20:20:46] bstorm_: no idea, but I'm merging it. Thanks! [20:20:55] Merge these changes? (yes/no)? yes [20:20:55] Aborting merge. [20:21:01] I think I forgot to hit "enter" [20:21:10] Because someone else talked to me [20:21:14] 😁 [20:21:17] never happened to me :) [20:21:20] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. [20:21:25] Yay [20:21:52] 10Operations, 10hardware-requests: eqiad: requesting dual cpu misc host for icinga1001 replacement - https://phabricator.wikimedia.org/T215837 (10RobH) p:05Triage→03Normal [20:22:24] 10Operations, 10hardware-requests: eqiad: requesting dual cpu misc host for icinga1001 replacement - https://phabricator.wikimedia.org/T215837 (10RobH) We'll need management approval on which task to assign WMF7426. Aside comment: @robh will file a task to order more dual cpu spare pool systems. 
[20:26:43] (03PS1) 10Volans: Failover icinga to icinga2001 [dns] - 10https://gerrit.wikimedia.org/r/489777 (https://phabricator.wikimedia.org/T214760) [20:36:29] (03PS1) 10EBernhardson: Promote new wbsearchentities profiles to default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489780 (https://phabricator.wikimedia.org/T214515) [20:36:34] (03PS1) 10Ottomata: Use wgRCFeeds without wgRCEngines for EventBus RCFeed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489781 (https://phabricator.wikimedia.org/T215834) [20:37:32] (03CR) 10jerkins-bot: [V: 04-1] Use wgRCFeeds without wgRCEngines for EventBus RCFeed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489781 (https://phabricator.wikimedia.org/T215834) (owner: 10Ottomata) [20:38:24] (03PS2) 10Ottomata: Use wgRCFeeds without wgRCEngines for EventBus RCFeed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489781 (https://phabricator.wikimedia.org/T215834) [20:49:51] (03CR) 10Ppchelko: Use wgRCFeeds without wgRCEngines for EventBus RCFeed (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489781 (https://phabricator.wikimedia.org/T215834) (owner: 10Ottomata) [20:52:06] (03CR) 10Ottomata: Use wgRCFeeds without wgRCEngines for EventBus RCFeed (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489781 (https://phabricator.wikimedia.org/T215834) (owner: 10Ottomata) [20:53:15] (03CR) 10Ppchelko: [C: 03+1] "Hm.. Indeed.. Ok, makes sense. 
When are you planning to SWAT this, I can be around to help monitor" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489781 (https://phabricator.wikimedia.org/T215834) (owner: 10Ottomata) [20:53:50] (03PS1) 10Volans: icinga: cleanup legacy code [puppet] - 10https://gerrit.wikimedia.org/r/489790 (https://phabricator.wikimedia.org/T214760) [20:53:52] (03PS1) 10Volans: icinga: failover to icinga2001 [puppet] - 10https://gerrit.wikimedia.org/r/489791 (https://phabricator.wikimedia.org/T214760) [20:53:54] (03CR) 10Ottomata: "This should be ready to go anytime." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489781 (https://phabricator.wikimedia.org/T215834) (owner: 10Ottomata) [20:54:07] Pchelolo: can we do it now? [20:54:48] ottomata: shouldn't the config changes go in the swat? [20:54:52] oh yes [20:55:03] sorry, i don't do these kind of deploys often and don't know the proper procedures well [20:55:15] looking for docs on when/how... [20:56:14] Pchelolo: i have 4 mins to get it into the next swat.... [20:56:19] sound ok to you? [20:56:32] OH [20:56:34] no that is not a swat [20:56:42] sorry, just an unrelated scheduled deployment [20:56:45] ...learning [20:58:01] Pchelolo: ok the swat schedules are really bad for me... [20:58:18] there's one in 3 hours, can you babysit it then if i schedule it? [20:58:27] (03PS2) 10Volans: icinga: cleanup legacy code [puppet] - 10https://gerrit.wikimedia.org/r/489790 (https://phabricator.wikimedia.org/T214760) [20:58:28] i can be around if something breaks (you can sms me) [20:58:29] (03PS2) 10Volans: icinga: failover to icinga2001 [puppet] - 10https://gerrit.wikimedia.org/r/489791 (https://phabricator.wikimedia.org/T214760) [20:58:39] but i'm having some friends over around then [20:59:47] (03CR) 10Dzahn: "instead of "check_https_url_at_address_for_minsize" use another one of the check commands defined in the same file. 
there is "for_string" " [puppet] - 10https://gerrit.wikimedia.org/r/489457 (https://phabricator.wikimedia.org/T215457) (owner: 10Paladox) [21:00:04] cscott, arlolra, subbu, bearND, halfak, and Amir1: My dear minions, it's time we take the moon! Just kidding. Time for Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190211T2100). [21:01:25] (03CR) 10Dzahn: ""profile::tcpircbot::ensure:" is also different between the 2 hosts, should be disabled on one of them" [puppet] - 10https://gerrit.wikimedia.org/r/489790 (https://phabricator.wikimedia.org/T214760) (owner: 10Volans) [21:02:59] mutante: yeah saw the diff in the compiler, although I don't like the current structure [21:03:14] one bot is included in icinga.pp if-walled with is_passive [21:03:24] volans: well, definitely +1 to deleting einsteinium.yaml to start with [21:03:24] while this other one is just included directly in the role, meh [21:03:39] I'll keep the tcpircbot hiera key for now [21:04:38] !log smalyshev@deploy1001 Started deploy [wdqs/wdqs@c6a6285]: Weekly GUI deploy [21:04:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:43] yea, right.. 
it's meh [21:05:04] (03CR) 10Ottomata: [C: 03+2] Use wgRCFeeds without wgRCEngines for EventBus RCFeed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489781 (https://phabricator.wikimedia.org/T215834) (owner: 10Ottomata) [21:05:09] but probably a separate change is better [21:06:25] yeah [21:07:32] +1 [21:07:58] (03CR) 10Dzahn: [C: 03+1] "i actually meant to do this merge into the role after einsteinium was gone and then forgot about it, also definitely +1 for deleting einst" [puppet] - 10https://gerrit.wikimedia.org/r/489790 (https://phabricator.wikimedia.org/T214760) (owner: 10Volans) [21:09:15] !log mobrovac@deploy1001 Started deploy [citoid/deploy@0b91bea]: Use Zotero for DOIs and pass it the A-L header - T214766 T210806 T215755 [21:09:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:21] T210806: Decreased internationalisation of automatic citations as a result of switch to new translation-server - https://phabricator.wikimedia.org/T210806 [21:09:21] T214766: DOIs with unscrapeable pages are not merged with crossref / repository metadata - https://phabricator.wikimedia.org/T214766 [21:09:22] T215755: Tamu.edu DOI is not correctly recognized by Citoid - https://phabricator.wikimedia.org/T215755 [21:11:27] (03PS3) 10Volans: icinga: cleanup legacy code [puppet] - 10https://gerrit.wikimedia.org/r/489790 (https://phabricator.wikimedia.org/T214760) [21:11:29] (03PS3) 10Volans: icinga: failover to icinga2001 [puppet] - 10https://gerrit.wikimedia.org/r/489791 (https://phabricator.wikimedia.org/T214760) [21:13:03] !log mobrovac@deploy1001 Finished deploy [citoid/deploy@0b91bea]: Use Zotero for DOIs and pass it the A-L header - T214766 T210806 T215755 (duration: 03m 47s) [21:13:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:18] (03CR) 10Volans: "Compiler agrees it's a noop:" [puppet] - 10https://gerrit.wikimedia.org/r/489790 (https://phabricator.wikimedia.org/T214760) (owner: 10Volans) 
[21:13:38] (03CR) 10BryanDavis: [C: 03+1] "> Reading https://phabricator.wikimedia.org/T209011 I gather that" [puppet] - 10https://gerrit.wikimedia.org/r/488516 (https://phabricator.wikimedia.org/T213475) (owner: 10Alexandros Kosiaris) [21:14:06] (03CR) 10CDanis: [C: 03+1] icinga: cleanup legacy code [puppet] - 10https://gerrit.wikimedia.org/r/489790 (https://phabricator.wikimedia.org/T214760) (owner: 10Volans) [21:14:13] mobrovac: let me know when you're done with deploying [21:14:24] arlolra: {{done}}, all yours [21:14:29] thanks [21:14:39] !log arlolra@deploy1001 Started deploy [parsoid/deploy@4e9b142]: Updating Parsoid to b4b9603 [21:14:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:15:39] (03PS1) 10Paladox: Add LICENSE(Apache V2) to go-import plugin [software/gerrit/plugins/go-import] (stable-2.15) - 10https://gerrit.wikimedia.org/r/489804 [21:15:41] (03PS1) 10Paladox: Merge branch 'stable-2.14' into stable-2.15 [software/gerrit/plugins/go-import] (stable-2.15) - 10https://gerrit.wikimedia.org/r/489805 [21:15:43] (03PS1) 10Paladox: Upgrade bazlets to latest stable-2.14 to build with 2.14.18 API [software/gerrit/plugins/go-import] (stable-2.15) - 10https://gerrit.wikimedia.org/r/489806 [21:15:44] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [21:15:45] (03PS1) 10Paladox: Upgrade bazlets to latest stable-2.15 to build with 2.15.8 API [software/gerrit/plugins/go-import] (stable-2.15) - 10https://gerrit.wikimedia.org/r/489807 [21:15:47] (03PS1) 10Paladox: Merge branch 'stable-2.14' into stable-2.15 [software/gerrit/plugins/go-import] (stable-2.15) - 10https://gerrit.wikimedia.org/r/489808 [21:15:49] (03PS1) 10Paladox: Upgrade bazlets to latest stable-2.15 to build with 2.15.9 API [software/gerrit/plugins/go-import] (stable-2.15) - 
10https://gerrit.wikimedia.org/r/489809 [21:15:51] (03CR) 10jenkins-bot: Use wgRCFeeds without wgRCEngines for EventBus RCFeed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489781 (https://phabricator.wikimedia.org/T215834) (owner: 10Ottomata) [21:15:53] (03PS1) 10Paladox: Update mockito-core to 2.24.0 [software/gerrit/plugins/go-import] (stable-2.15) - 10https://gerrit.wikimedia.org/r/489810 [21:15:55] (03PS1) 10Paladox: Merge branch 'stable-2.14' into stable-2.15 [software/gerrit/plugins/go-import] (stable-2.15) - 10https://gerrit.wikimedia.org/r/489811 [21:15:57] (03PS1) 10Paladox: Upgrade bazlets to latest stable-2.15 to build with 2.15.10 API [software/gerrit/plugins/go-import] (stable-2.15) - 10https://gerrit.wikimedia.org/r/489812 [21:15:57] um [21:15:59] (03PS1) 10Paladox: Merge branch 'stable-2.15' of https://gerrit.googlesource.com/plugins/go-import into stable-2.15 [software/gerrit/plugins/go-import] (stable-2.15) - 10https://gerrit.wikimedia.org/r/489813 [21:16:12] That was meant to be a merge commit [21:16:32] !log smalyshev@deploy1001 Finished deploy [wdqs/wdqs@c6a6285]: Weekly GUI deploy (duration: 11m 54s) [21:16:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:44] (03CR) 10Volans: "Catalog compilation for this one is:" [puppet] - 10https://gerrit.wikimedia.org/r/489791 (https://phabricator.wikimedia.org/T214760) (owner: 10Volans) [21:18:07] (03CR) 10CDanis: [C: 03+1] icinga: failover to icinga2001 [puppet] - 10https://gerrit.wikimedia.org/r/489791 (https://phabricator.wikimedia.org/T214760) (owner: 10Volans) [21:18:22] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [21:19:43] (03PS1) 10Paladox: Merge remote-tracking branch 'upstream/stable-2.15/stable-2.15' into stable-2.15 [software/gerrit/plugins/go-import] (stable-2.15) - 10https://gerrit.wikimedia.org/r/489815 [21:20:30] !log deploying mediawiki-config change for update to EventBus RCFeed config (no-op) [21:20:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:20:33] (03Abandoned) 10Paladox: Merge branch 'stable-2.15' of https://gerrit.googlesource.com/plugins/go-import into stable-2.15 [software/gerrit/plugins/go-import] (stable-2.15) - 10https://gerrit.wikimedia.org/r/489813 (owner: 10Paladox) [21:20:36] (03Abandoned) 10Paladox: Upgrade bazlets to latest stable-2.15 to build with 2.15.10 API [software/gerrit/plugins/go-import] (stable-2.15) - 10https://gerrit.wikimedia.org/r/489812 (owner: 10Paladox) [21:20:40] (03Abandoned) 10Paladox: Merge branch 'stable-2.14' into stable-2.15 [software/gerrit/plugins/go-import] (stable-2.15) - 10https://gerrit.wikimedia.org/r/489811 (owner: 10Paladox) [21:20:43] (03Abandoned) 10Paladox: Update mockito-core to 2.24.0 [software/gerrit/plugins/go-import] (stable-2.15) - 10https://gerrit.wikimedia.org/r/489810 (owner: 10Paladox) [21:20:45] (03Abandoned) 10Paladox: Upgrade bazlets to latest stable-2.14 to build with 2.14.18 API [software/gerrit/plugins/go-import] (stable-2.15) - 10https://gerrit.wikimedia.org/r/489806 (owner: 10Paladox) [21:20:48] (03Abandoned) 10Paladox: Add LICENSE(Apache V2) to go-import plugin [software/gerrit/plugins/go-import] (stable-2.15) - 10https://gerrit.wikimedia.org/r/489804 (owner: 10Paladox) [21:21:02] sorry for spam [21:21:11] (03Abandoned) 10Paladox: Upgrade bazlets to latest stable-2.15 to build with 2.15.9 API [software/gerrit/plugins/go-import] (stable-2.15) - 10https://gerrit.wikimedia.org/r/489809 (owner: 10Paladox) [21:21:15] (03Abandoned) 10Paladox: Merge branch 
'stable-2.14' into stable-2.15 [software/gerrit/plugins/go-import] (stable-2.15) - 10https://gerrit.wikimedia.org/r/489808 (owner: 10Paladox) [21:21:21] (03Abandoned) 10Paladox: Upgrade bazlets to latest stable-2.15 to build with 2.15.8 API [software/gerrit/plugins/go-import] (stable-2.15) - 10https://gerrit.wikimedia.org/r/489807 (owner: 10Paladox) [21:21:25] (03CR) 10Volans: [C: 03+2] icinga: cleanup legacy code [puppet] - 10https://gerrit.wikimedia.org/r/489790 (https://phabricator.wikimedia.org/T214760) (owner: 10Volans) [21:21:27] (03Abandoned) 10Paladox: Merge branch 'stable-2.14' into stable-2.15 [software/gerrit/plugins/go-import] (stable-2.15) - 10https://gerrit.wikimedia.org/r/489805 (owner: 10Paladox) [21:22:17] (03CR) 10Paladox: [V: 03+2 C: 03+2] "Safe to merge, no functional changes, merges upstream 2.15 branch." [software/gerrit/plugins/go-import] (stable-2.15) - 10https://gerrit.wikimedia.org/r/489815 (owner: 10Paladox) [21:22:29] !log otto@deploy1001 Synchronized wmf-config/CommonSettings.php: Use newer RCFeed config for EventBus based recentchange event - T215834 (duration: 00m 47s) [21:22:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:50] T215834: EventBusRCFeedEngine should use FormattedRCFeed instead of RCFeedEngine to use updated configuration - https://phabricator.wikimedia.org/T215834 [21:22:53] Pchelolo: looks good. [21:22:58] (03PS2) 10Paladox: Merge remote-tracking branch 'upstream/stable-2.15/stable-2.15' into stable-2.15 [software/gerrit/plugins/go-import] (stable-2.15) - 10https://gerrit.wikimedia.org/r/489815 [21:23:03] (03CR) 10Paladox: [V: 03+2 C: 03+2] Merge remote-tracking branch 'upstream/stable-2.15/stable-2.15' into stable-2.15 [software/gerrit/plugins/go-import] (stable-2.15) - 10https://gerrit.wikimedia.org/r/489815 (owner: 10Paladox) [21:23:22] ottomata: indeed it does! 
[21:23:36] next [21:23:36] https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/EventBus/+/489786/ [21:23:50] so, iiuc, i can just merge this, it will be deployed in beta, we can verify, and then it will go out with the next train? [21:23:53] correct? [21:24:12] !log arlolra@deploy1001 Finished deploy [parsoid/deploy@4e9b142]: Updating Parsoid to b4b9603 (duration: 09m 33s) [21:24:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:23] although, hm, maybe that one isn't necessary, i can roll that into the multi endpoint patch [21:26:18] (03CR) 10Dzahn: "so i already asked if we need to check for the string "passed" but we should not have to, paladox said it returns non-200 if any one of th" [puppet] - 10https://gerrit.wikimedia.org/r/489457 (https://phabricator.wikimedia.org/T215457) (owner: 10Paladox) [21:27:38] (03PS1) 10Zoranzoki21: Merge branch 'master' of ssh://gerrit.wikimedia.org:29418/operations/mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489818 [21:27:40] (03PS1) 10Zoranzoki21: Add new throttle rule for T215839 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489819 (https://phabricator.wikimedia.org/T215839) [21:28:00] (03Abandoned) 10Zoranzoki21: Merge branch 'master' of ssh://gerrit.wikimedia.org:29418/operations/mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489818 (owner: 10Zoranzoki21) [21:31:27] (03CR) 10Dzahn: [C: 04-1] "it's likely not going to be phab1002 but phab1003 and also not running both in parallel" [puppet] - 10https://gerrit.wikimedia.org/r/437558 (https://phabricator.wikimedia.org/T196019) (owner: 10Dzahn) [21:32:36] (03PS2) 10Dzahn: icinga/parsoid: no monitoring notifications on test servers [puppet] - 10https://gerrit.wikimedia.org/r/487964 (https://phabricator.wikimedia.org/T201366) [21:32:55] !log Updated Parsoid to b4b9603 (T208901, T215537, T213468, T215638) [21:33:02] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:02] T215537: Investigate 500s from batch request failures - https://phabricator.wikimedia.org/T215537 [21:33:03] T215638: List tokens use special-cased "bullets" property instead of stuffing it in attribs like other tokens - https://phabricator.wikimedia.org/T215638 [21:33:03] T208901: TemplateStyles breaks a paragraph if a file is inserted inline - https://phabricator.wikimedia.org/T208901 [21:33:03] T213468: Parsoid section IDs don't correspond to PHP section IDs when headings are transcluded - https://phabricator.wikimedia.org/T213468 [21:33:36] PROBLEM - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 185 bytes in 0.124 second response time [21:34:01] 10Operations, 10Release Pipeline, 10Core Platform Team Backlog (Watching / External), 10Release-Engineering-Team (Watching / External), 10Services (watching): Track and install additional npm packages for all service container images - https://phabricator.wikimedia.org/T205911 (10mobrovac) >>! In T205911... [21:35:16] (03CR) 10Dzahn: [C: 03+2] "only affects parsoid-test host scandium" [puppet] - 10https://gerrit.wikimedia.org/r/487964 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn) [21:36:14] ACKNOWLEDGEMENT - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 185 bytes in 0.124 second response time andrew bogott looking... [21:37:00] (03PS1) 10Paladox: Use anonymous project clone URLs [software/gerrit/plugins/go-import] (stable-2.16) - 10https://gerrit.wikimedia.org/r/489893 [21:38:01] (03CR) 10Paladox: [V: 03+2 C: 03+2] "Already merged in the 2.15 branch." 
[software/gerrit/plugins/go-import] (stable-2.16) - 10https://gerrit.wikimedia.org/r/489893 (owner: 10Paladox) [21:38:54] RECOVERY - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.124 second response time [21:41:22] 10Operations, 10Mail, 10WMF-Legal: Tracking down gary@ and redirecting it to trustandsafety@ - https://phabricator.wikimedia.org/T210464 (10bcampbell) Hey folks. Not sure what to do for this task now that James is gone. Should I make the changes I suggested? [21:43:03] 10Operations, 10ops-eqsin, 10Traffic: Degraded RAID on cp5010 - https://phabricator.wikimedia.org/T214274 (10RobH) a:03ayounsi Ok, support case opened with Dell and a replacement SSD has been dispatched. details below: * Dell case SFDC 21867874 * DPS Tracking for SSD: 91913423457 * EQ SG3 inbound shipmen... [21:49:58] (03PS2) 10Dzahn: testreduce: no require_package for nodejs, avoid dependency cycle [puppet] - 10https://gerrit.wikimedia.org/r/484811 (https://phabricator.wikimedia.org/T201366) [21:50:43] (03PS2) 10Cwhite: admin: add nharateh to ldap-only users [puppet] - 10https://gerrit.wikimedia.org/r/489772 (https://phabricator.wikimedia.org/T215574) [21:50:56] (03CR) 10Cwhite: [C: 03+2] admin: add nharateh to ldap-only users [puppet] - 10https://gerrit.wikimedia.org/r/489772 (https://phabricator.wikimedia.org/T215574) (owner: 10Cwhite) [21:55:25] (03PS1) 10CDanis: ircecho: ensure=>running should not be necessary [puppet] - 10https://gerrit.wikimedia.org/r/489897 [21:57:46] (03CR) 10Volans: [C: 03+1] "LGTM, this should fix puppet on icinga2001 that is already broken since a while" [puppet] - 10https://gerrit.wikimedia.org/r/489897 (owner: 10CDanis) [21:57:57] (03PS2) 10CDanis: ircecho: ensure=>running not necessary/is harmful [puppet] - 10https://gerrit.wikimedia.org/r/489897 [21:58:03] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/14608/scandium.eqiad.wmnet/" [puppet] - 
https://gerrit.wikimedia.org/r/484811 (https://phabricator.wikimedia.org/T201366) (owner: Dzahn) [21:58:09] (CR) CDanis: "https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/14609/console" [puppet] - https://gerrit.wikimedia.org/r/489897 (owner: CDanis) [21:58:35] (CR) CDanis: [C: +2] ircecho: ensure=>running not necessary/is harmful [puppet] - https://gerrit.wikimedia.org/r/489897 (owner: CDanis) [21:58:43] (PS3) CDanis: ircecho: ensure=>running not necessary/is harmful [puppet] - https://gerrit.wikimedia.org/r/489897 [22:00:04] bawolff and Reedy: Dear deployers, time to do the Weekly Security deployment window deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190211T2200). [22:02:17] (PS3) Dzahn: testreduce: no require_package for nodejs, avoid dependency cycle [puppet] - https://gerrit.wikimedia.org/r/484811 (https://phabricator.wikimedia.org/T201366) [22:06:02] Operations, Mail, WMF-Legal: Tracking down gary@ and redirecting it to trustandsafety@ - https://phabricator.wikimedia.org/T210464 (Dzahn) Pinging @jrbs to help answering that question, per IRC. [22:06:34] hey Reedy / bawolff (who doesn't seem to be here?) -- is there anything for the security deploy today? 
[22:06:41] not seeing anything on wikitech [22:06:46] DEPLOY ALL THE THINGS [22:06:56] cdanis: We don't tend to list them on wikitech for obvious reasons :) [22:07:07] lol, sensible [22:07:20] haha [22:08:06] (PS1) Paladox: Initial stable-2.16 fork for wikimedia [software/gerrit] (stable-2.16) - https://gerrit.wikimedia.org/r/489902 [22:08:15] (Abandoned) Paladox: Initial stable-2.16 fork for wikimedia [software/gerrit] (stable-2.16) - https://gerrit.wikimedia.org/r/489902 (owner: Paladox) [22:08:58] (PS1) Paladox: Initial stable-2.16 fork for wikimedia [software/gerrit] (wmf/stable-2.16) - https://gerrit.wikimedia.org/r/489903 [22:09:41] Reedy: it's just that we need to failover icinga and don't want to do that while there are deployments, hence checking ;) [22:09:46] could you ping us when you're done? [22:09:52] I'm not deploying anything [22:09:53] :) [22:10:19] ack, thks ;) [22:17:28] FYI icinga failover coming up in ~20 minutes [22:18:05] (PS2) Paladox: Initial stable-2.16 fork for wikimedia [software/gerrit] (wmf/stable-2.16) - https://gerrit.wikimedia.org/r/489903 [22:20:19] (PS3) Paladox: Initial stable-2.16 fork for wikimedia [software/gerrit] (wmf/stable-2.16) - https://gerrit.wikimedia.org/r/489903 [22:21:25] (PS1) Paladox: Remove left over plugins/reviewers-by-blame [software/gerrit] (wmf/stable-2.15) - https://gerrit.wikimedia.org/r/489904 [22:21:39] (PS2) Paladox: Remove left over plugins/reviewers-by-blame [software/gerrit] (wmf/stable-2.15) - https://gerrit.wikimedia.org/r/489904 [22:21:46] (CR) Paladox: [V: +2 C: +2] Remove left over plugins/reviewers-by-blame [software/gerrit] (wmf/stable-2.15) - https://gerrit.wikimedia.org/r/489904 (owner: Paladox) [22:23:43] (PS4) Paladox: Initial stable-2.16 fork for wikimedia [software/gerrit] (wmf/stable-2.16) - https://gerrit.wikimedia.org/r/489903 [22:34:20] RECOVERY - puppet last run on notebook1003 is OK: OK: Puppet is currently enabled, last 
run 18 minutes ago with 0 failures [22:34:23] !log failing over icinga to icinga2001 [22:34:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:35:19] (CR) CDanis: [C: +2] icinga: failover to icinga2001 [puppet] - https://gerrit.wikimedia.org/r/489791 (https://phabricator.wikimedia.org/T214760) (owner: Volans) [22:35:27] (PS4) CDanis: icinga: failover to icinga2001 [puppet] - https://gerrit.wikimedia.org/r/489791 (https://phabricator.wikimedia.org/T214760) (owner: Volans) [22:37:28] (PS5) Paladox: Initial stable-2.16 fork for wikimedia [software/gerrit] (wmf/stable-2.16) - https://gerrit.wikimedia.org/r/489903 [22:40:35] cdanis: you just run the sync_icinga_state right? [22:40:44] !log icinga1001 now passive T214760 [22:40:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:40:47] T214760: icinga1001 crashed - https://phabricator.wikimedia.org/T214760 [22:41:09] I just attempted to run sync_icinga_state on icinga2001 but it is not working [22:41:22] @ERROR: access denied to icinga-tmpfs from icinga2001.wikimedia.org (2620:0:860:3:208:80:153:74) [22:41:51] did the puppet run remove the ferm rule? [22:42:43] so it would seem [22:43:05] add it manually for now, we can revisit puppet code later [22:43:10] (PS6) Paladox: Initial stable-2.16 fork for wikimedia [software/gerrit] (wmf/stable-2.16) - https://gerrit.wikimedia.org/r/489903 [22:43:22] (PS1) BryanDavis: toolforge: Decommission tools-webgrid-lighttpd-14{01,13,24,26,27,28} [puppet] - https://gerrit.wikimedia.org/r/489907 (https://phabricator.wikimedia.org/T187219) [22:43:39] cdanis: I can still see it [22:43:55] I don't see the tcp dport one anymore [22:44:19] but yeah that shouldn't matter with the accept all icinga1001 anywhere rule there? 
[22:44:58] is 2001 listed as a "partner" in hiera [22:44:59] right is gone [22:45:25] mutante: https://gerrit.wikimedia.org/r/c/operations/puppet/+/489791 [22:45:55] I'm confused as to why running the script on icinga2001 reports access denied *from* 2001 [22:46:07] (CR) Andrew Bogott: [C: +1] toolforge: Decommission tools-webgrid-lighttpd-14{01,13,24,26,27,28} [puppet] - https://gerrit.wikimedia.org/r/489907 (https://phabricator.wikimedia.org/T187219) (owner: BryanDavis) [22:46:32] so the code is like "$partners.each" ..open hole [22:46:54] cdanis: can I try to run the single lines manually? [22:47:23] yeah volans go for it [22:50:20] okay /etc/rsyncd.conf on icinga1001 seems wrong [22:50:30] volans: I am going to fix that in place, okay? [22:50:33] (CR) Bstorm: [C: +2] toolforge: Decommission tools-webgrid-lighttpd-14{01,13,24,26,27,28} [puppet] - https://gerrit.wikimedia.org/r/489907 (https://phabricator.wikimedia.org/T187219) (owner: BryanDavis) [22:50:34] ok [22:50:40] btw we're connecting via ipv6 [22:50:58] i see the iptables rule in both iptables and ip6tables fwiw [22:51:07] destination port rsync [22:51:08] it's not iptables [22:51:12] i looked at strace the connection works [22:51:16] it's the remote rsyncd [22:51:21] ack [22:51:25] rsyncd.conf on icinga1001 had hosts allow = icinga1001 [22:51:33] changes in rsync module that added hosts_allow.. 
[22:51:37] not very long ago [22:51:47] rerunning sync_icinga_state now [22:51:53] ok finished successfully [22:51:54] ack [22:51:58] nice [22:52:02] (CR) CDanis: [C: +2] Failover icinga to icinga2001 [dns] - https://gerrit.wikimedia.org/r/489777 (https://phabricator.wikimedia.org/T214760) (owner: Volans) [22:53:21] !log icinga.w.o-->icinga2001 DNS change deployed T214760 [22:53:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:24] T214760: icinga1001 crashed - https://phabricator.wikimedia.org/T214760 [22:54:26] I can see the new icinga [22:54:31] "new" :D [22:54:40] okay [22:54:48] we need to fix whatever is generating that hosts allow line [22:54:52] and probably disable puppet on icinga1001 [22:54:54] awol might be expected as FR might need to get the DNS update [22:54:59] until we do [22:55:06] i was going to mention the FRT thing [22:55:13] we now run the sync in the other direction cdanis [22:55:18] oh right [22:55:24] FR passive checks require FR people to change the IP they are sending to [22:55:27] okay well -- let me check that the rsyncd.conf on 2001 is correct, then! 
[22:55:32] but 2001 has hosts allow = icinga1001.wikimedia.org [22:55:42] mutante: I think they use the DNS [22:55:48] I see [22:55:51] I don't recall having to ping them on failover [22:56:12] so what needs to happen is that the rsyncd.conf needs to allow hosts from *any* icinga host [22:56:12] but if they don't recover in say ~10 minutes we can ping je.ff [22:56:29] because I don't think we can write an order of the steps properly otherwise [22:56:38] right now we have "Enable and run Puppet on the previously active server to make it passive, check for errors: sudo run-puppet-agent -e "Failover Icinga - $USER"" [22:56:41] right [22:56:46] and *then* "On the previously passive server, run the script to sync Icinga state files: sudo sync_icinga_state" [22:56:56] you can't invert those easily, unless you manually take down icinga on the previously-active server [22:56:57] maybe we can invert those steps? not 100% sure [22:56:58] cwd: ^ fyi [22:57:00] volans: ok, ack [22:57:01] yep [22:57:17] well, the puppet run on the prev-passive server is also what turns off notifications and such [22:57:30] yeah we need that probably before [22:57:38] i don't know enough about icinga internals to understand what is and isn't safe state to copy around [22:57:43] I did write those steps empirically though, while doing a failover [22:57:55] i think it was before the hosts allow line existed? [22:58:21] yea, that is relatively new in rsync module [22:58:27] yeah that was before [22:59:07] i think that got introduced along with "autoferm" [22:59:14] that added automatic ferm when using rsync::module [22:59:35] cdanis: do you know what icinga is doing? 
don't see much on the logs [23:00:16] apart the awol spam ofc ;) [23:00:53] it is using a bunch of CPU and running a bunch of check_nrpes [23:01:03] (hashtag just icinga things) [23:01:15] ehehe [23:01:22] acks and downtimes are there so all seems good [23:01:28] I see the criticals I expect to see [23:01:34] like for cp5010 degraded raid [23:01:35] ok load is going up [23:01:39] probably just rampup time [23:01:44] i pinged cwd about it (fr_tech) [23:02:02] do the passive checks not go to the cname? [23:02:09] just reading scrollback... [23:02:29] cwd: TL;DR do you send the passive checks to icinga.wikimedia.org or an IP? [23:02:36] cwd: i seem to remember there once was a change needed in the sending script when icinga IP changed? [23:02:49] IIRC there was no need of any change on your side on icinga failover [23:03:01] but that might have changed with time... [23:03:08] well we use the dns name, but the firewall will still block by ip [23:03:09] seems it uses the name but nevertheless firewall change needed in fr [23:03:24] cwd: sure but both icinga hosts are allowed IIRC [23:04:00] T211641 [23:04:00] T211641: frack / passive icinga checks: Errors connecting to icinga2001.wikimedia.org - https://phabricator.wikimedia.org/T211641 [23:04:42] checking... [23:07:16] we allow 3 icinga ips [23:07:28] it's still labeled tegmen in here but looks like icinga2001 [23:07:41] then 1001, and einsteinium [23:07:44] 208.80.153.74 is current [23:08:02] and 2620:0:860:3:208:80:153:74 for completeness ;) [23:08:20] remove einsteinium [23:08:24] (later) [23:08:30] ok [23:09:01] does whatever is sending the passive checks need to be restarted to pick up a DNS change? 
[23:09:03] PROBLEM - Maps - OSM synchronization lag - codfw on icinga2001 is CRITICAL: 7.745e+05 ge 2.592e+05 https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=12&fullscreen&orgId=1 [23:09:17] the cname only has a ttl of 5 minutes [23:09:25] cwd: could you check which host does icinga.w.o resolve on the host that sends the passive checks? [23:10:21] volans: 208.80.153.74 [23:10:29] now i think we actually access it by IP anyway [23:10:38] that's correct [23:12:51] jeff said 1 cpu was pegged with icinga [23:12:58] last time it stopped accepting the checks [23:13:15] * cwd considers requesting shell access to icinga server [23:14:16] cwd: I can restart nsca [23:14:18] if that might hel [23:14:20] help [23:14:53] so far that has made it recover, but didn't solve it [23:15:06] cdanis: the client is https://packages.debian.org/stretch/nsca-client [23:15:08] * volans just waiting for puppet to finish to not do them together [23:15:53] oh hey, just saw the icinga-wm message here, nice. [23:15:59] brb 5 minutes [23:16:26] nice nsca fails to restart :( [23:16:31] * volans debugging [23:16:48] Feb 11 23:16:20 icinga2001 nsca[22469]: /etc/init.d/nsca: 428: /lib/lsb/init-functions: Cannot fork [23:17:58] mutante: in the end we didn't get a new host for 2001 right? [23:18:02] volans: it looks like we are reporting to both masters [23:18:06] 1001 and 2001 [23:18:14] it has half the cpu and ram of 1001 :( [23:18:17] PROBLEM - Check systemd state on icinga2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [23:18:34] kill the nsca process and try to start again? [23:18:39] cwd: ack, that's what I would have expected as we run checks from both [23:18:49] um [23:19:22] mutante: already tried [23:19:24] volans: no, it was still in warranty then [23:19:37] ok cool, so earlier i was seeing fail from 1001 but now it's 2001. maybe it is a performance limit of icinga? 
[23:20:02] and we have just added enough machines [23:21:04] right now nsca is down and doesn't want to restart at all :( [23:21:17] there are a ton of nsca processes still running in that cgroup [23:21:28] "cannot fork" must be ulimit right? [23:21:34] this seems vaguely familiar [23:21:38] killall [23:22:13] !log T214760 icinga2001% sudo killall nsca [23:22:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:16] T214760: icinga1001 crashed - https://phabricator.wikimedia.org/T214760 [23:22:42] * cwd doesn't understand where nagios ends and icinga begins [23:22:54] cwd: it's all the same :D [23:23:01] cdanis: 30491 still up? [23:23:15] yes [23:23:16] RECOVERY - Check systemd state on icinga2001 is OK: OK - running: The system is fully operational [23:23:27] cwd: somewhere inside modules/monitoring :p [23:23:27] more are spawning but seem to be getting reaped? [23:23:35] ok so nsca doesn't want to die basically :D [23:23:42] i am not sure why this thing to receive check results needs to be more than one process [23:23:51] i remember having to do this, i had to hard kill it to be able to start it again [23:24:11] probably a bug? [23:24:21] normally it isn't that many [23:24:23] yeah I saw the 'cannot fork' and then I saw what must have been a hundred processes still output by systemctl status [23:24:27] also congrats to systemd that doesn't fail on stop while not stopping it [23:24:51] AWOL are gone now, cwd [23:25:03] the systemd unit file has been autogenerated from the init.d file [23:25:18] cool , sounds fixed [23:25:29] I might have spoken too soon [23:25:48] but yeah seems recovering 223 and counting [23:25:55] yay [23:25:59] 213... 
[23:26:06] give it 5 minutes and should be ok [23:26:49] (CR) Volans: [C: -1] "What I meant is not to check the string itself, but have it in Icinga so that when it fails the operator can check what's in it and what f" [puppet] - https://gerrit.wikimedia.org/r/489457 (https://phabricator.wikimedia.org/T215457) (owner: Paladox) [23:28:33] cdanis, mutante: and now that I think of it... maybe that's the origin of the AWOL [23:29:12] nsca going out to lunch? [23:29:19] seems totally plausible [23:29:24] volans: thanks! [23:29:51] are we the only ones who use nsca? [23:30:20] oh.. duh.. it totally does, yea [23:30:25] yes [23:30:40] on icinga1001 for example now we have 4 processes, maybe when it "forks" for some reason that it shouldn't be it starts acting strangely [23:31:00] only thing that I cannot explain is why the downtimes (direct write to the icinga command file socket) fail too [23:31:07] it seems to spawn a bunch of processes (and then reap them) at different times [23:31:10] I am not sure why [23:31:17] but just watching it I have seen different other pids come and go [23:31:29] unless the multiple nsca processes start sending too much stuff to the command file and icinga gives up, dunno [23:31:38] does it spawn a pid per incoming request? [23:31:47] maybe? 
[23:31:53] and then maybe it gets in a situation where they deadlock --> run out of pids in cgroup --> can't do anything [23:31:54] i see one special case maybe, in the "unknown" category there are 4 kafka checks on host icinga2001 [23:32:00] it spawns something but not sure they should be long-lived [23:32:19] that would explain the awols and that would also explain the situation we just saw where it couldn't fork [23:32:25] and there were a jillion processes sitting around [23:32:38] mutante: I've seen those before and clicking on the grafana link goes to one where the codfw prometheus is selected instead of the eqiad one [23:32:55] not sure if they were unknown on 1001 too though [23:33:12] cdanis: yes but how to explain the downtime issue? [23:33:19] those have happened always at the same time [23:33:27] because nsca basically writes there [23:33:35] it deadlocks trying to get an flock on one of the status files? [23:33:42] maybe all the nsca processes hog up the pipe [23:33:43] is just another "client" that writes to that socket [23:33:47] maybe [23:33:48] i think it's fair to call anything restarting fixes a bug [23:33:56] 💯 cwd [23:34:11] Operations, Analytics, WMF-Legal, Privacy: Honor DNT header for access logs & varnish logs - https://phabricator.wikimedia.org/T98831 (leila) @Gilles in light of https://www.w3.org/2011/tracking-protection/ shall we decline this task? (Apple already announced that they will remove DNT from Safari). [23:35:04] but is it also a function of frack doing an increased amount of passive checks? [23:35:20] Operations, ops-eqiad, monitoring, Patch-For-Review: icinga1001 crashed - https://phabricator.wikimedia.org/T214760 (Volans) Icinga was failovered to `icinga2001`, @Cmjohnson, @RobH we can proceed either to check if the CPU is properly mounted and/or try to get some replacement parts based on cur... [23:35:37] (CR) Paladox: [V: +2 C: +2] "Builds successfully locally!" 
[software/gerrit] (wmf/stable-2.16) - https://gerrit.wikimedia.org/r/489903 (owner: Paladox) [23:35:48] Operations: icinga really needs to check puppet run success of passive icinga hosts - https://phabricator.wikimedia.org/T215848 (CDanis) [23:35:50] i think no, i think the "fork bomb" happened before on the passive server [23:36:39] do you want to test an outgoing SMS? it would just be sending an email to a random contact though [23:36:58] sure [23:37:03] page me if you like :D [23:37:35] Operations, ops-eqiad, monitoring, Patch-For-Review: icinga1001 crashed - https://phabricator.wikimedia.org/T214760 (RobH) Pulled from racadm getsel ` /admin1-> racadm getsel Record: 1 Date/Time: 05/30/2018 17:49:01 Source: system Severity: Ok Description: Log cleared. ----------... [23:38:16] RECOVERY - Long running screen/tmux on an-coord1001 is OK: OK: SCREEN detected but not long running. [23:38:17] (PS1) Paladox: Add zuul-status PolyGerrit plugin [software/gerrit] (wmf/stable-2.16) - https://gerrit.wikimedia.org/r/489912 [23:39:26] cdanis: sent [23:39:30] received! [23:40:07] i used from icinga@icinga2001 .. confirmed [23:40:31] I don't receive pages via AQL(?) if that's what you were trying to test, though [23:40:59] arr, true [23:41:13] well, it still tested that mail servers like icinga2001 [23:41:44] aye [23:43:39] mutante: send it to me [23:43:45] is the same process sending the mails that receives the checks? [23:43:49] but you might need to wait 30m for confirmation [23:43:50] cause the mails don't stop working [23:44:01] my provider is taking ages to relay pages since I've been in the US [23:44:02] (CR) Nuria: superset: add ability for superset to connect to new staging DB (1 comment) [puppet] - https://gerrit.wikimedia.org/r/489338 (https://phabricator.wikimedia.org/T215680) (owner: Nuria) [23:44:19] cwd: what do you mean? 
[23:44:46] volans: sent [23:44:53] i continue to get icinga mail even when the checks stop working, is nsca a different daemon than icinga? [23:44:54] (CR) Paladox: [V: +2 C: +2] "plugin builds locally with bazel build plugins/zuul-status" [software/gerrit] (wmf/stable-2.16) - https://gerrit.wikimedia.org/r/489912 (owner: Paladox) [23:45:09] cwd: yes, different daemon [23:45:11] cwd: no, icinga sends the mails, but it is getting updates pushed to it by this nsca daemon [23:45:22] ok gotcha [23:45:26] and nsca is what stops responding [23:45:54] ye [23:45:56] yep [23:47:00] is passive checks something you would all like to kill if fundraising wasn't using it? [23:47:07] theoretically it's good to use more to reduce load on the icinga server [23:47:22] we have different opinion on this cwd, some of us would like yes ;) [23:47:24] actually AIUI there are some alerts that _should_ be using them but aren't, cwd [23:47:29] thcipriani hi! looks like you get to do tomorrow's train? There is this update to the wmf_deploy branch ready to merge: https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/CentralNotice/+/489309/-1..1 [23:47:33] or at least -- that some SREs feel this way about them haha [23:47:35] I'm not of that opinion though [23:48:00] heh, i see i see... [23:48:18] mutante: that makes sense [23:48:19] you can use it to scale icinga to a "master of masters" who just receives passive results from the other servers doing active checks [23:48:24] thcipriani what would be best procedure to smooth the way for deploy given the current unsolved CN deploy headache? (BTW I updated the task about that as we discussed.) [23:48:26] Operations, ops-eqiad, monitoring, Patch-For-Review: icinga1001 crashed - https://phabricator.wikimedia.org/T214760 (RobH) a:CDanis >>! In T214760#4945774, @Volans wrote: > Icinga was failovered to `icinga2001`, @Cmjohnson, @RobH we can proceed either to check if the CPU is properly mounted a... 
[23:48:54] mutante: there's also the case of ... I forget which flavor of HW RAID check has a command that takes a very long time to execute [23:49:21] yea, the megacli or the other one [23:50:02] volans: did you get paged? [23:50:49] AndyRussG: that's a good question! I created https://gerrit.wikimedia.org/r/c/mediawiki/tools/release/+/489906 to change branching like we discussed at all-hands. Currently if you merge that change, wmf.16 will also be bumped, so you may want to propose it for a SWAT deploy if that's possible this evening. Then when the branch happens: it'll already be live. [23:50:53] mutante: have to wait 30~35 min for my provider [23:51:31] volans: ok [23:52:45] i need to move to a different place ..inside where it's warm and there is food [23:53:02] (PS1) CDanis: icinga: fix manual sync procedure during failovers [puppet] - https://gerrit.wikimedia.org/r/489914 [23:53:06] I'm working on a patch for the sync issue we encountered [23:54:30] AndyRussG: I realized "that change" was ambiguous in what I typed above. "That change" in "currently if you merge that change" refers to the change you've made to CentralNotice. [23:54:47] thcipriani: all good, got it! [23:54:58] cool :) [23:54:58] thcipriani: thanks much btw!! hmmm though it's a few more changes than normally are recommended for a SWAT... What about instead making a patch to revert on the wmf.16 branch, and putting that on the SWAT? Then the change would go out to the groups on wmf.17 like normal train changes [23:55:34] and the SWAT change itself would be a no-op [23:55:55] volans: are you around for long enough to review https://gerrit.wikimedia.org/r/489914 ? 
no worries if not [23:56:04] (CR) CDanis: "https://puppet-compiler.wmflabs.org/compiler1002/14610/" [puppet] - https://gerrit.wikimedia.org/r/489914 (owner: CDanis) [23:56:07] dunno if that would work [23:56:14] sure I am cdanis [23:56:25] in any case it's not a huge number of changes, so SWAT is probably totally fine [23:57:37] thcipriani k I'll just do as you recommended ^ and put the update on the evening SWAT in a minute :) [23:57:55] volans: just to understand, "icinga1001 crashed" is probably a different issue than the nsca proc hanging which does not cause the machine to become unresponsive, is that right? [23:58:16] cwd: yes [23:58:18] cwd: yeah -- we just saw nsca go out to lunch (and need to be restarted manually) on icinga2001 [23:58:26] AndyRussG: so you're saying you'd want to merge to wmf_deploy, then SWAT a revert for MediaWiki core wmf.16 which would be a noop? Then move forward with the plan to start creating wmf branches in CentralNotice so that it rides the train? [23:58:29] cwd: that is T196336 [23:58:29] T196336: Icinga passive checks go awal and downtime stops working - https://phabricator.wikimedia.org/T196336 [23:58:42] gotcha, thanks [23:59:21] thcipriani: hmm that's what I meant, but just reviewing again the number of changes that are getting merged into the deploy branch, it's really very little [23:59:35] so probably just putting on the upcoming SWAT is simplest and ok [23:59:47] AndyRussG: ah, great, that's simpler for me to think about :) [23:59:53] K heheh me too [23:59:57] I'll SWAT.