[00:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: How many deployers does it take to do Evening SWAT (Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190305T0000).
[00:00:04] RoanKattouw: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:01:51] I'm the only one in this window so I'll do the SWAT
[00:02:13] (PS1) Andrew Bogott: boostrapvz: update buster manifest and install on build hosts [puppet] - https://gerrit.wikimedia.org/r/494379 (https://phabricator.wikimedia.org/T216781)
[00:03:22] (PS3) Catrope: Enable and configure ORES goodfaith and damaging rcfilters on kowiki [mediawiki-config] - https://gerrit.wikimedia.org/r/494301 (https://phabricator.wikimedia.org/T161628) (owner: Sbisson)
[00:03:24] (CR) Catrope: [C: +2] Enable and configure ORES goodfaith and damaging rcfilters on kowiki [mediawiki-config] - https://gerrit.wikimedia.org/r/494301 (https://phabricator.wikimedia.org/T161628) (owner: Sbisson)
[00:05:00] (Merged) jenkins-bot: Enable and configure ORES goodfaith and damaging rcfilters on kowiki [mediawiki-config] - https://gerrit.wikimedia.org/r/494301 (https://phabricator.wikimedia.org/T161628) (owner: Sbisson)
[00:12:42] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable ORES on kowiki (T161628) (duration: 00m 49s)
[00:12:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:12:45] T161628: Deploy ORES RC Filters in Korean Wikipedia - https://phabricator.wikimedia.org/T161628
[00:14:29] (CR) jenkins-bot: Enable and configure ORES goodfaith and damaging rcfilters on kowiki [mediawiki-config] - https://gerrit.wikimedia.org/r/494301 (https://phabricator.wikimedia.org/T161628) (owner: Sbisson)
[00:15:04] (PS2) Catrope: Reapply "Enable and configure the ORES goodfaith model on itwiki"" [mediawiki-config] - https://gerrit.wikimedia.org/r/494306
[00:15:30] (PS3) Catrope: Reapply "Enable and configure the ORES goodfaith model on itwiki"" [mediawiki-config] - https://gerrit.wikimedia.org/r/494306 (https://phabricator.wikimedia.org/T211032)
[00:15:38] (CR) Catrope: [C: +2] Reapply "Enable and configure the ORES goodfaith model on itwiki"" [mediawiki-config] - https://gerrit.wikimedia.org/r/494306 (https://phabricator.wikimedia.org/T211032) (owner: Catrope)
[00:16:44] (Merged) jenkins-bot: Reapply "Enable and configure the ORES goodfaith model on itwiki"" [mediawiki-config] - https://gerrit.wikimedia.org/r/494306 (https://phabricator.wikimedia.org/T211032) (owner: Catrope)
[00:17:31] !log catrope@deploy1001 Synchronized php-1.33.0-wmf.19/extensions/GrowthExperiments/includes/HelpPanel.php: Exclude help panel from main page (T215664) (duration: 00m 48s)
[00:17:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:17:33] T215664: Help panel: show in additional contexts - https://phabricator.wikimedia.org/T215664
[00:26:02] (CR) jenkins-bot: Reapply "Enable and configure the ORES goodfaith model on itwiki"" [mediawiki-config] - https://gerrit.wikimedia.org/r/494306 (https://phabricator.wikimedia.org/T211032) (owner: Catrope)
[00:36:29] Operations, ops-eqiad, Cognate, Growth-Team, and 6 others: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960 (ayounsi)
[00:40:09] Operations, ops-eqiad, Cognate, Growth-Team, and 6 others: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960 (ayounsi)
[00:40:58] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable ORES goodfaith on itwiki (T211032) (duration: 00m 47s)
[00:41:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:41:01] T211032: Enable ORES filters on RC for Italian Wikipedia - https://phabricator.wikimedia.org/T211032
[00:44:51] !log catrope@deploy1001 Synchronized php-1.33.0-wmf.19/extensions/WikimediaEvents/: Redact title/create params and drop page_title in EditorJourney schema (T213974) (duration: 00m 49s)
[00:44:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:44:54] T213974: EditorJourney records HTML tags in page_title field - https://phabricator.wikimedia.org/T213974
[00:46:11] !log disable unused ports of restbase1016 on asw-a
[00:46:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:48:05] Operations, ops-eqiad, Cognate, Growth-Team, and 6 others: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960 (ayounsi)
[00:50:19] Operations, Traffic, netops: IPv6 ~20ms higher ping than IPv4 to gerrit - https://phabricator.wikimedia.org/T211079 (ayounsi)
[00:50:21] Operations, netops, Performance-Team (Radar): Stop prioritizing peering over transit - https://phabricator.wikimedia.org/T204281 (ayounsi) Open→Resolved Everything here is done. Will reopen if any signs of issues down the road.
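Editor's note: the `!log ... Synchronized <path>: <message> (duration: NNm NNs)` entries above follow a regular shape, so they can be pulled apart mechanically. The following is a minimal sketch, assuming one message per input line and exactly the entry shape seen in this log; `SyncEntry` and `parse_sync` are hypothetical names introduced for illustration, not part of any deployment tooling.

```python
import re
from typing import NamedTuple, Optional


class SyncEntry(NamedTuple):
    """One 'Synchronized' entry (hypothetical structure, for illustration only)."""
    user: str
    host: str
    path: str
    message: str
    seconds: int


# Assumption: the entry shape is inferred from this log alone,
# not from any formal specification of the logging bot.
SYNC_RE = re.compile(
    r"!log (?P<user>[^@\s]+)@(?P<host>\S+) Synchronized (?P<path>\S+): "
    r"(?P<message>.*) \(duration: (?P<min>\d+)m (?P<sec>\d+)s\)"
)


def parse_sync(line: str) -> Optional[SyncEntry]:
    """Return the parsed sync entry, or None if the line is not a sync entry."""
    m = SYNC_RE.search(line)
    if m is None:
        return None
    seconds = int(m["min"]) * 60 + int(m["sec"])
    return SyncEntry(m["user"], m["host"], m["path"], m["message"], seconds)
```

Feeding every line of the log through `parse_sync` and dropping the `None` results would give a list of deploys with their durations; other `!log` entries, such as the manual port change at 00:46:11, are skipped because they lack the `Synchronized` keyword.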
[01:13:38] !log changing password for "Force de Mots" and "שרית חייט" [01:13:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:15:39] !log catrope@deploy1001 Synchronized php-1.33.0-wmf.19/includes/api/ApiBase.php: Logging live patch to debug T217615 (duration: 00m 49s) [01:15:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:15:44] T217615: Invalid operand type was used: Invalid type used as key in ApiBase.php - https://phabricator.wikimedia.org/T217615 [01:18:14] !log catrope@deploy1001 Synchronized php-1.33.0-wmf.19/includes/api/ApiBase.php: Logging live patch to debug T217615 (duration: 00m 47s) [01:18:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:21:40] !log catrope@deploy1001 Synchronized php-1.33.0-wmf.19/includes/api/ApiBase.php: Logging live patch to debug T217615 (duration: 00m 47s) [01:21:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:21:49] T217615: Invalid operand type was used: Invalid type used as key in ApiBase.php - https://phabricator.wikimedia.org/T217615 [01:32:59] !log catrope@deploy1001 Synchronized php-1.33.0-wmf.19/includes/api/ApiBase.php: Logging live patch to debug T217615 (duration: 00m 47s) [01:33:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:33:05] T217615: Invalid operand type was used: Invalid type used as key in ApiBase.php - https://phabricator.wikimedia.org/T217615 [01:40:24] PROBLEM - puppet last run on ms-be1021 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [01:57:34] PROBLEM - Disk space on stat1004 is CRITICAL: DISK CRITICAL - /mnt/hdfs is not accessible: Transport endpoint is not connected [02:05:55] !log catrope@deploy1001 Synchronized php-1.33.0-wmf.19/includes/api/ApiBase.php: Logging live patch to debug T217615 (duration: 00m 47s) [02:06:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:06:05] T217615: Invalid operand type was used: Invalid type used as key in ApiBase.php - https://phabricator.wikimedia.org/T217615 [02:06:28] RECOVERY - puppet last run on ms-be1021 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [02:12:58] PROBLEM - HHVM rendering on mw1233 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:14:02] RECOVERY - HHVM rendering on mw1233 is OK: HTTP OK: HTTP/1.1 200 OK - 74857 bytes in 0.283 second response time [02:21:56] !log catrope@deploy1001 Synchronized php-1.33.0-wmf.19/includes/api/ApiBase.php: Hot fix for T217615 (duration: 00m 47s) [02:22:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:22:11] T217615: Invalid operand type was used: Invalid type used as key in ApiBase.php - https://phabricator.wikimedia.org/T217615 [02:24:44] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - api-https_443: Servers mw1226.eqiad.wmnet, mw1227.eqiad.wmnet, mw1232.eqiad.wmnet, mw1221.eqiad.wmnet, mw1340.eqiad.wmnet, mw1315.eqiad.wmnet, mw1225.eqiad.wmnet, mw1317.eqiad.wmnet, mw1223.eqiad.wmnet are marked down but pooled [02:25:58] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy [02:27:03] !log catrope@deploy1001 Synchronized php-1.33.0-wmf.19/includes/api/ApiBase.php: Revert hot fix (duration: 00m 46s) [02:27:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:00:05] kart_: Your horoscope predicts another unfortunate deploy. 
May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190305T0300). [03:03:44] !log Started manual run of unpublished ContentTranslation draft purge script (T217310) [03:03:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:03:47] T217310: Run unpublished draft purge script for CX (Week of 03/03) - https://phabricator.wikimedia.org/T217310 [03:05:00] !log catrope@deploy1001 Synchronized php-1.33.0-wmf.19/includes/api/ApiBase.php: Handle TitleBlacklist errors correctly (T217382) (duration: 00m 49s) [03:05:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:05:16] T217382: APIEditPage -> ApiBase->checkTitleUserPermissions PHP Warning: Invalid operand type was used: Invalid type used as key - https://phabricator.wikimedia.org/T217382 [04:24:12] RECOVERY - Check systemd state on mw2151 is OK: OK - running: The system is fully operational [04:25:42] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:27:50] PROBLEM - Check systemd state on mw2151 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[04:36:42] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:45:16] (PS2) Andrew Bogott: boostrapvz: update buster manifest and install on build hosts [puppet] - https://gerrit.wikimedia.org/r/494379 (https://phabricator.wikimedia.org/T216781)
[04:48:18] (CR) Andrew Bogott: [C: +2] boostrapvz: update buster manifest and install on build hosts [puppet] - https://gerrit.wikimedia.org/r/494379 (https://phabricator.wikimedia.org/T216781) (owner: Andrew Bogott)
[05:56:26] (PS1) Marostegui: dbproxy1010: Depool labsdb1011 [puppet] - https://gerrit.wikimedia.org/r/494408 (https://phabricator.wikimedia.org/T215231)
[06:01:52] (CR) Marostegui: [C: +2] dbproxy1010: Depool labsdb1011 [puppet] - https://gerrit.wikimedia.org/r/494408 (https://phabricator.wikimedia.org/T215231) (owner: Marostegui)
[06:05:46] (PS1) Marostegui: db-eqiad.php: Depool db1091 [mediawiki-config] - https://gerrit.wikimedia.org/r/494409
[06:07:59] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1091 [mediawiki-config] - https://gerrit.wikimedia.org/r/494409 (owner: Marostegui)
[06:09:05] (Merged) jenkins-bot: db-eqiad.php: Depool db1091 [mediawiki-config] - https://gerrit.wikimedia.org/r/494409 (owner: Marostegui)
[06:10:14] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1091 (duration: 00m 51s)
[06:10:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:13:36] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production).
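Editor's note: the monitoring bot's `PROBLEM` and `RECOVERY` messages come in pairs keyed by check name and host, so the length of each alert can be estimated from the timestamps. The sketch below assumes one message per input line, time-of-day timestamps within a single day (no midnight wrap), and the `<check> on <host> is <state>` wording seen in this log; `outage_durations` is a hypothetical helper, not part of any monitoring tool.

```python
import re
from datetime import datetime, timedelta
from typing import Dict, Iterable, List, Tuple

# Assumption: message wording is inferred from this log only.
ALERT_RE = re.compile(
    r"\[(?P<ts>\d{2}:\d{2}:\d{2})\] (?P<kind>PROBLEM|RECOVERY) - "
    r"(?P<check>.+?) on (?P<host>\S+) is "
)


def outage_durations(lines: Iterable[str]) -> List[Tuple[str, str, timedelta]]:
    """Pair each PROBLEM with the next RECOVERY for the same (check, host)."""
    open_alerts: Dict[Tuple[str, str], datetime] = {}
    results: List[Tuple[str, str, timedelta]] = []
    for line in lines:
        m = ALERT_RE.match(line)
        if m is None:
            continue  # not an alert message
        key = (m["check"], m["host"])
        ts = datetime.strptime(m["ts"], "%H:%M:%S")
        if m["kind"] == "PROBLEM":
            # Keep the earliest PROBLEM if the alert fires repeatedly.
            open_alerts.setdefault(key, ts)
        elif key in open_alerts:
            results.append((key[0], key[1], ts - open_alerts.pop(key)))
        # A RECOVERY with no matching PROBLEM in the input is ignored.
    return results
```

Applied to the cr2-esams messages at 04:25:42 and 04:36:42, this reports an 11-minute alert for the `Router interfaces` check. Alerts that never recover within the log window simply do not appear in the result.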
[06:13:50] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494409 (owner: 10Marostegui) [06:16:02] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. [06:17:16] !log Reload haproxy on dbproxy1010 to depool labsdb1011 [06:17:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:18:44] !log Stop MySQL on dbstore2001 to upgrade MySQL [06:18:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:22:41] (03CR) 10Marostegui: "This change was left without merging - I merged it" [puppet] - 10https://gerrit.wikimedia.org/r/494379 (https://phabricator.wikimedia.org/T216781) (owner: 10Andrew Bogott) [06:28:54] PROBLEM - netbox HTTPS on netmon1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 547 bytes in 0.129 second response time [06:29:58] PROBLEM - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [06:33:04] PROBLEM - puppet last run on cp2010 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. 
Failed resources (up to 3 shown) [06:37:32] RECOVERY - netbox HTTPS on netmon1002 is OK: HTTP OK: HTTP/1.1 302 Found - 348 bytes in 0.561 second response time [06:38:32] RECOVERY - Check systemd state on netmon1002 is OK: OK - running: The system is fully operational [06:41:40] !log Finished manual run of unpublished ContentTranslation draft purge script (T217310) [06:41:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:41:45] T217310: Run unpublished draft purge script for CX (Week of 03/03) - https://phabricator.wikimedia.org/T217310 [06:43:53] !log Stop MySQL on db2035 (s2 codfw master) to upgrade MySQL [06:43:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:45:07] 10Operations, 10ops-eqiad, 10Cognate, 10Growth-Team, and 6 others: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960 (10akosiaris) [06:45:28] 10Operations, 10ops-eqiad, 10Cognate, 10Growth-Team, and 6 others: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960 (10akosiaris) [06:47:42] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1091" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494410 [06:49:45] (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad.php: Depool db1091" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494410 (owner: 10Marostegui) [06:50:51] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1091" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494410 (owner: 10Marostegui) [06:51:50] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1091 (duration: 00m 48s) [06:51:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:53:56] 10Operations, 10ops-eqiad, 10Cognate, 10Growth-Team, and 6 others: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960 (10akosiaris) I was looking at Special needs or unsorted. 
@ayounsi I 've updated a few, feel free to move them to other sections. Pinging: * ge-2/0/... [06:55:30] !log Defragment echo_event tables on dbstore1005:3320 T217591 [06:55:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:55:33] T217591: Defragment echo_event tables on x1 - https://phabricator.wikimedia.org/T217591 [06:56:53] !log Reboot labsdb1012 [06:56:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:59:08] RECOVERY - puppet last run on cp2010 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [07:00:11] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1091" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494410 (owner: 10Marostegui) [07:00:23] 10Operations, 10monitoring: google safe browsing icinga checks sporadic UNKNOWN due to 403 - https://phabricator.wikimedia.org/T216985 (10Dzahn) [07:01:52] 10Operations, 10monitoring: google safe browsing icinga checks sporadic UNKNOWN due to 403 - https://phabricator.wikimedia.org/T216985 (10Dzahn) Tried to identify when exactly we added this. I knew it was many years ago and then found T30898. 
[07:08:54] !log Start transferring data from labsdb1011 to labsdb1012 - T215231 [07:08:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:08:57] T215231: rack/setup/install labsdb1012.eqiad.wmnet - https://phabricator.wikimedia.org/T215231 [07:11:17] (03PS1) 10Marostegui: db-eqiad.php: Depool db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494413 [07:15:49] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494413 (owner: 10Marostegui) [07:16:48] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494413 (owner: 10Marostegui) [07:18:02] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1084 (duration: 00m 47s) [07:18:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:18:24] !log Stop MySQL on db1095 (backups host) to upgrade MySQL [07:18:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:24:30] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494413 (owner: 10Marostegui) [07:25:24] RECOVERY - Disk space on stat1004 is OK: DISK OK [07:39:57] (03CR) 10Alexandros Kosiaris: [C: 04-2] "Hm in that case, it's probably better to follow the docs and per https://salsa.debian.org/pbuilder-team/pbuilder/commit/a60fed7f9f773368c1" [puppet] - 10https://gerrit.wikimedia.org/r/494155 (owner: 10BryanDavis) [07:47:39] !log Upgrade MySQL on db1084 [07:47:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:56:03] (03PS1) 10Alexandros Kosiaris: package_builder: Add docs for BUILD_HOME [puppet] - 10https://gerrit.wikimedia.org/r/494419 [07:56:47] (03CR) 10Alexandros Kosiaris: [C: 04-2] "Added some docs in https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/494419 to document the configuration required" [puppet] - 10https://gerrit.wikimedia.org/r/494155 
(owner: 10BryanDavis) [08:10:57] (03PS1) 10Marostegui: db-eqiad.php: Slowly repool db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494421 [08:12:01] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Slowly repool db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494421 (owner: 10Marostegui) [08:12:59] (03Merged) 10jenkins-bot: db-eqiad.php: Slowly repool db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494421 (owner: 10Marostegui) [08:13:17] (03CR) 10jenkins-bot: db-eqiad.php: Slowly repool db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494421 (owner: 10Marostegui) [08:14:01] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly repool db1084 (duration: 00m 49s) [08:14:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:19] (03CR) 10Dzahn: [C: 03+2] "comments only" [puppet] - 10https://gerrit.wikimedia.org/r/494258 (owner: 10Dzahn) [08:15:34] (03PS2) 10Dzahn: xhgui: fix class name in comments [puppet] - 10https://gerrit.wikimedia.org/r/494258 [08:23:54] !log jiji@cumin1001 conftool action : set/pooled=yes; selector: dc=eqiad,service=citoid,cluster=scb,name=kubernetes100.* [08:23:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:52] !log T213194 bump percentage of citoid requests reaching eqiad kubernetes cluster to 9% [08:24:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:55] T213194: Migrate citoid to kubernetes - https://phabricator.wikimedia.org/T213194 [08:25:38] (03CR) 10Jforrester: "> Patch Set 3:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482100 (https://phabricator.wikimedia.org/T212865) (owner: 10Jforrester) [08:27:24] (03PS1) 10Dzahn: xhgui: require php-mongodb package [puppet] - 10https://gerrit.wikimedia.org/r/494422 (https://phabricator.wikimedia.org/T180761) [08:29:44] (03PS1) 10Marostegui: db-eqiad.php: Repool db1084 into API [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/494423 [08:31:14] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Repool db1084 into API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494423 (owner: 10Marostegui) [08:32:11] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1084 into API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494423 (owner: 10Marostegui) [08:32:58] !log Optimize echo_event table on x1 codfw master (db2034) this will generate lag on x1 codfw - T217591 [08:33:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:01] T217591: Defragment echo_event tables on x1 - https://phabricator.wikimedia.org/T217591 [08:33:17] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1084 in API (duration: 00m 48s) [08:33:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:49] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1084 into API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494423 (owner: 10Marostegui) [08:37:28] (03PS1) 10Marostegui: db-eqiad.php: Fully repool db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494424 [08:39:36] (03PS2) 10Dzahn: xhgui: require php-mongodb package [puppet] - 10https://gerrit.wikimedia.org/r/494422 (https://phabricator.wikimedia.org/T180761) [08:39:38] (03PS1) 10Dzahn: xhgui: setup git cloning and apache site [puppet] - 10https://gerrit.wikimedia.org/r/494425 (https://phabricator.wikimedia.org/T180761) [08:40:53] (03CR) 10Muehlenhoff: [C: 03+1] Add system timer for running ganeti->netbox sync. 
[puppet] - 10https://gerrit.wikimedia.org/r/493774 (https://phabricator.wikimedia.org/T215229) (owner: 10CRusnov) [08:41:05] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Fully repool db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494424 (owner: 10Marostegui) [08:42:04] (03Merged) 10jenkins-bot: db-eqiad.php: Fully repool db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494424 (owner: 10Marostegui) [08:43:09] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Fully repool db1084 (duration: 00m 48s) [08:43:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:11] (03PS1) 10Marostegui: db-eqiad.php: Depool db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494426 [08:45:08] (03CR) 10Dzahn: "> No? It's been blocked by SRE for over a month and I've heard no updates that say it's safe to continue with this stack yet. :-(" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482100 (https://phabricator.wikimedia.org/T212865) (owner: 10Jforrester) [08:45:18] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494426 (owner: 10Marostegui) [08:46:26] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494426 (owner: 10Marostegui) [08:47:05] (03CR) 10jenkins-bot: db-eqiad.php: Fully repool db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494424 (owner: 10Marostegui) [08:47:07] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494426 (owner: 10Marostegui) [08:47:33] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1081 (duration: 00m 47s) [08:47:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:46] (03PS6) 10Dzahn: gerrit: Disable jgit gc [puppet] - 10https://gerrit.wikimedia.org/r/493963 (https://phabricator.wikimedia.org/T217497) (owner: 10Paladox) 
[08:52:10] (03CR) 10Dzahn: [C: 03+2] "out of caution to prevent possible data loss" [puppet] - 10https://gerrit.wikimedia.org/r/493963 (https://phabricator.wikimedia.org/T217497) (owner: 10Paladox) [08:53:27] (03PS6) 10Sau226: Restore bureaucrat rights on hi.wiktionary to default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492447 (https://phabricator.wikimedia.org/T214765) [08:54:41] Thanks mutante [08:58:09] !log restarting gerrit to pickup change 493963 - disable jgit gc [08:58:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:17] 10Operations, 10ops-eqiad, 10Cognate, 10Growth-Team, and 6 others: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960 (10fgiunchedi) [09:01:25] (03PS3) 10Gilles: Oversample navtiming on ruwiki and eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493055 (https://phabricator.wikimedia.org/T187299) [09:02:32] PROBLEM - puppet last run on vega is CRITICAL: CRITICAL: Puppet has 8 failures. Last run 3 minutes ago with 8 failures. Failed resources (up to 3 shown): Exec[git_pull_wikimedia/annualreport],Exec[git_pull_wikimedia/TransparencyReport],Exec[git_pull_wikimedia/TransparencyReport-private],Exec[git_pull_wikibase/wikiba.se-deploy] [09:02:40] (03CR) 10Gilles: [C: 03+2] Oversample navtiming on ruwiki and eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493055 (https://phabricator.wikimedia.org/T187299) (owner: 10Gilles) [09:04:13] (03Merged) 10jenkins-bot: Oversample navtiming on ruwiki and eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493055 (https://phabricator.wikimedia.org/T187299) (owner: 10Gilles) [09:04:26] (03CR) 10jenkins-bot: Oversample navtiming on ruwiki and eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493055 (https://phabricator.wikimedia.org/T187299) (owner: 10Gilles) [09:06:03] (03CR) 10Gilles: [C: 03+2] "Sigh.. I forgot that I improved the format since, and that's pending '20." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/493055 (https://phabricator.wikimedia.org/T187299) (owner: 10Gilles) [09:06:10] (03PS1) 10Gilles: Revert "Oversample navtiming on ruwiki and eswiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494447 [09:06:40] PROBLEM - puppet last run on eventlog1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [09:06:44] 10Operations, 10Gerrit, 10Release-Engineering-Team, 10Patch-For-Review: Disable jgit gc on gerrit - https://phabricator.wikimedia.org/T217497 (10Dzahn) deployed and service restarted. should be disabled now. [09:06:59] 10Operations, 10Gerrit, 10Release-Engineering-Team, 10Patch-For-Review: Disable jgit gc on gerrit - https://phabricator.wikimedia.org/T217497 (10Dzahn) 05Open→03Resolved a:03Dzahn [09:07:17] the eventlog1002 issue above is caused by the gerrit restart and will self-heal [09:07:36] runs puppet there and on vega anyways [09:11:33] (03CR) 10Gilles: [C: 03+2] "No, wait, we're fine, I just need to backport the format improvement :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493055 (https://phabricator.wikimedia.org/T187299) (owner: 10Gilles) [09:11:35] (03Abandoned) 10Gilles: Revert "Oversample navtiming on ruwiki and eswiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494447 (owner: 10Gilles) [09:11:48] RECOVERY - puppet last run on eventlog1002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:12:52] RECOVERY - puppet last run on vega is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:16:09] !log kibana refresh field list [09:16:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:36] 10Operations, 10MediaWiki-Cache, 10MW-1.33-notes (1.33.0-wmf.18; 2019-02-19), 10Patch-For-Review, and 3 others: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) 
leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10elukey) Some TKOs happened also at around 3:33 UT... [09:19:44] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1081" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494451 [09:21:10] (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad.php: Depool db1081" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494451 (owner: 10Marostegui) [09:22:35] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1081" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494451 (owner: 10Marostegui) [09:23:32] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1081 (duration: 00m 47s) [09:23:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:04] (03PS1) 10Marostegui: db-eqiad.php: Depool db1103 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494452 [09:26:19] !log lvs100[456]: reboot for L1TF kernel/microcode updates T203011 [09:26:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:34] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1103 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494452 (owner: 10Marostegui) [09:28:33] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1103 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494452 (owner: 10Marostegui) [09:29:45] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1103:3312 and db1103:3314 for mysql upgrade (duration: 00m 47s) [09:29:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:56] !log Stop MySQL on db1103:3312 and db1103:3314 for MySQL upgrade [09:31:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:34] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1081" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494451 (owner: 10Marostegui) [09:33:36] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1103 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/494452 (owner: 10Marostegui) [09:45:05] 10Operations, 10Analytics, 10vm-requests, 10User-Elukey: Replace analytics-tool1003 ganeti VM with another VM with Buster - https://phabricator.wikimedia.org/T217640 (10elukey) p:05Triage→03Normal [09:45:47] 10Operations, 10Analytics, 10vm-requests, 10User-Elukey: Replace analytics-tool1003 ganeti VM with another VM with Buster - https://phabricator.wikimedia.org/T217640 (10elukey) ` elukey@ganeti1003:~$ sudo gnt-group list Group Nodes Instances AllocPolicy NDParams row_A 4 33 preferred ovs=False... [09:52:14] (03PS1) 10Elukey: Allocate A/AAAA/PTR records for analytics-tool1004 [dns] - 10https://gerrit.wikimedia.org/r/494455 [09:52:34] (03CR) 10jerkins-bot: [V: 04-1] Allocate A/AAAA/PTR records for analytics-tool1004 [dns] - 10https://gerrit.wikimedia.org/r/494455 (owner: 10Elukey) [09:55:26] nice! --^ [09:56:24] (03PS2) 10Elukey: Allocate A/AAAA/PTR records for analytics-tool1004 [dns] - 10https://gerrit.wikimedia.org/r/494455 [09:56:43] (03PS1) 10Marostegui: db-eqiad.php: Slowly repool db1103 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494456 [09:56:49] !log lvs200[456]: upgrade linux to 4.9.144-3.1, reboot for L1TF kernel/microcode updates T203011 [09:56:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:43] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Slowly repool db1103 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494456 (owner: 10Marostegui) [09:58:51] (03Merged) 10jenkins-bot: db-eqiad.php: Slowly repool db1103 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494456 (owner: 10Marostegui) [10:00:14] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly repool db1103:3312 and db1103:3314 after mysql upgrade (duration: 00m 50s) [10:00:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:37] !log gilles@deploy1001 Synchronized php-1.33.0-wmf.19/extensions/NavigationTiming: T187299 
Backport wiki oversampling config syntax change (duration: 00m 48s) [10:07:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:40] T187299: User-perceived page load performance study - https://phabricator.wikimedia.org/T187299 [10:08:30] (03CR) 10jenkins-bot: db-eqiad.php: Slowly repool db1103 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494456 (owner: 10Marostegui) [10:10:50] PROBLEM - Disk space on prometheus2003 is CRITICAL: DISK CRITICAL - free space: /srv/prometheus/global 0 MB (0% inode=98%) [10:10:50] !log gilles@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T187299 Oversample navtiming on ruwiki and eswiki (duration: 00m 47s) [10:10:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:51] (03CR) 10Jforrester: "> Patch Set 3:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482100 (https://phabricator.wikimedia.org/T212865) (owner: 10Jforrester) [10:14:51] prometheus2003 is me btw [10:15:04] (03CR) 10Gehel: [C: 04-1] "We're missing some monitoring." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/487129 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [10:15:27] (03PS1) 10Marostegui: db-eqiad.php: Give more traffic to db1103 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494457 [10:16:50] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Give more traffic to db1103 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494457 (owner: 10Marostegui) [10:17:49] (03Merged) 10jenkins-bot: db-eqiad.php: Give more traffic to db1103 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494457 (owner: 10Marostegui) [10:18:55] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: More traffic to db1103:3312 and db1103:3314 after mysql upgrade (duration: 00m 47s) [10:18:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:24] 10Operations, 10Analytics, 10vm-requests, 10Patch-For-Review, 10User-Elukey: Replace analytics-tool1003 ganeti VM with another VM with Buster - https://phabricator.wikimedia.org/T217640 (10elukey) @akosiaris IIRC there is a bridge + interface for the Analytics VLAN on the Ganeti host, that takes care of... [10:19:58] (03CR) 10jenkins-bot: db-eqiad.php: Give more traffic to db1103 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494457 (owner: 10Marostegui) [10:22:41] 10Operations, 10Analytics, 10vm-requests, 10Patch-For-Review, 10User-Elukey: Replace analytics-tool1003 ganeti VM with another VM with Buster - https://phabricator.wikimedia.org/T217640 (10akosiaris) >>! In T217640#5000872, @elukey wrote: > @akosiaris IIRC there is a bridge + interface for the Analytics... 
[10:24:30] !log Ramp up citoid traffic from k8s to 25% - T213194 [10:24:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:34] T213194: Migrate citoid to kubernetes - https://phabricator.wikimedia.org/T213194 [10:24:44] (03PS1) 10Jbond: Add pattirck to ldap_only group [puppet] - 10https://gerrit.wikimedia.org/r/494459 [10:25:44] Hmm, anyone exactly sure what "cp1087 pass, cp3041 hit/9, cp3040 hit/7" means? It hit 2 layers of cache? O_o ? [10:27:08] (03PS1) 10Elukey: ganeti: add the Analytics VLAN use case to makevm [puppet] - 10https://gerrit.wikimedia.org/r/494461 (https://phabricator.wikimedia.org/T217640) [10:27:10] (03CR) 10Jbond: [C: 03+2] Add pattirck to ldap_only group [puppet] - 10https://gerrit.wikimedia.org/r/494459 (owner: 10Jbond) [10:27:18] !log jiji@cumin1001 conftool action : set/weight=4; selector: dc=eqiad,service=citoid,cluster=scb,name=kubernetes.* [10:27:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:55] addshore: no no IIRC it is the state of the last hit to the cp3041's backend, in this case you should have hit the frontend [10:28:02] addshore: https://wikitech.wikimedia.org/wiki/Varnish#X-Cache [10:29:36] ack, yes, it all makes more sense now that I have sorted the requests by sequence... [10:30:10] going from "cp1087 pass, cp3041 hit/1, cp3040 miss" to "cp1087 pass, cp3041 hit/2, cp3040 miss", is there any reason the response size should change? O_o [10:33:28] addshore: I can't think of any. Can you elaborate?
[10:33:32] (03PS1) 10Marostegui: db-eqiad.php: More traffic to db1103 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494462 [10:34:25] ema: I was doing a little investigating for https://phabricator.wikimedia.org/T216006 where apparently the ends of some responses were missing [10:34:38] I had a quick look in the webrequest data for the requests [10:35:09] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: More traffic to db1103 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494462 (owner: 10Marostegui) [10:35:56] ema: this was the cache hit data and response sizes, which look odd to me https://phabricator.wikimedia.org/P8154 [10:36:05] But wanted to check if I was reading it right [10:36:14] (03Merged) 10jenkins-bot: db-eqiad.php: More traffic to db1103 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494462 (owner: 10Marostegui) [10:37:00] 10Operations, 10Analytics, 10vm-requests, 10Patch-For-Review, 10User-Elukey: Replace analytics-tool1003 ganeti VM with another VM with Buster - https://phabricator.wikimedia.org/T217640 (10elukey) @akosiaris nope I don't feel adventurous today :D I added a change to makevm to support this use case, not s... 
[10:37:21] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: More traffic to db1103:3312 and db1103:3314 after mysql upgrade (duration: 00m 48s) [10:37:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:29] (03CR) 10Elukey: [C: 03+2] Allocate A/AAAA/PTR records for analytics-tool1004 [dns] - 10https://gerrit.wikimedia.org/r/494455 (owner: 10Elukey) [10:38:33] (03PS16) 10Gehel: Add support for elasticsearch 6 [puppet] - 10https://gerrit.wikimedia.org/r/493234 (https://phabricator.wikimedia.org/T217196) (owner: 10DCausse) [10:39:11] (03CR) 10Alexandros Kosiaris: [C: 03+2] ganeti: add the Analytics VLAN use case to makevm [puppet] - 10https://gerrit.wikimedia.org/r/494461 (https://phabricator.wikimedia.org/T217640) (owner: 10Elukey) [10:39:14] (03CR) 10Alexandros Kosiaris: [C: 03+2] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/494461 (https://phabricator.wikimedia.org/T217640) (owner: 10Elukey) [10:39:17] (03CR) 10jerkins-bot: [V: 04-1] Add support for elasticsearch 6 [puppet] - 10https://gerrit.wikimedia.org/r/493234 (https://phabricator.wikimedia.org/T217196) (owner: 10DCausse) [10:39:22] (03PS2) 10Alexandros Kosiaris: ganeti: add the Analytics VLAN use case to makevm [puppet] - 10https://gerrit.wikimedia.org/r/494461 (https://phabricator.wikimedia.org/T217640) (owner: 10Elukey) [10:39:46] (03PS1) 10Volans: icinga: set Reply-To header to email notifications [puppet] - 10https://gerrit.wikimedia.org/r/494464 [10:42:04] addshore: alright, please feel free to add me to the bug once/if you've identified the caching layer as a potential source of troubles [10:42:10] (03PS3) 10Elukey: ganeti: add the Analytics VLAN use case to makevm [puppet] - 10https://gerrit.wikimedia.org/r/494461 (https://phabricator.wikimedia.org/T217640) [10:42:53] (03CR) 10Elukey: "Added only an extra "\n\n" like the other echos (after the first )" [puppet] - 10https://gerrit.wikimedia.org/r/494461 
(https://phabricator.wikimedia.org/T217640) (owner: 10Elukey) [10:42:53] ema: ack, I will write a comment up in the ticket after finishing my current call, looks like it is caching layer, can't think of what else it would be [10:43:05] (03CR) 10Elukey: [C: 03+2] ganeti: add the Analytics VLAN use case to makevm [puppet] - 10https://gerrit.wikimedia.org/r/494461 (https://phabricator.wikimedia.org/T217640) (owner: 10Elukey) [10:43:15] (03CR) 10jenkins-bot: db-eqiad.php: More traffic to db1103 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494462 (owner: 10Marostegui) [10:44:34] kart_: ready for your swat deploy later today? ;) [10:44:51] (03PS17) 10Gehel: Add support for elasticsearch 6 [puppet] - 10https://gerrit.wikimedia.org/r/493234 (https://phabricator.wikimedia.org/T217196) (owner: 10DCausse) [10:45:30] (03CR) 10jerkins-bot: [V: 04-1] Add support for elasticsearch 6 [puppet] - 10https://gerrit.wikimedia.org/r/493234 (https://phabricator.wikimedia.org/T217196) (owner: 10DCausse) [10:45:37] 10Operations, 10Traffic, 10Wikimedia-Planet, 10HTTPS: enable HSTS on *.planet.wikimedia.org - https://phabricator.wikimedia.org/T132543 (10Dzahn) [10:46:48] 10Operations, 10Wikimedia-General-or-Unknown, 10Wikimedia-Planet: https://planet.wikimedia.org goes to civiCRM instance. 
- https://phabricator.wikimedia.org/T41678 (10Dzahn) [10:47:20] (03CR) 10MarcoAurelio: [C: 03+1] Restore bureaucrat rights on hi.wiktionary to default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492447 (https://phabricator.wikimedia.org/T214765) (owner: 10Sau226) [10:47:58] (03CR) 10Gehel: "PCC agrees that this is a noop: https://puppet-compiler.wmflabs.org/compiler1002/14965/" [puppet] - 10https://gerrit.wikimedia.org/r/493234 (https://phabricator.wikimedia.org/T217196) (owner: 10DCausse) [10:48:09] 10Operations, 10Analytics, 10vm-requests, 10Patch-For-Review, 10User-Elukey: Replace analytics-tool1003 ganeti VM with another VM with Buster - https://phabricator.wikimedia.org/T217640 (10elukey) Worked nicely! ` elukey@ganeti1003:~$ sudo makevm This is an interactive script to make it easier to create... [10:51:55] 10Operations, 10ops-eqiad, 10Cognate, 10Growth-Team, and 6 others: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960 (10hashar) contint1001 hosts the CI system subscribing @thcipriani as well It is not clear to me what this operation is about. Is that just about re... 
[10:52:34] 10Operations, 10Traffic, 10Wikimedia-Planet, 10HTTPS: https://planet.wikimedia.org redirects to http://meta.wikimedia.org/wiki/Planet_Wikimedia - https://phabricator.wikimedia.org/T70554 (10Dzahn) [10:53:49] 10Operations, 10Wikimedia-Planet: puppetize: planet.wikimedia.org - https://phabricator.wikimedia.org/T80359 (10Dzahn) [10:54:12] (03PS1) 10Marostegui: db-eqiad.php: Fully repool db1103 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494465 [10:55:07] !log gilles@deploy1001 Synchronized php-1.33.0-wmf.19/extensions/NavigationTiming/NavigationTiming.config.php: T187299 Fix wiki oversampling config validation (duration: 00m 48s) [10:55:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:10] T187299: User-perceived page load performance study - https://phabricator.wikimedia.org/T187299 [10:55:30] 10Operations, 10Wikimedia-Planet: broken / outdated blog feeds - https://phabricator.wikimedia.org/T80305 (10Dzahn) [10:55:35] 10Operations, 10Wikimedia-Planet: broken / outdated blog feeds - https://phabricator.wikimedia.org/T80305 (10Dzahn) [10:55:39] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Fully repool db1103 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494465 (owner: 10Marostegui) [10:56:29] 10Operations, 10Wikimedia-Planet: upgrade planet from 1.x-nightly to 2.0-stable - https://phabricator.wikimedia.org/T80518 (10Dzahn) [10:56:32] 10Operations, 10Wikimedia-Planet: upgrade planet from 1.x-nightly to 2.0-stable - https://phabricator.wikimedia.org/T80518 (10Dzahn) [10:56:44] (03Merged) 10jenkins-bot: db-eqiad.php: Fully repool db1103 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494465 (owner: 10Marostegui) [10:57:07] 10Operations, 10Wikidata, 10Wikidata-Campsite: Wikidata sometimes cuts off entity RDF - https://phabricator.wikimedia.org/T216006 (10Addshore) I did a little investigation into this this morning. Personally I couldn't reproduce. 
I couldn't find any indication of mediawiki/wikibase doing anything wrong in lo... [10:57:23] ema: cced you @ https://phabricator.wikimedia.org/T216006#5000980 and wrote a little comment [10:57:55] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Fully repool db1103:3312 and db1103:3314 after mysql upgrade (duration: 00m 47s) [10:57:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:58] addshore: thanks! [10:58:08] !log lvs4007/lvs5003: upgrade linux to 4.9.144-3.1, reboot for L1TF kernel/microcode updates T203011 [10:58:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:11] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "We had uploaded more packages than just sury's ones to thirdparty/php72. Be sure to add them as well before merging." [puppet] - 10https://gerrit.wikimedia.org/r/494212 (https://phabricator.wikimedia.org/T216712) (owner: 10Muehlenhoff) [10:58:58] 10Operations, 10Wikidata, 10Wikidata-Campsite, 10User-Addshore: Wikidata sometimes cuts off entity RDF - https://phabricator.wikimedia.org/T216006 (10Addshore) [11:00:58] (03PS1) 10Elukey: Add analytics-tool1004 to site.pp and dhcp [puppet] - 10https://gerrit.wikimedia.org/r/494468 (https://phabricator.wikimedia.org/T217640) [11:01:42] moritzm: --^ - is it good for the buster install or should I keep stretch for the moment? [11:04:36] is that new superset host we discussed yesterday? then buster is fine [11:05:01] yep yep! 
[11:06:47] (03CR) 10jenkins-bot: db-eqiad.php: Fully repool db1103 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494465 (owner: 10Marostegui) [11:08:44] (03CR) 10Elukey: [C: 03+2] Add analytics-tool1004 to site.pp and dhcp [puppet] - 10https://gerrit.wikimedia.org/r/494468 (https://phabricator.wikimedia.org/T217640) (owner: 10Elukey) [11:08:56] elukey: cool, I'll look into the base install later the day and will ping you when done [11:09:24] moritzm: super, merging the change and then leave the vm for you [11:09:56] ack! [11:10:47] RECOVERY - Disk space on prometheus2003 is OK: DISK OK [11:12:01] (03PS3) 10Jcrespo: mariadb: Change the default arguments for buster [puppet] - 10https://gerrit.wikimedia.org/r/494236 (https://phabricator.wikimedia.org/T161296) [11:12:03] (03PS1) 10Jcrespo: install_server: Reimage db1114 to buster [puppet] - 10https://gerrit.wikimedia.org/r/494469 (https://phabricator.wikimedia.org/T161296) [11:12:23] (03CR) 10Ema: Add analytics-tool1004 to site.pp and dhcp (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/494468 (https://phabricator.wikimedia.org/T217640) (owner: 10Elukey) [11:16:53] (03CR) 10Marostegui: [C: 03+1] "\o/" [puppet] - 10https://gerrit.wikimedia.org/r/494469 (https://phabricator.wikimedia.org/T161296) (owner: 10Jcrespo) [11:20:32] (03PS4) 10Jcrespo: mariadb: Change the default arguments for buster [puppet] - 10https://gerrit.wikimedia.org/r/494236 (https://phabricator.wikimedia.org/T161296) [11:20:34] (03PS2) 10Jcrespo: install_server: Reimage db1114 to buster [puppet] - 10https://gerrit.wikimedia.org/r/494469 (https://phabricator.wikimedia.org/T161296) [11:22:07] <_joe_> !log uploading new scap packages , T217611 [11:22:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:10] T217611: Deploy scap 3.9.2-1 - https://phabricator.wikimedia.org/T217611 [11:24:15] (03PS18) 10Mathew.onipe: Add support for elasticsearch 6 [puppet] - 10https://gerrit.wikimedia.org/r/493234 
(https://phabricator.wikimedia.org/T217196) (owner: 10DCausse) [11:27:02] (03PS3) 10Jcrespo: install_server: Reimage db1114 to buster [puppet] - 10https://gerrit.wikimedia.org/r/494469 (https://phabricator.wikimedia.org/T161296) [11:27:56] (03CR) 10Alexandros Kosiaris: [C: 03+2] toolforge: Rewrite envelope From headers when relaying [puppet] - 10https://gerrit.wikimedia.org/r/494291 (https://phabricator.wikimedia.org/T213416) (owner: 10BryanDavis) [11:28:03] (03PS2) 10Alexandros Kosiaris: toolforge: Rewrite envelope From headers when relaying [puppet] - 10https://gerrit.wikimedia.org/r/494291 (https://phabricator.wikimedia.org/T213416) (owner: 10BryanDavis) [11:28:05] (03CR) 10Mathew.onipe: [C: 03+1] Add support for elasticsearch 6 [puppet] - 10https://gerrit.wikimedia.org/r/493234 (https://phabricator.wikimedia.org/T217196) (owner: 10DCausse) [11:28:07] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] toolforge: Rewrite envelope From headers when relaying [puppet] - 10https://gerrit.wikimedia.org/r/494291 (https://phabricator.wikimedia.org/T213416) (owner: 10BryanDavis) [11:29:29] 10Operations, 10ops-requests, 10Wikimedia-Planet: Russian Planet Wikimedia not updating - https://phabricator.wikimedia.org/T81279 (10Dzahn) [11:29:34] 10Operations: wmf-auto-restart occasionally errors on fuse mounts - https://phabricator.wikimedia.org/T217646 (10jbond) [11:29:43] 10Operations: wmf-auto-restart occasionally errors on fuse mounts - https://phabricator.wikimedia.org/T217646 (10jbond) p:05Triage→03Normal [11:31:13] 10Operations: wmf-auto-restart occasionally errors on fuse mounts - https://phabricator.wikimedia.org/T217646 (10MoritzMuehlenhoff) is that reproducible? Otherwise this might be caused by the stability issues we see with hdfs/fuse in general. 
[11:31:38] 10Operations, 10Analytics, 10hardware-requests: GPU upgrade for stat1005 - https://phabricator.wikimedia.org/T216226 (10elukey) @RobH, @EBernhardson - while we wait for a response from AMD, I'd also like to understand if T216528 gave us more info about the possibility of ordering a GPU like RX vega 64 via re... [11:35:16] (03CR) 10Filippo Giunchedi: [C: 04-1] "See inline" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/492390 (owner: 10Herron) [11:39:22] zeljkof: Let's see :) [11:39:28] (03PS12) 10Filippo Giunchedi: rsyslog: change udp_localhost_compat to define, add mwlog_compat [puppet] - 10https://gerrit.wikimedia.org/r/492390 (https://phabricator.wikimedia.org/T126989) (owner: 10Herron) [11:40:41] 10Operations, 10ops-eqiad, 10Cognate, 10Growth-Team, and 6 others: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960 (10elukey) About Analytics nodes: * ge-1/0/7 - kafka-jumbo1001 -> Kafka needs to be stopped ~10/15 minutes beforehand to have a graceful shutdown (... [11:40:57] kart_: I'm here to help, no worries :) [11:41:17] <_joe_> zeljkof: I want to update scap on the deployment servers, let me know when you're done with SWAT [11:42:05] _joe_: it starts in 20 minutes, do you want to do it before swat? or after? 
[11:42:37] <_joe_> zeljkof: if that's ok with you, it's a simple bugfix by tyler [11:42:41] <_joe_> I'll install it now then [11:42:56] _joe_: go ahead, we'll test it during swat then :) [11:43:18] <_joe_> !log installing new swat version on deployment servers, T217611 [11:43:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:21] T217611: Deploy scap 3.9.2-1 - https://phabricator.wikimedia.org/T217611 [11:44:38] <_joe_> zeljkof: I'm doing my usual noop deployment [11:44:52] !log oblivian@deploy1001 Synchronized README: Test deploy for new scap version (duration: 00m 48s) [11:44:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:38] <_joe_> ok you're gtg [11:45:50] <_joe_> !log installing new scap version in codfw [11:45:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:47] 10Operations, 10ops-eqiad, 10Cognate, 10Growth-Team, and 6 others: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960 (10elukey) ` ge-6/0/25 - mc1019 ge-6/0/26 - mc1020 ge-6/0/27 - mc1021 ge-6/0/28 - mc1022 ge-6/0/29 - mc1023 ` The above ones are holding the eqiad m... [11:48:15] (03PS1) 10Mathew.onipe: elasticsearch: move nagios check to profile [puppet] - 10https://gerrit.wikimedia.org/r/494471 (https://phabricator.wikimedia.org/T214921) [11:48:28] zeljkof: cool. Let's break things! 
:) [11:49:07] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: move nagios check to profile [puppet] - 10https://gerrit.wikimedia.org/r/494471 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [11:49:14] (03PS11) 10KartikMistry: Enable edittag for ExternalGuidance in CX and VE [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493155 (https://phabricator.wikimedia.org/T216123) [11:50:31] (03PS1) 10Dzahn: icinga: add notes URLs to various monitoring checks, part 1 [puppet] - 10https://gerrit.wikimedia.org/r/494472 (https://phabricator.wikimedia.org/T197873) [11:51:28] (03CR) 10jerkins-bot: [V: 04-1] icinga: add notes URLs to various monitoring checks, part 1 [puppet] - 10https://gerrit.wikimedia.org/r/494472 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [11:52:17] !log lvs400[56]: upgrade linux to 4.9.144-3.1, reboot for L1TF kernel/microcode updates T203011 [11:52:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:37] thanks _joe_ [11:54:47] (03PS2) 10Dzahn: icinga: add notes URLs to various monitoring checks, part 1 [puppet] - 10https://gerrit.wikimedia.org/r/494472 (https://phabricator.wikimedia.org/T197873) [11:55:35] (03CR) 10jerkins-bot: [V: 04-1] icinga: add notes URLs to various monitoring checks, part 1 [puppet] - 10https://gerrit.wikimedia.org/r/494472 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [11:57:02] (03PS4) 10Jcrespo: install_server: Reimage db1114 to buster [puppet] - 10https://gerrit.wikimedia.org/r/494469 (https://phabricator.wikimedia.org/T161296) [11:57:06] 10Operations: wmf-auto-restart occasionally errors on fuse mounts - https://phabricator.wikimedia.org/T217646 (10jbond) This is reproducible but not reliably, some file operation taking part on fuse e.g. ls -la /mnt/hdfs/tmp seem to cause lsof to fail. its is almost certainly to do with hdfs fuse stability iss... [11:57:24] zeljkof: where is documentation about deploy on canary servers for config code? 
https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment#Change_wiki_configuration doesn't say about it.. [11:57:46] kart_: you should follow this page https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers [11:59:35] <_joe_> !log upgrading scap everywhere to 3.9.2-1, T217611 [11:59:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:38] T217611: Deploy scap 3.9.2-1 - https://phabricator.wikimedia.org/T217611 [12:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear deployers, time to do the European Mid-day SWAT(Max 6 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190305T1200). [12:00:04] kart_ and WQL: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:18] OK. I'm here. [12:00:22] <_joe_> zeljkof: gimme 1 minute sorry [12:00:23] (03PS1) 10Elukey: Assign role::analytics_cluster::superset to analytics-tool1004 [puppet] - 10https://gerrit.wikimedia.org/r/494473 (https://phabricator.wikimedia.org/T212243) [12:00:30] <_joe_> I hoped it would be done before swat [12:01:02] <_joe_> zeljkof: nevermind, I'm done :D [12:01:10] (03PS1) 10Dzahn: icinga/elasticsearch: add notes URLs [puppet] - 10https://gerrit.wikimedia.org/r/494474 (https://phabricator.wikimedia.org/T197873) [12:01:12] (03PS2) 10Elukey: Assign role::analytics_cluster::superset to analytics-tool1004 [puppet] - 10https://gerrit.wikimedia.org/r/494473 (https://phabricator.wikimedia.org/T212243) [12:01:46] (03CR) 10jerkins-bot: [V: 04-1] icinga/elasticsearch: add notes URLs [puppet] - 10https://gerrit.wikimedia.org/r/494474 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [12:01:53] zeljkof: So, I'm ready. First, +2 on patch, right? 
[12:02:10] kart_: yes, first review and merge the patch [12:02:11] Stand-by. [12:02:28] zeljkof: already +1ed. So, I'm +2'ng. [12:02:59] WQL: please stand by, you're next, kart_ is deploying (probably) for the first time (during swat), I'm helping him [12:03:07] okay [12:03:11] zeljkof: .. after long time! [12:03:11] (03PS3) 10Elukey: Assign role::analytics_cluster::superset to analytics-tool1004 [puppet] - 10https://gerrit.wikimedia.org/r/494473 (https://phabricator.wikimedia.org/T212243) [12:03:26] (03CR) 10KartikMistry: [C: 03+2] "SWAT." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493155 (https://phabricator.wikimedia.org/T216123) (owner: 10KartikMistry) [12:04:13] (03PS2) 10Dzahn: icinga/elasticsearch: add notes URLs [puppet] - 10https://gerrit.wikimedia.org/r/494474 (https://phabricator.wikimedia.org/T197873) [12:04:15] (03CR) 10Jcrespo: [C: 03+2] install_server: Reimage db1114 to buster [puppet] - 10https://gerrit.wikimedia.org/r/494469 (https://phabricator.wikimedia.org/T161296) (owner: 10Jcrespo) [12:05:06] (03Merged) 10jenkins-bot: Enable edittag for ExternalGuidance in CX and VE [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493155 (https://phabricator.wikimedia.org/T216123) (owner: 10KartikMistry) [12:05:13] kart_: you didn't merge/deploy anything yet, right? [12:05:21] (03CR) 10jenkins-bot: Enable edittag for ExternalGuidance in CX and VE [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493155 (https://phabricator.wikimedia.org/T216123) (owner: 10KartikMistry) [12:05:24] zeljkof: no. [12:05:27] found an unreported error in logs, just checking [12:05:32] (03PS2) 10Mathew.onipe: elasticsearch: move nagios check to profile [puppet] - 10https://gerrit.wikimedia.org/r/494471 (https://phabricator.wikimedia.org/T214921) [12:05:34] ok, I'll report it [12:05:34] zeljkof: waiting for merge. [12:05:40] zeljkof: what is that? 
[12:05:53] `Catchable fatal error: Argument 1 passed to Wikibase\Rdf\RdfBuilder::addEntityRedirect() must be an instance of Wikibase\DataModel\Entity\EntityId, null given` [12:06:02] (03CR) 10Muehlenhoff: "Approach looks fine, some comments inline" (034 comments) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/493463 (owner: 10Jbond) [12:06:14] zeljkof: no related, I guess. [12:06:47] zeljkof: OK. Patch is merged. And, I'm on mwdebug1002. [12:07:27] kart_: https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#operations/mediawiki-config (at deploy 1001) [12:07:58] zeljkof: not 1002? :) [12:08:17] no, deploy1001 [12:08:34] (03CR) 10Dzahn: [C: 03+1] elasticsearch: move nagios check to profile [puppet] - 10https://gerrit.wikimedia.org/r/494471 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [12:08:36] it's mwdebug1002, if you think about that, different machine [12:09:06] zeljkof: I mean I want to test on mwdebug. [12:09:23] (03CR) 10Mathew.onipe: [C: 03+1] "Thanks for this!" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/494474 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [12:09:27] zeljkof: for testing. [12:09:30] 10Operations, 10Wikimedia-Mailing-lists: Close the grwp-wici mailing list - https://phabricator.wikimedia.org/T217247 (10jbond) 05Open→03Resolved a:03jbond This list has now been removed Thanks John [12:09:32] first you have to fetch/rebase on deployment machine, then scap pull on mwdebug [12:09:38] just follow the instructions :) [12:09:53] OK :) [12:11:49] (03PS2) 10Volans: icinga: add check_icinga script [software/external-monitoring] - 10https://gerrit.wikimedia.org/r/493298 (https://phabricator.wikimedia.org/T217599) [12:12:02] zeljkof: OK. Done fetching/rebase. [12:12:12] zeljkof: now on mwdebug1002, right? [12:12:19] (03CR) 10Volans: "Refactored a bunch of things to include support for recovery notifications and awake hours for pagers." 
(034 comments) [software/external-monitoring] - 10https://gerrit.wikimedia.org/r/493298 (https://phabricator.wikimedia.org/T217599) (owner: 10Volans) [12:12:37] kart_: yes https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#Canary [12:12:49] zeljkof: Do I need to go to any directory or just run scap there? [12:13:04] no directory, just run [12:13:15] OK [12:13:37] the docs would specify the directory, if needed [12:15:51] zeljkof: OK. We have issue. How do I revert change in mwdebug? [12:16:06] (03CR) 10Mathew.onipe: "PCC is happy! https://puppet-compiler.wmflabs.org/compiler1002/14966/" [puppet] - 10https://gerrit.wikimedia.org/r/494471 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [12:16:24] kart_: you don't revert there, you revert at deploy1001, then `scap pull at mwdebug` [12:16:43] kart_: https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#Reverting [12:16:57] OK. Reading. [12:17:03] if it's not causing an outage, you can revert in gerrit [12:17:26] that's what I usually do, but in case of trouble it's faster to revert at deploy1001 [12:18:10] yeah. going on deploy1001 first and then submit new patch. [12:18:17] so, revert in gerrit (there is revert button), fetch/rebase at deploy1001 [12:19:56] 10Operations, 10Mail: Please create talkpageconsultation@wikimedia.org email alias - https://phabricator.wikimedia.org/T217590 (10Dzahn) Hi @TBolliger, there is an effort to move these kinds of aliases away from SRE and to OIT (T122144). It would be great if you could have them create this as an alias on the... [12:20:30] zeljkof: reverted at deploy1001. Now revert at Gerrit? [12:20:57] kart_: did you follow this? https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#Reverting [12:21:27] if so, the last step is how to push to gerrit [12:21:53] so, no need to do anything _in_ gerrit [12:22:58] zeljkof: OK. Doing as instructions.. [12:23:53] kart_: when in doubt, follow the docs ;) [12:24:00] yep. 
[12:24:18] !log kartik@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: Revert gerrit:493155 (duration: 00m 49s) [12:24:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:08] zeljkof: What's last step? [12:25:14] you@deploy1001:[FOLDER]$ git push origin HEAD:refs/for/[BRANCH]/revert-[SHA1] [12:25:24] What should be in BRANCH? [12:25:37] master? [12:25:52] you're reverting config repo, right? [12:26:17] kart_: yeah, master, see https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/493155 [12:26:39] if it were core/extension/skin, it would be a wmf branch [12:27:08] (I usually revert in gerrit UI for config changes, so I had to refresh my memory) [12:27:13] (03CR) 10Addshore: [C: 04-1] "missing wgScoreTrim which we determined is needed during testing?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493010 (https://phabricator.wikimedia.org/T216730) (owner: 10Ladsgroup) [12:27:18] (03CR) 10Addshore: [C: 04-1] "missing wgScoreTrim which we determined is needed during testing?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493011 (https://phabricator.wikimedia.org/T216730) (owner: 10Ladsgroup) [12:28:09] zeljkof: It failed to send gerrit patch.. [12:28:21] uh oh [12:28:35] paste terminal output to phab [12:28:46] OK [12:28:52] https://phabricator.wikimedia.org/paste/ [12:28:56] Looks like change-ID is not added. [12:29:09] hm [12:29:12] strange [12:29:40] zeljkof: https://phabricator.wikimedia.org/P8157 [12:30:52] I added gitdir, but now permission denied. [12:31:06] kart_: ok, looking at deploy1001 [12:31:15] gitdir? [12:32:14] zeljkof: as git suggested. [12:32:24] kart_: ok, I've amended the commit, looks like I have the hook installed, so now the commit has the change-id [12:32:26] try pushing now [12:32:44] (03CR) 10Mathew.onipe: [C: 04-1] "this is probably wrong. I should refactor this to align with icinga::monitor::elasticsearch like gehel said."
[puppet] - 10https://gerrit.wikimedia.org/r/494471 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [12:32:56] (03PS1) 10KartikMistry: Revert "Enable edittag for ExternalGuidance in CX and VE" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494475 [12:33:05] kart_: strange, can you share the output of the second error? [12:33:09] zeljkof: ^ [12:34:18] zeljkof: So, we need to merge this and deploy or just merge revert change? [12:34:43] (03CR) 10Addshore: [C: 04-1] "missing wgScoreTrim which we determined was needed during testing" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493009 (https://phabricator.wikimedia.org/T216730) (owner: 10Ladsgroup) [12:34:50] kart_: merge the revert, then deploy it [12:36:01] zeljkof: OK +2'ng. [12:36:05] (03CR) 10KartikMistry: [C: 03+2] Revert "Enable edittag for ExternalGuidance in CX and VE" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494475 (owner: 10KartikMistry) [12:37:01] zeljkof: meanwhile I'll prepare new patch. [12:37:03] (03Merged) 10jenkins-bot: Revert "Enable edittag for ExternalGuidance in CX and VE" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494475 (owner: 10KartikMistry) [12:37:16] (03CR) 10jenkins-bot: Revert "Enable edittag for ExternalGuidance in CX and VE" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494475 (owner: 10KartikMistry) [12:38:00] 10Operations, 10Mail: Please create talkpageconsultation@wikimedia.org email alias - https://phabricator.wikimedia.org/T217590 (10Aklapper) @Dzahn: That was also my understanding, which made me wonder if https://wikitech.wikimedia.org/wiki/SRE_Team_requests#Mail_aliases needs clarification which would allow fo... 
[12:39:20] 10Operations, 10Mail: Please create talkpageconsultation@wikimedia.org email alias - https://phabricator.wikimedia.org/T217590 (10Dzahn) @Aklapper Already discussed on #wikimedia-clinic on IRC and done just now :) https://wikitech.wikimedia.org/w/index.php?title=SRE_Team_requests&type=revision&diff=1818415&ol... [12:40:46] (03PS1) 10KartikMistry: Enable edittag for ExternalGuidance in CX and VE [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494477 (https://phabricator.wikimedia.org/T216123) [12:40:56] zeljkof: revert patch merged. [12:41:05] kart_: ok, now deploy it [12:41:21] zeljkof: ok. following earlier steps as usual. [12:41:26] yes [12:43:13] zeljkof: OK. mwdebug is good. [12:43:44] ok [12:44:32] scap'ng [12:44:53] emm no log message? [12:45:03] !log kartik@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: Revert "Enable edittag for ExternalGuidance in CX and VE" (duration: 00m 48s) [12:45:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:08] (03PS7) 10Jbond: Add ability to filter out auto restarts [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/493463 [12:45:11] Slow logbot. [12:45:27] zeljkof: done. I guess enough learning for the day :) [12:45:44] kart_: are you deploying the fix? [12:45:49] or giving up for today? [12:46:01] zeljkof: I think it need some more testing, so not today. [12:46:14] (03CR) 10Jbond: "all comments address, thanks" (034 comments) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/493463 (owner: 10Jbond) [12:46:20] zeljkof: on Thursday for sure. I can deploy myself now.. [12:46:40] kart_: ok, do you want to practice more and deploy the next patch? ;) [12:47:18] zeljkof: sure. Let's look at it. [12:47:23] WQL: around? [12:47:29] yes [12:48:58] zeljkof: ah. This require running script. Can you take it? [12:49:07] I'll be watching. 
[12:49:10] kart_: sure, but you can run the script too :) [12:49:22] https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#Maintenance_scripts [12:49:28] (03PS3) 10Dzahn: icinga: add notes URLs to various monitoring checks, part 1 [puppet] - 10https://gerrit.wikimedia.org/r/494472 (https://phabricator.wikimedia.org/T197873) [12:49:32] zeljkof: so, how one can test this? [12:49:44] kart_: the patch? WQL will test it [12:50:01] zeljkof: there is no way via mwdebug, right? [12:50:32] kart_: for what? for testing the patch? [12:50:46] I don't know, it's a question for WQL :) [12:51:16] 10Operations, 10Icinga, 10monitoring: Google Safe Browsing Monitoring turned CRIT (rewrite check using the real API) - https://phabricator.wikimedia.org/T116099 (10Dzahn) [12:51:20] 10Operations, 10monitoring: google safe browsing icinga checks sporadic UNKNOWN due to 403 - https://phabricator.wikimedia.org/T216985 (10Dzahn) [12:51:50] (03PS6) 10GTirloni: wmcs::nfs::misc - Fixes and backup role [puppet] - 10https://gerrit.wikimedia.org/r/494195 (https://phabricator.wikimedia.org/T209527) [12:51:51] In fact I don't know how to test it either, but I know that if you merge it and run the script, the number of articles in zhwikiuniv. will have a very big increase. [12:51:57] About 5000+ [12:52:00] kart_: we're short on time, are you deploying or me? [12:52:08] zeljkof: go ahead. [12:52:13] (03PS2) 10Alaa Sarhan: labs: Enable musical notation datatype in wikidatawiki in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493009 (https://phabricator.wikimedia.org/T216730) (owner: 10Ladsgroup) [12:52:17] kart_: ok, I'll take over swat [12:52:32] WQL: so no way to test this at mwdebug, I should deploy and run the script? 
[12:52:43] yes please go ahead [12:52:50] per consensus locally [12:53:18] 10Operations: Fix "google safe browsing" Nagios checks - https://phabricator.wikimedia.org/T80182 (10Dzahn) [12:54:19] (03PS7) 10Zfilipin: Set wgArticleCountMethod='any' for zhwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/487115 (owner: 10Ammarpad) [12:54:31] WQL: ok, I'll let you know when I'm done [12:54:47] thx [12:55:19] hi there, to have this on beta only https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/493009/ [12:55:19] do I still need to request a SWAT slot? [12:55:30] (03PS8) 10Zfilipin: Set wgArticleCountMethod='any' for zhwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/487115 (https://phabricator.wikimedia.org/T214946) (owner: 10Ammarpad) [12:56:11] (03CR) 10Zfilipin: "PS8 fixes typo in the commit message." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/487115 (https://phabricator.wikimedia.org/T214946) (owner: 10Ammarpad) [12:56:15] (03CR) 10Zfilipin: [C: 03+2] Set wgArticleCountMethod='any' for zhwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/487115 (https://phabricator.wikimedia.org/T214946) (owner: 10Ammarpad) [12:56:29] (03PS3) 10Muehlenhoff: Switch app servers to component/php72 [puppet] - 10https://gerrit.wikimedia.org/r/494212 (https://phabricator.wikimedia.org/T216712) [12:57:22] (03Merged) 10jenkins-bot: Set wgArticleCountMethod='any' for zhwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/487115 (https://phabricator.wikimedia.org/T214946) (owner: 10Ammarpad) [12:58:52] (03PS7) 10GTirloni: wmcs::nfs::misc - Fixes and backup role [puppet] - 10https://gerrit.wikimedia.org/r/494195 (https://phabricator.wikimedia.org/T209527) [12:58:53] PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[12:58:54] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:487115|Set wgArticleCountMethod=any for zhwikiversity (T214946)]] (duration: 00m 49s) [12:58:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:57] T214946: Set $wgArticleCountMethod='any' in zhwikiversity - https://phabricator.wikimedia.org/T214946 [12:59:05] WQL: it's deployed, running script [12:59:14] ok [12:59:58] WQL: which script should I run? [13:00:05] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190305T1300) [13:00:06] I saw it somewhere, can't find it now? [13:00:32] zeljkof: mwscript namespaceDupes.php zhwikiversity --fix [13:00:32] hashar: extending swat for 5 or so minutes, to finish the last patch [13:00:41] kart_: thanks! [13:00:43] zeljkof: on deployment calendar.. [13:00:48] zeljkof: no worries [13:00:59] ah, I knew I saw it somewhere :) [13:01:33] Oh, I'm running scripts daily and thought it is different from SWAT. It is same :D [13:02:00] kart_: it's the same :) [13:02:07] (03PS1) 10Dzahn: icinga: add notes_url for Google Safe Browsing checks [puppet] - 10https://gerrit.wikimedia.org/r/494483 (https://phabricator.wikimedia.org/T216985) [13:02:07] WQL: hm, it only finds two pages? [13:02:17] anyway, I'll paste the output to the task [13:02:44] Should I reschedule my patch? [13:02:55] (03PS8) 10GTirloni: wmcs::nfs::misc - Fixes and backup role [puppet] - 10https://gerrit.wikimedia.org/r/494195 (https://phabricator.wikimedia.org/T209527) [13:03:12] WQL: T214946#5001314 [13:03:13] oh gosh... 
[13:03:16] please run this [13:03:20] updateArticleCount.php --update [13:03:30] ah, yes, that makes more sense :) [13:03:30] (03CR) 10jenkins-bot: Set wgArticleCountMethod='any' for zhwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/487115 (https://phabricator.wikimedia.org/T214946) (owner: 10Ammarpad) [13:03:33] I am getting something wrong... [13:03:33] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1114 crashed (HW memory issues) - https://phabricator.wikimedia.org/T214720 (10jcrespo) Hi, @Cmjohnson The remote IPMI password was out of sync. Just mentioning to add it on the to do list for motherboard changes (this and reviewing the boot order, w... [13:03:40] (03CR) 10KartikMistry: "> @KartikMistry there are plans to use a new define for mediawiki" [puppet] - 10https://gerrit.wikimedia.org/r/486454 (https://phabricator.wikimedia.org/T189091) (owner: 10KartikMistry) [13:03:55] (03CR) 10GTirloni: [C: 03+2] wmcs::nfs::misc - Fixes and backup role [puppet] - 10https://gerrit.wikimedia.org/r/494195 (https://phabricator.wikimedia.org/T209527) (owner: 10GTirloni) [13:04:37] 10Operations, 10DBA, 10Patch-For-Review, 10User-fgiunchedi: Upgrade mysqld_exporter in production - https://phabricator.wikimedia.org/T161296 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on cumin1001.eqiad.wmnet for hosts: ` ['db1114.eqiad.wmnet'] ` The log can be found in `/var/log... [13:04:39] 10Operations, 10Gerrit: Intermittent slowness on gerrit - https://phabricator.wikimedia.org/T217457 (10jbond) 05Open→03Resolved a:03jbond paladox confirmed via IRC this can be closed [13:05:02] WQL: so this?
`mwscript updateArticleCount.php zhwikiversity --update` [13:05:07] yes [13:05:23] (03PS4) 10Muehlenhoff: Switch app servers to component/php72 [puppet] - 10https://gerrit.wikimedia.org/r/494212 (https://phabricator.wikimedia.org/T216712) [13:05:25] 10Operations, 10monitoring, 10Patch-For-Review: google safe browsing icinga checks sporadic UNKNOWN due to 403 - https://phabricator.wikimedia.org/T216985 (10Dzahn) p:05Triage→03Normal [13:07:03] 10Operations, 10Gerrit: Intermittent slowness on gerrit - https://phabricator.wikimedia.org/T217457 (10jcrespo) @Mathew.onipe maybe relevant to you ^ [13:07:40] (03CR) 10Muehlenhoff: [C: 03+2] Switch app servers to component/php72 [puppet] - 10https://gerrit.wikimedia.org/r/494212 (https://phabricator.wikimedia.org/T216712) (owner: 10Muehlenhoff) [13:07:48] WQL: done https://phabricator.wikimedia.org/T214946#5001339 [13:07:57] !log EU SWAT finished [13:07:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:04] hashar: I'm done, thanks :) [13:08:05] (03PS2) 10Dzahn: icinga: add notes_url for Google Safe Browsing checks [puppet] - 10https://gerrit.wikimedia.org/r/494483 (https://phabricator.wikimedia.org/T216985) [13:08:29] kart_: congratulations on your first swat! ;) (or at least the first in a while) [13:08:45] WQL: please check the pages and thanks for deploying with #releng :) [13:08:57] thx and have a good day :-D [13:09:49] kart_: with the powers given to me by #releng, I pronounce you an official swat deployer! ;) https://phabricator.wikimedia.org/people/badges/106/ [13:09:52] 10Operations, 10Traffic: Indexing of https://www.wikidata.org in the Yandex Search Engine - https://phabricator.wikimedia.org/T217407 (10jbond) p:05Triage→03Normal [13:10:19] kart_: remember, with great power comes great responsibility :) [13:10:50] congratulations! 
[13:11:04] (03CR) 10Dzahn: [C: 03+2] icinga: add notes_url for Google Safe Browsing checks [puppet] - 10https://gerrit.wikimedia.org/r/494483 (https://phabricator.wikimedia.org/T216985) (owner: 10Dzahn) [13:11:17] (03PS3) 10Dzahn: icinga: add notes_url for Google Safe Browsing checks [puppet] - 10https://gerrit.wikimedia.org/r/494483 (https://phabricator.wikimedia.org/T216985) [13:11:33] 10Operations, 10Gerrit: Intermittent slowness on gerrit - https://phabricator.wikimedia.org/T217457 (10Mathew.onipe) @jcrespo probably... but its all fine now. Thanks! [13:12:02] (03PS1) 10Marostegui: db-eqiad.php: Depool db1064 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494484 (https://phabricator.wikimedia.org/T217591) [13:13:16] zeljkof: :) [13:14:26] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1064 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494484 (https://phabricator.wikimedia.org/T217591) (owner: 10Marostegui) [13:14:48] !log lvs500[12]: upgrade linux to 4.9.144-3.1, reboot for L1TF kernel/microcode updates T203011 [13:14:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:40] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1064 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494484 (https://phabricator.wikimedia.org/T217591) (owner: 10Marostegui) [13:15:44] (03CR) 10Addshore: [C: 03+1] labs: Enable musical notation datatype in wikidatawiki in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493009 (https://phabricator.wikimedia.org/T216730) (owner: 10Ladsgroup) [13:15:50] 10Operations: Bugzilla4 post upgrade fixes - https://phabricator.wikimedia.org/T79402 (10Dzahn) [13:15:53] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1064 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494484 (https://phabricator.wikimedia.org/T217591) (owner: 10Marostegui) [13:16:11] sau226: sorry, did not notice your patch, please move it to another swat window [13:16:41] (03CR) 10Zfilipin: "We ran out of time 
today during EU SWAT, please move the patch to another window." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492447 (https://phabricator.wikimedia.org/T214765) (owner: 10Sau226) [13:16:46] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1064 T217591 (duration: 00m 48s) [13:16:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:49] T217591: Defragment echo_event tables on x1 - https://phabricator.wikimedia.org/T217591 [13:17:23] 10Operations, 10ops-esams: knsq9 has hardware errors - https://phabricator.wikimedia.org/T80486 (10Dzahn) [13:18:15] 10Operations, 10ops-requests: reboot fenari due to leap second bug - https://phabricator.wikimedia.org/T81243 (10Dzahn) [13:20:10] (03PS3) 10Mathew.onipe: elasticsearch: move nagios check to profile [puppet] - 10https://gerrit.wikimedia.org/r/494471 (https://phabricator.wikimedia.org/T214921) [13:21:38] (03CR) 10Mathew.onipe: [C: 04-1] "This is correct. aligning with icinga::monitor::elasticsearch will make this a bit hard to apply to multiple nodes." [puppet] - 10https://gerrit.wikimedia.org/r/494471 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [13:22:06] (03CR) 10Gehel: "This is a good first step. 
We should still extract a class which groups all the checks (probably `icinga::monitor::elasticsearch::checks`)" [puppet] - 10https://gerrit.wikimedia.org/r/494471 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [13:24:14] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Wikidata-Query-Service-Sprint: Socket timeout on wdqs.svc.eqiad.wmnet - https://phabricator.wikimedia.org/T217557 (10jbond) p:05Triage→03Normal [13:25:40] 10Operations, 10LDAP-Access-Requests: Add bmansurov to archiva-deployers LDAP group - https://phabricator.wikimedia.org/T217447 (10jbond) @DarTar can you please authorize this request [13:28:44] 10Operations, 10LDAP-Access-Requests: Add bmansurov to archiva-deployers LDAP group - https://phabricator.wikimedia.org/T217447 (10Reedy) >>! In T217447#5001391, @jbond wrote: > @DarTar can you please authorize this request @DarTar is no longer with the foundation (as of 15th Feb), so he's possibly not the co... [13:31:53] !log Cutting branch wmf/1.33.0-wmf.20 # T206674 [13:31:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:56] T206674: 1.33.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T206674 [13:32:31] 10Operations, 10LDAP-Access-Requests: Add bmansurov to archiva-deployers LDAP group - https://phabricator.wikimedia.org/T217447 (10jbond) @Reedy thanks was going from https://office.wikimedia.org/wiki/Contact_list, i need authorization from bmansurov manager. Looking at https://wikimediafoundation.org/role/sta... 
[13:33:44] (03PS1) 10Dzahn: icinga/restbase/eventbus: add notes URLs [puppet] - 10https://gerrit.wikimedia.org/r/494485 (https://phabricator.wikimedia.org/T197873) [13:34:36] (03CR) 10jerkins-bot: [V: 04-1] icinga/restbase/eventbus: add notes URLs [puppet] - 10https://gerrit.wikimedia.org/r/494485 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [13:47:43] RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational [14:00:05] hashar: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki train - European version . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190305T1400). [14:08:21] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1064" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494486 [14:09:21] (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad.php: Depool db1064" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494486 (owner: 10Marostegui) [14:10:18] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1064" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494486 (owner: 10Marostegui) [14:10:31] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1064" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494486 (owner: 10Marostegui) [14:10:41] 10Operations, 10DBA, 10Patch-For-Review, 10User-fgiunchedi: Upgrade mysqld_exporter in production - https://phabricator.wikimedia.org/T161296 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1114.eqiad.wmnet'] ` Of which those **FAILED**: ` ['db1114.eqiad.wmnet'] ` [14:12:18] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1064 T217591 (duration: 01m 50s) [14:12:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:23] T217591: Defragment echo_event tables on x1 - https://phabricator.wikimedia.org/T217591 [14:13:46] (03PS1) 10Gehel: lgostash: correct version number for logstash [puppet] - 
10https://gerrit.wikimedia.org/r/494488 (https://phabricator.wikimedia.org/T216052) [14:14:34] !log Applied wmf/1.33.0-wmf.20 local patches # T206674 [14:14:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:37] T206674: 1.33.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T206674 [14:15:30] (03PS1) 10Hashar: Group0 to 1.33.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494489 (https://phabricator.wikimedia.org/T206674) [14:17:24] !log hashar@deploy1001 scap failed: LockFailedError Failed to acquire lock "/var/lock/scap.operations_mediawiki-config.lock"; owner is "hashar"; reason is "Pruned MediaWiki: 1.33.0-wmf.14" (duration: 00m 00s) [14:17:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:44] (03CR) 10CDanis: "> LGTM. It would be nice to also:" [puppet] - 10https://gerrit.wikimedia.org/r/490404 (https://phabricator.wikimedia.org/T215183) (owner: 10CDanis) [14:17:50] (03PS3) 10CDanis: partman: grub-install on all RAID{1,10} drives [puppet] - 10https://gerrit.wikimedia.org/r/490404 (https://phabricator.wikimedia.org/T215183) [14:20:19] !log otto@deploy1001 scap-helm eventgate-analytics [namespace: eventgate-analytics, clusters: eqiad,codfw] [14:20:19] !log otto@deploy1001 scap-helm eventgate-analytics cluster eqiad completed [14:20:19] !log otto@deploy1001 scap-helm eventgate-analytics cluster codfw completed [14:20:19] !log otto@deploy1001 scap-helm eventgate-analytics finished [14:20:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:25] haha [14:20:29] we hvea got to fix that! 
[14:20:33] i'm just listing status, getting help [14:21:27] (03CR) 10CDanis: [C: 03+2] partman: grub-install on all RAID{1,10} drives [puppet] - 10https://gerrit.wikimedia.org/r/490404 (https://phabricator.wikimedia.org/T215183) (owner: 10CDanis) [14:21:41] (03CR) 10Gehel: [C: 03+2] "LGTM" [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/493460 (https://phabricator.wikimedia.org/T216993) (owner: 10Mathew.onipe) [14:21:54] (03CR) 10Gehel: [V: 03+2 C: 03+2] Upgrade logstash plugins to 5.6.14 [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/493460 (https://phabricator.wikimedia.org/T216993) (owner: 10Mathew.onipe) [14:25:49] !log hashar@deploy1001 Pruned MediaWiki: 1.33.0-wmf.14 (duration: 09m 47s) [14:25:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:58] (03CR) 10Mathew.onipe: [C: 03+1] "just one comment" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/494488 (https://phabricator.wikimedia.org/T216052) (owner: 10Gehel) [14:27:20] (03PS1) 10Dzahn: icinga/toollabs: set notes URLs for toolforge related checks [puppet] - 10https://gerrit.wikimedia.org/r/494490 (https://phabricator.wikimedia.org/T197873) [14:27:55] !log hashar@deploy1001 Started scap: testwiki to php-1.33.0-wmf.20 and rebuild l10n cache # T206674 [14:27:55] !log hashar@deploy1001 scap failed: CalledProcessError Command '/usr/local/bin/mwscript mergeMessageFileList.php --wiki="cawikibooks" --list-file="/srv/mediawiki-staging/wmf-config/extension-list" --output="/tmp/tmp.JrfRQw0oDJ" --verbose' returned non-zero exit status 1 (duration: 00m 21s) [14:27:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:58] T206674: 1.33.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T206674 [14:27:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:05] (03CR) 10jerkins-bot: [V: 04-1] icinga/toollabs: set notes URLs for toolforge related checks [puppet] - 
10https://gerrit.wikimedia.org/r/494490 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [14:31:33] (03PS1) 10Ottomata: scap-helm - don't !log if no COMMAND given [puppet] - 10https://gerrit.wikimedia.org/r/494492 [14:32:54] 10Operations: make mchenry pain bandaged - https://phabricator.wikimedia.org/T81236 (10Dzahn) [14:33:14] !log jiji@cumin1001 conftool action : set/weight=5; selector: dc=codfw,service=citoid,cluster=scb,name=kubernetes.* [14:33:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:50] !log hashar@deploy1001 Started scap: testwiki to php-1.33.0-wmf.20 and rebuild l10n cache # T206674 [14:33:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:53] T206674: 1.33.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T206674 [14:34:12] !log hashar@deploy1001 scap failed: CalledProcessError Command '/usr/local/bin/mwscript mergeMessageFileList.php --wiki="cawikibooks" --list-file="/srv/mediawiki-staging/wmf-config/extension-list" --output="/tmp/tmp.ngh6XIMz8y" --verbose' returned non-zero exit status 1 (duration: 00m 21s) [14:34:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:37] !log Rump up citoid traffic from k8s to 25% on codfw - T213194 [14:34:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:40] T213194: Migrate citoid to kubernetes - https://phabricator.wikimedia.org/T213194 [14:34:56] 10Operations, 10Analytics, 10vm-requests, 10Patch-For-Review, 10User-Elukey: Replace analytics-tool1003 ganeti VM with another VM with Buster - https://phabricator.wikimedia.org/T217640 (10elukey) I just realized that in https://wikitech.wikimedia.org/wiki/Infrastructure_naming_conventions we have 'an-to... 
[14:35:11] !log hashar@deploy1001 Started scap: testwiki to php-1.33.0-wmf.20 and rebuild l10n cache # T206674 [14:35:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:32] !log hashar@deploy1001 scap failed: CalledProcessError Command '/usr/local/bin/mwscript mergeMessageFileList.php --wiki="cawikibooks" --list-file="/srv/mediawiki-staging/wmf-config/extension-list" --output="/tmp/tmp.BRPBtKvzZH" --verbose' returned non-zero exit status 1 (duration: 00m 20s) [14:35:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:20] (03CR) 10Paladox: [C: 03+1] "Bump" [puppet] - 10https://gerrit.wikimedia.org/r/493317 (https://phabricator.wikimedia.org/T217287) (owner: 10Thcipriani) [14:38:45] (03CR) 10Dzahn: "is it mirrors.jenkins.io or pkg.jenkins.io per upstream saying "I recommend pointing to https://pkg.jenkins.io which may be more tolerable" [puppet] - 10https://gerrit.wikimedia.org/r/485685 (owner: 10Hashar) [14:39:09] (03PS3) 10Gehel: icinga/elasticsearch: add notes URLs [puppet] - 10https://gerrit.wikimedia.org/r/494474 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [14:40:27] !log jiji@cumin1001 conftool action : set/pooled=yes; selector: dc=codfw,service=citoid,cluster=scb,name=kubernetes.* [14:40:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:00] (03CR) 10Gehel: [C: 03+1] "I've changed the links to something that I think make more sense. 
I tried to point directly to the relevant section of the doc, but in som" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/494474 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [14:41:11] !log otto@deploy1001 scap-helm eventgate-analytics upgrade production stable/eventgate-analytics -f eventgate-analytics-codfw-values.yaml [namespace: eventgate-analytics, clusters: codfw] [14:41:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:13] !log otto@deploy1001 scap-helm eventgate-analytics cluster codfw completed [14:41:13] !log otto@deploy1001 scap-helm eventgate-analytics finished [14:41:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:39] (03PS2) 10Gehel: logstash: correct version number for logstash [puppet] - 10https://gerrit.wikimedia.org/r/494488 (https://phabricator.wikimedia.org/T216052) [14:41:56] (03CR) 10Gehel: logstash: correct version number for logstash (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/494488 (https://phabricator.wikimedia.org/T216052) (owner: 10Gehel) [14:42:03] 10Operations, 10MediaWiki-Cache, 10MW-1.33-notes (1.33.0-wmf.18; 2019-02-19), 10Patch-For-Review, and 3 others: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10elukey) @aaron if you have time can you explain w... 
[14:42:06] (03PS3) 10Gehel: logstash: correct version number for logstash [puppet] - 10https://gerrit.wikimedia.org/r/494488 (https://phabricator.wikimedia.org/T216052) [14:42:33] !log otto@deploy1001 scap-helm eventgate-analytics upgrade production stable/eventgate-analytics -f eventgate-analytics-eqiad-values.yaml [namespace: eventgate-analytics, clusters: eqiad] [14:42:34] !log otto@deploy1001 scap-helm eventgate-analytics cluster eqiad completed [14:42:34] !log otto@deploy1001 scap-helm eventgate-analytics finished [14:42:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:09] (03CR) 10Gehel: [C: 03+2] logstash: correct version number for logstash [puppet] - 10https://gerrit.wikimedia.org/r/494488 (https://phabricator.wikimedia.org/T216052) (owner: 10Gehel) [14:43:56] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "LGTM, very good job." 
[puppet] - 10https://gerrit.wikimedia.org/r/487882 (https://phabricator.wikimedia.org/T215275) (owner: 10Jbond) [14:44:51] !log hashar@deploy1001 Started scap: testwiki to php-1.33.0-wmf.20 and rebuild l10n cache # T206674 [14:44:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:53] T206674: 1.33.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T206674 [14:45:01] blabla stuff broken [14:45:14] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Skip logging 'aux' messages from Docker [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/494238 (owner: 10Hashar) [14:45:43] (03CR) 10jenkins-bot: Skip logging 'aux' messages from Docker [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/494238 (owner: 10Hashar) [14:49:36] (03CR) 10Alexandros Kosiaris: [C: 03+1] scap-helm - don't !log if no COMMAND given [puppet] - 10https://gerrit.wikimedia.org/r/494492 (owner: 10Ottomata) [14:49:43] (03PS28) 10Jbond: Improve CI checks to ensure a basic catalogue compiles on all supported OS's [puppet] - 10https://gerrit.wikimedia.org/r/487882 (https://phabricator.wikimedia.org/T215275) [14:50:50] (03CR) 10Jbond: [C: 03+2] Improve CI checks to ensure a basic catalogue compiles on all supported OS's [puppet] - 10https://gerrit.wikimedia.org/r/487882 (https://phabricator.wikimedia.org/T215275) (owner: 10Jbond) [14:50:51] 10Operations, 10ops-codfw, 10Discovery-Search (Current work): elastic2038 CPU/memory errors - https://phabricator.wikimedia.org/T217398 (10Papaul) @Gehel Since yesterday there is no error reported in the log. Can you repool the server so we can monitor it while it is under load. Thanks. 
[14:50:58] (03CR) 10Ottomata: [C: 03+2] scap-helm - don't !log if no COMMAND given [puppet] - 10https://gerrit.wikimedia.org/r/494492 (owner: 10Ottomata) [14:51:05] (03PS2) 10Ottomata: scap-helm - don't !log if no COMMAND given [puppet] - 10https://gerrit.wikimedia.org/r/494492 [14:51:15] (03CR) 10Ottomata: [V: 03+2 C: 03+2] scap-helm - don't !log if no COMMAND given [puppet] - 10https://gerrit.wikimedia.org/r/494492 (owner: 10Ottomata) [14:51:33] oo puppet merge clash [14:51:39] jbond42: I can merge yours? [14:51:43] CI checks? [14:52:18] !log reprepro added bdsync_0.10-1+deb9u1 T209527 [14:52:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:21] T209527: Set up scratch and maps NFS services on cloudstore1008/9 - https://phabricator.wikimedia.org/T209527 [14:52:36] jbond42: hope so! merging. [14:52:38] ottomata: sorry just merged it [14:52:43] ohok great [14:52:52] danke [14:53:13] didn't see any other waiting though [15:05:09] 10Operations, 10Puppet, 10Continuous-Integration-Config, 10Patch-For-Review: Improve CI checks to cover more of the code base - https://phabricator.wikimedia.org/T215275 (10jbond) 05Open→03Resolved [15:17:24] (03CR) 10Dzahn: [C: 03+2] "thanks Gehel! separate anchors in the same page is also what i had in mind for the ideal structure.
of course the content can be written a" [puppet] - 10https://gerrit.wikimedia.org/r/494474 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [15:17:54] (03PS1) 10Mathew.onipe: elasticsearch: refactor icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/494499 (https://phabricator.wikimedia.org/T214921) [15:19:07] (03CR) 10Mathew.onipe: [C: 03+1] icinga/elasticsearch: add notes URLs [puppet] - 10https://gerrit.wikimedia.org/r/494474 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [15:19:24] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: refactor icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/494499 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [15:20:18] (03PS4) 10Dzahn: icinga/elasticsearch: add notes URLs [puppet] - 10https://gerrit.wikimedia.org/r/494474 (https://phabricator.wikimedia.org/T197873) [15:21:25] (03PS2) 10Mathew.onipe: elasticsearch: refactor icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/494499 (https://phabricator.wikimedia.org/T214921) [15:25:07] 10Operations, 10VisualEditor, 10Performance-Team (Radar), 10Software-Licensing: New MongoDB version is not DFSG-compatible, dropped by Debian - https://phabricator.wikimedia.org/T213996 (10Krinkle) [15:26:18] (03PS1) 10Urbanecm: New throttle rule for Czech Senior Citizen Write Wikipedia course [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494500 (https://phabricator.wikimedia.org/T217663) [15:28:00] (03CR) 10Mathew.onipe: "PCC is still happy :)" [puppet] - 10https://gerrit.wikimedia.org/r/494499 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [15:31:27] (03PS1) 10Ottomata: Sync /srv/published-datasets from SWAP hosts [puppet] - 10https://gerrit.wikimedia.org/r/494501 (https://phabricator.wikimedia.org/T217619) [15:32:05] (03CR) 10jerkins-bot: [V: 04-1] Sync /srv/published-datasets from SWAP hosts [puppet] - 10https://gerrit.wikimedia.org/r/494501 (https://phabricator.wikimedia.org/T217619) (owner: 
10Ottomata) [15:35:12] (03PS2) 10Ottomata: Sync /srv/published-datasets from SWAP hosts [puppet] - 10https://gerrit.wikimedia.org/r/494501 (https://phabricator.wikimedia.org/T217619) [15:35:54] !log hashar@deploy1001 Finished scap: testwiki to php-1.33.0-wmf.20 and rebuild l10n cache # T206674 (duration: 51m 03s) [15:35:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:57] T206674: 1.33.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T206674 [15:36:05] (03PS2) 10Dzahn: icinga/restbase/eventbus: add notes URLs [puppet] - 10https://gerrit.wikimedia.org/r/494485 (https://phabricator.wikimedia.org/T197873) [15:36:07] (03CR) 10jerkins-bot: [V: 04-1] Sync /srv/published-datasets from SWAP hosts [puppet] - 10https://gerrit.wikimedia.org/r/494501 (https://phabricator.wikimedia.org/T217619) (owner: 10Ottomata) [15:47:26] that takes a while :/ [15:48:03] (03CR) 10Hashar: [C: 03+2] Group0 to 1.33.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494489 (https://phabricator.wikimedia.org/T206674) (owner: 10Hashar) [15:49:01] (03Merged) 10jenkins-bot: Group0 to 1.33.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494489 (https://phabricator.wikimedia.org/T206674) (owner: 10Hashar) [15:49:31] (03CR) 10jenkins-bot: Group0 to 1.33.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494489 (https://phabricator.wikimedia.org/T206674) (owner: 10Hashar) [15:53:17] !log hashar@deploy1001 rebuilt and synchronized wikiversions files: group0 to 1.33.0-wmf.20 [15:53:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:00] 10Operations, 10Analytics, 10hardware-requests: GPU upgrade for stat1005 - https://phabricator.wikimedia.org/T216226 (10EBernhardson) At this point I'm not really expecting a response from AMD anymore, generally i would expect under a week but the only response I've gotten so far is the automated response sa... 
[15:57:28] 10Operations, 10Analytics, 10hardware-requests: GPU upgrade for stat1005 - https://phabricator.wikimedia.org/T216226 (10elukey) Thanks @EBernhardson! @RobH What do you think? Would it be feasible for you to check from our vendors if we can get a RX Vega 64? [16:00:25] 10Operations, 10Analytics, 10hardware-requests: GPU upgrade for stat1005 - https://phabricator.wikimedia.org/T216226 (10RobH) So the only vendor we have terms with that may have this is https://neweggbusiness.com https://www.neweggbusiness.com/product/productlist.aspx?Submit=ENE&DEPA=0&Order=BESTMATCH&N=-1... [16:03:36] 10Operations, 10Analytics, 10hardware-requests: GPU upgrade for stat1005 - https://phabricator.wikimedia.org/T216226 (10Nuria) I am ok with the $800 cost of this proof of concept, benefit in terms of engineering hours if this is to work is pretty significant [16:03:46] (03PS5) 10Jcrespo: mysqld-prometheus-exporter: Change the default arguments for buster [puppet] - 10https://gerrit.wikimedia.org/r/494236 (https://phabricator.wikimedia.org/T161296) [16:06:36] 10Operations, 10Analytics, 10hardware-requests: GPU upgrade for stat1005 - https://phabricator.wikimedia.org/T216226 (10elukey) Those cards afaics are only 8G, meanwhile we'd need 16G (if possible). The only model that would suit us that I found is: https://www.neweggbusiness.com/product/product.aspx?item=9... [16:06:54] 10Operations, 10Analytics, 10hardware-requests: GPU upgrade for stat1005 - https://phabricator.wikimedia.org/T216226 (10EBernhardson) Newegg looks to only have models with aftermarket cooling in stock,I would have some size concerns with these as well. Ideally we want the stock form factor, like the (out of... 
[16:07:49] (03PS1) 10GTirloni: wmcs::nfs::misc - Fix sshd config [puppet] - 10https://gerrit.wikimedia.org/r/494505 (https://phabricator.wikimedia.org/T209527) [16:08:27] (03CR) 10GTirloni: [C: 03+2] wmcs::nfs::misc - Fix sshd config [puppet] - 10https://gerrit.wikimedia.org/r/494505 (https://phabricator.wikimedia.org/T209527) (owner: 10GTirloni) [16:10:46] 10Operations, 10Analytics, 10hardware-requests: GPU upgrade for stat1005 - https://phabricator.wikimedia.org/T216226 (10EBernhardson) >>! In T216226#5001885, @elukey wrote: > Those cards afaics are only 8G, meanwhile we'd need 16G (if possible). The only model that would suit us that I found is: > > https:/... [16:11:02] (03PS1) 10Vgutierrez: acme-chief-api: Add support for puppet HTTP API search operation [software/acme-chief] - 10https://gerrit.wikimedia.org/r/494506 (https://phabricator.wikimedia.org/T207295) [16:11:51] 10Operations, 10ops-codfw, 10Discovery-Search (Current work): elastic2038 CPU/memory errors - https://phabricator.wikimedia.org/T217398 (10Gehel) @Papaul server has been repooled since it restarted. Let's blame cosmic rays until we prove otherwise? [16:12:29] 10Operations, 10Analytics, 10hardware-requests: GPU upgrade for stat1005 - https://phabricator.wikimedia.org/T216226 (10RobH) Can you guys (as a team) decide on a GPU card and provide me the exact model and a url for purchase? All pricing should technically take place in the S4 space, so I created a sub tas... 
[16:12:48] (03PS6) 10Jcrespo: mysqld-prometheus-exporter: Change the default arguments for buster [puppet] - 10https://gerrit.wikimedia.org/r/494236 (https://phabricator.wikimedia.org/T161296) [16:13:08] (03PS1) 10GTirloni: block_sync - Adjust SSH private key filename [puppet] - 10https://gerrit.wikimedia.org/r/494508 (https://phabricator.wikimedia.org/T209527) [16:13:35] (03CR) 10jerkins-bot: [V: 04-1] acme-chief-api: Add support for puppet HTTP API search operation [software/acme-chief] - 10https://gerrit.wikimedia.org/r/494506 (https://phabricator.wikimedia.org/T207295) (owner: 10Vgutierrez) [16:14:29] (03CR) 10GTirloni: [C: 03+2] block_sync - Adjust SSH private key filename [puppet] - 10https://gerrit.wikimedia.org/r/494508 (https://phabricator.wikimedia.org/T209527) (owner: 10GTirloni) [16:15:11] (03PS7) 10Jcrespo: mysqld-prometheus-exporter: Change the default arguments for buster [puppet] - 10https://gerrit.wikimedia.org/r/494236 (https://phabricator.wikimedia.org/T161296) [16:16:42] (03PS2) 10Vgutierrez: acme-chief-api: Add support for puppet HTTP API search operation [software/acme-chief] - 10https://gerrit.wikimedia.org/r/494506 (https://phabricator.wikimedia.org/T207295) [16:17:01] PROBLEM - graphite.wikimedia.org render on graphite1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.061 second response time [16:17:17] PROBLEM - graphite.wikimedia.org api on graphite1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.061 second response time [16:18:28] (03CR) 10jerkins-bot: [V: 04-1] acme-chief-api: Add support for puppet HTTP API search operation [software/acme-chief] - 10https://gerrit.wikimedia.org/r/494506 (https://phabricator.wikimedia.org/T207295) (owner: 10Vgutierrez) [16:19:07] Graphite seems to have gone down, I get a bunch of 502 Bad Gateway responses [16:19:14] (same as icinga-wm, in other words) [16:19:35] (03CR) 10Jcrespo: [C: 03+1] "This confirms this works as expected: 
https://puppet-compiler.wmflabs.org/compiler1002/14972/dbstore2001.codfw.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/494236 (https://phabricator.wikimedia.org/T161296) (owner: 10Jcrespo) [16:20:01] I’ll try restarting uwsgi there [16:20:12] (03PS5) 10Alexandros Kosiaris: Introduce cxserver helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/492301 (https://phabricator.wikimedia.org/T213195) [16:20:30] !log restarting uwsgi-graphite-web on graphite1004 [16:20:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:29] (03CR) 10Jcrespo: [C: 03+1] "A wmcs database is affected, but note that in the worst-case (unlikely) scenario, prometheus metrics collection goes down." [puppet] - 10https://gerrit.wikimedia.org/r/494236 (https://phabricator.wikimedia.org/T161296) (owner: 10Jcrespo) [16:21:49] (03PS2) 10Dzahn: icinga/toollabs: set notes URLs for toolforge related checks [puppet] - 10https://gerrit.wikimedia.org/r/494490 (https://phabricator.wikimedia.org/T197873) [16:21:51] RECOVERY - graphite.wikimedia.org render on graphite1004 is OK: HTTP OK: HTTP/1.1 200 OK - 1569 bytes in 0.066 second response time [16:22:07] RECOVERY - graphite.wikimedia.org api on graphite1004 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.088 second response time [16:23:36] 10Operations, 10ops-eqiad, 10Cognate, 10Growth-Team, and 6 others: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960 (10ayounsi) [16:24:10] (03CR) 10Gehel: [C: 04-1] Add wdqs data transfer cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/488256 (https://phabricator.wikimedia.org/T213401) (owner: 10Mathew.onipe) [16:24:32] it’s back, thanks herron [16:24:48] (I was basically completely redundant here, but still ^^) [16:25:20] sure np, though still looking a bit [16:25:35] at https://grafana.wikimedia.org/d/000000020/graphite-eqiad?refresh=1m&orgId=1&from=now-1h&to=now [16:26:19] disk utilization in particular [16:27:06]
looks to be cooling off now, will keep an eye [16:30:57] I’m getting 502s again [16:31:02] 10Operations, 10Mail: Please create talkpageconsultation@wikimedia.org email alias - https://phabricator.wikimedia.org/T217590 (10TBolliger) 05Open→03Invalid OK, I emailed techsupport@. Thank you! [16:31:27] PROBLEM - graphite.wikimedia.org render on graphite1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.295 second response time [16:31:43] PROBLEM - graphite.wikimedia.org api on graphite1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.061 second response time [16:32:05] (it’s slightly evil how, when editing a dashboard, the 502s aren’t really visible, Grafana will just keep showing you an outdated view of the dashboard you’re editing) [16:33:41] (03CR) 10Gehel: [C: 04-1] "See comments inline" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/494499 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [16:39:05] RECOVERY - graphite.wikimedia.org api on graphite1004 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.071 second response time [16:40:01] RECOVERY - graphite.wikimedia.org render on graphite1004 is OK: HTTP OK: HTTP/1.1 200 OK - 1569 bytes in 0.089 second response time [16:41:53] 10Operations, 10ops-eqiad, 10Cognate, 10Growth-Team, and 6 others: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960 (10ayounsi) [16:42:53] (03PS1) 10GTirloni: wmcs::nfs::misc - Fix roles [puppet] - 10https://gerrit.wikimedia.org/r/494510 (https://phabricator.wikimedia.org/T209527) [16:43:18] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging stable/eventgate-analytics -f eventgate-analytics-staging-values.yaml [namespace: eventgate-analytics, clusters: staging] [16:43:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:20] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [16:43:20] !log 
otto@deploy1001 scap-helm eventgate-analytics finished [16:43:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:47] now I’m not getting any responses from Graphite anymore [16:43:51] the POST just seems to hang indefinitely [16:43:57] 10Operations, 10Analytics, 10hardware-requests: GPU upgrade for stat1005 - https://phabricator.wikimedia.org/T216226 (10RobH) So we aren't 100% certain from photos on T216528 if the left hand side PCIe riser has one or two slots on the back of the chassis. Chris will update T216528 with a photo to show the... [16:44:51] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Kanban: confirm gpu form factor in stat1005 - https://phabricator.wikimedia.org/T216528 (10RobH) Chris, As discussed in IRC, we want to know how many slots are on this system's rear chassis. (Not slots in the riser, but mounting slots on the back.) Ple... [16:45:12] 10Operations, 10Patch-For-Review: Redundant bootloaders for software RAID - https://phabricator.wikimedia.org/T215183 (10CDanis) [16:47:27] (03CR) 10GTirloni: [C: 03+2] wmcs::nfs::misc - Fix roles [puppet] - 10https://gerrit.wikimedia.org/r/494510 (https://phabricator.wikimedia.org/T209527) (owner: 10GTirloni) [16:48:33] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:49:18] (03PS1) 10Dzahn: icinga: add notes URLs to various monitoring checks, part 2 [puppet] - 10https://gerrit.wikimedia.org/r/494511 [16:49:40] now one of them responded with 503 Service Unavailable [16:49:45] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:50:06] not sure why icinga-wm isn’t seeing this [16:50:20] (cc herron, not sure if you’re looking at IRC at the moment) [16:51:11] having another look [16:52:34] graphite1004 kernel: [10179701.956141] oom_reaper: reaped process 184953 (uwsgi), now anon-rss:0kB, file-rss:0kB, shmem-rss:124kB [16:52:46] restarting again, though could use another set of eyes to look at this closer [16:52:55] !log restarting uwsgi-graphite-web on graphite1004 [16:52:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:04] no alerts for graphite1004 at all on icinga [16:54:25] !log imported logstash 1:5.6.14-1 to thirdparty/elastic56 [16:54:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:28] ^ gehel [16:54:54] moritzm: thanks! I'm curious, what did you do? [16:55:39] I ran "reprepro --noskipold --component thirdparty/elastic56 update stretch-wikimedia" and that correctly imported it [16:55:47] did you use a different command before? [16:56:16] or maybe pulled component/elastic56 instead? [16:58:28] 10Operations, 10ops-eqiad: Update several hosts status in Netbox - https://phabricator.wikimedia.org/T217429 (10ayounsi) https://netbox.wikimedia.org/dcim/devices/396/ an-master1001 is also status "planned" but an active master server. [16:59:40] (03CR) 10Krinkle: [C: 03+1] xhgui: require php-mongodb package [puppet] - 10https://gerrit.wikimedia.org/r/494422 (https://phabricator.wikimedia.org/T180761) (owner: 10Dzahn) [17:00:04] godog and _joe_: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Puppet SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190305T1700). [17:00:05] No GERRIT patches in the queue for this window AFAICS. 
[17:00:14] (03CR) 10Krinkle: [C: 03+1] "In following with https://wikitech.wikimedia.org/wiki/Performance/Runbook/Puppet_patches," [puppet] - 10https://gerrit.wikimedia.org/r/494422 (https://phabricator.wikimedia.org/T180761) (owner: 10Dzahn) [17:00:32] 10Operations, 10ops-eqiad, 10Cognate, 10Growth-Team, and 6 others: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960 (10ayounsi) [17:04:17] (03CR) 10Vgutierrez: [C: 04-1] acme-chief-api: Add support for puppet HTTP API search operation (031 comment) [software/acme-chief] - 10https://gerrit.wikimedia.org/r/494506 (https://phabricator.wikimedia.org/T207295) (owner: 10Vgutierrez) [17:04:59] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10Core Platform Team Backlog (Watching / External), and 2 others: Migrate production services to kubernetes using the pipeline - https://phabricator.wikimedia.org/T198901 (10Jdforrester-WMF) [17:07:29] 10Operations, 10ops-eqiad, 10Cognate, 10Growth-Team, and 6 others: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960 (10ayounsi) Thank you all for the quick replies! >>! In T187960#5000934, @hashar wrote: > Is that just about re cabling the server from a switch to... [17:07:36] (03CR) 10Krinkle: [C: 03+1] "Clean run at https://puppet-compiler.wmflabs.org/compiler1002/14974/." 
[puppet] - 10https://gerrit.wikimedia.org/r/494422 (https://phabricator.wikimedia.org/T180761) (owner: 10Dzahn) [17:08:05] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10Core Platform Team Backlog (Watching / External), and 2 others: Migrate production services to kubernetes using the pipeline - https://phabricator.wikimedia.org/T198901 (10Jdforrester-WMF) [17:08:15] (03CR) 10Krinkle: [C: 03+1] "Puppet agent run at beta:" [puppet] - 10https://gerrit.wikimedia.org/r/494422 (https://phabricator.wikimedia.org/T180761) (owner: 10Dzahn) [17:08:45] 10Operations, 10ops-codfw, 10Discovery-Search (Current work): elastic2038 CPU/memory errors - https://phabricator.wikimedia.org/T217398 (10Papaul) p:05High→03Normal [17:11:12] 10Operations, 10Analytics, 10hardware-requests: GPU upgrade for stat1005 - https://phabricator.wikimedia.org/T216226 (10EBernhardson) If only one slot is confirmed, I'm not sure there is a way forward. As far as i can tell even the datacenter gpu's (MI10, MI25) require the second slot. If dual slot is confi... [17:11:39] 10Operations, 10Graphite: Graphite returning server errors (out of memory?) - https://phabricator.wikimedia.org/T217679 (10Lucas_Werkmeister_WMDE) [17:12:00] herron: ^ I created a Phabricator task for the Graphite issue, perhaps it’s useful… [17:12:56] 10Operations, 10Graphite: Graphite returning server errors (out of memory?) - https://phabricator.wikimedia.org/T217679 (10Lucas_Werkmeister_WMDE) [17:13:02] thanks Lucas_WMDE [17:13:44] (03CR) 10CRusnov: "Ahh unfortunately it doesn't like '*:0/15' as a timer specifier." [puppet] - 10https://gerrit.wikimedia.org/r/493774 (https://phabricator.wikimedia.org/T215229) (owner: 10CRusnov) [17:14:34] 10Operations, 10Graphite: Graphite returning server errors (out of memory?) 
- https://phabricator.wikimedia.org/T217679 (10herron) Looking here https://grafana.wikimedia.org/d/000000020/graphite-eqiad?refresh=1m&orgId=1&from=now-3h&to=now disk utilization has increased significantly [17:14:53] useful indeed, thanks Lucas_WMDE [17:15:07] I’ll leave you to it then, good luck :) [17:19:10] thanks [17:19:24] herron: any insight/idea so far ? [17:19:35] cc cdanis ^ [17:20:04] not yet, though wondering why iowait is high-ish [17:20:04] there are a bunch of uwsgi processes spinning on 100% CPU and using tons of RAM [17:20:06] I think this is new [17:20:40] that was the case as well when icinga alerted, bouncing uwsgi helped temporarily [17:21:11] and also oom killer has been killing uwsgi processes [17:21:20] that would make sense if the issue is we're getting long-running very-expensive queries, or that the code running in uwsgi is getting stuck somehow [17:22:09] 10Operations, 10Analytics, 10hardware-requests: GPU upgrade for stat1005 - https://phabricator.wikimedia.org/T216226 (10Nuria) +1 to 16G test [17:22:35] seeing IndexError: list index out of range [17:22:43] in /var/log/graphite-web/exception.log [17:23:18] https://www.irccloud.com/pastebin/dVXB5kM1/ [17:23:57] ack, thanks yeah it does seem like a particularly heavy query [17:25:03] hm [17:25:09] !log restarting uwsgi-graphite-web on graphite1004 [17:25:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:11] hum… could that be my query in particular? the one with lots of aliasSub() calls? [17:25:14] that would be embarrassing [17:25:25] uwsgi only logs at the end of the query and not at the start of the query eh? [17:26:55] (I’ve been working with https://grafana.wikimedia.org/d/000000170/wikidata-edits?refresh=1m&orgId=1&panelId=2&fullscreen, which uses a bunch of aliasSub() to turn OAuth CIDs into consumer names) [17:27:05] Lucas_WMDE: are you running queries against wikidata.rc.edits.oauth ?
[17:27:10] yes [17:27:30] and some of those aliasSub()s were probably bad, removing or adding the wrong number of dots [17:27:33] if that confused aliasNode()… [17:27:44] *aliasByNode [17:28:20] I should stop editing that page then, so that I don’t keep sending render requests [17:28:34] * Lucas_WMDE closed the tab [17:28:51] I cannot say for sure that those are the bad queries [17:29:36] yeah it is fairly hard unfortunately to pinpoint the bad queries in cases like this :| [17:29:59] in fact T116767 has been open forever and still is :| [17:29:59] T116767: limit the impact of heavy/large graphite queries - https://phabricator.wikimedia.org/T116767 [17:30:23] 10Operations, 10Analytics, 10vm-requests, 10Patch-For-Review, 10User-Elukey: Replace analytics-tool1003 ganeti VM with another VM with Buster - https://phabricator.wikimedia.org/T217640 (10elukey) To keep archives happy: @MoritzMuehlenhoff is currently testing the Buster debian installer for Ganeti VMs,... [17:34:00] these are almost certainly long-running queries. I really wish uwsgi would log at the start of a query, or if there was an easy way to figure out what query a given pid was running [17:34:32] even after the latest restart we have three workers that are 'out to lunch' and have consumed 7 minutes+ of CPU usage, have a big RSS, and are just mmap()ing I can't tell what [17:35:33] a request timeout in apache for the affected endpoint might help? [17:37:13] it would stem some of the bleeding [17:37:18] 10Operations: Document service owner in Netbox - https://phabricator.wikimedia.org/T217686 (10ayounsi) p:05Triage→03Low [17:37:38] 10Operations: Document service owner in Netbox - https://phabricator.wikimedia.org/T217686 (10ayounsi) [17:37:50] yeah, any timeout is better than no timeout, even just a sanity-limit at e.g. 
60s or 120s [17:38:35] it has not happened today that a query on graphite1004 took longer than 15 seconds and actually completed successfully [17:38:57] (I'm generally a fan of the idea that beyond some small value, users will give up anyways and there's no point having long user-facing timeouts. the problem in translating this to infrastructural timeouts is we also have internal async stuff and API requests where user-facing rules don't necessarily apply, which could get broken) [17:39:37] but since this is graphite and not mediawiki, then yeah maybe something even shorter, I donno :) [17:41:12] indeed, worth a try for sure [17:41:53] I find it very frustrating I have not been able to find a way to look at what request is executing on those three uwsgi workers [17:42:18] in the meantime I'm for testing the theory that it was indeed wikidata queries and request cancellation isn't really a thing and bouncing uwsgi one more time, thoughts? [17:43:39] sounds good to me [17:43:53] godog: go for it [17:44:30] !log bounce uwsgi on graphite1004 [17:44:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:49] I believe I have done the thing, btw [17:45:11] https://phabricator.wikimedia.org/P8159 should be visible to just wmf-nda because it probably has Lucas_WMDE's IP in it [17:45:43] I’m curious from apache perspective how long these requests take to serve [17:45:50] it would probably just be the WMDE office IP, but thanks :) [17:45:57] but off hand our log format there doesn’t include that, could be missing it [17:45:59] it does not have the actual graphite query being executed [17:46:01] but [17:46:03] "HTTP_REFERER=https://grafana.wikimedia.org/d/000000170/wikidata-edits?refresh=1m&orgId=1&panelId=2&fullscreen&edit", [17:46:04] (03PS12) 10Giuseppe Lavagetto: Add a WMF-specific tool for managing db config in MediaWiki [software/conftool] - 10https://gerrit.wikimedia.org/r/441396 (https://phabricator.wikimedia.org/T197126) [17:46:11] in 
the vars section of one of the stuck workers [17:46:33] that sure sounds like the board I was working with [17:46:37] (that one had something like 10 minutes of CPU time consumed, still spinning at 100%, and several gigabytes of RSS) [17:47:24] also FTR that paste was generated with: sudo uwsgi --connect-and-read /run/uwsgi/graphite-web-stats.sock |& phaste [17:47:56] <_joe_> the phaste use there is kinda nice, heh [17:48:08] I only learned about it last week :) [17:48:20] very nice to have around [17:48:36] (03CR) 10jerkins-bot: [V: 04-1] Add a WMF-specific tool for managing db config in MediaWiki [software/conftool] - 10https://gerrit.wikimedia.org/r/441396 (https://phabricator.wikimedia.org/T197126) (owner: 10Giuseppe Lavagetto) [17:48:45] <_joe_> yes but I usually save to file, then send to phaste if I need to, because I'm lazy [17:49:02] nice indeed, worth adding to https://wikitech.wikimedia.org/wiki/Graphite [17:49:05] oh, that sounds neat [17:49:24] will do godog [17:51:36] ok seems like things are going back to 'normal' [17:53:05] (03PS1) 10BryanDavis: toolforge: Rewrite envelope From headers when relaying [puppet] - 10https://gerrit.wikimedia.org/r/494515 (https://phabricator.wikimedia.org/T213416) [17:53:07] (03PS1) 10BryanDavis: toolforge: remove obsolete mailrelay manifests [puppet] - 10https://gerrit.wikimedia.org/r/494516 (https://phabricator.wikimedia.org/T208843) [17:53:38] 10Operations, 10LDAP-Access-Requests: Add bmansurov to archiva-deployers LDAP group - https://phabricator.wikimedia.org/T217447 (10leila) approved. 
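[editor's note] The stats-socket dump mentioned at 17:47:24 (sudo uwsgi --connect-and-read /run/uwsgi/graphite-web-stats.sock) is what exposed the HTTP_REFERER of the stuck worker. A minimal sketch of how such a dump could be filtered for runaway workers follows. It assumes the dump is JSON with a "workers" list whose entries carry "pid", "rss", "status" and per-core "vars" strings of the form KEY=value; those field names are an assumption about the uwsgi stats output, not something confirmed in this log. The function name, the 1 GiB threshold and the sample data are hypothetical.

```python
def find_stuck_workers(stats, rss_limit_bytes=1024 ** 3):
    """Return a summary of every worker whose RSS is at or above the limit.

    `stats` is the parsed stats dump (a dict). The field names used here
    ("workers", "rss", "pid", "status", "cores", "vars") are assumptions
    about the uwsgi stats format.
    """
    stuck = []
    for worker in stats.get("workers", []):
        if worker.get("rss", 0) < rss_limit_bytes:
            continue
        # Request variables are assumed to be "KEY=value" strings; split on
        # the first "=" only, because URLs contain "=" themselves.
        referers = [
            var.split("=", 1)[1]
            for core in worker.get("cores", [])
            for var in core.get("vars", [])
            if var.startswith("HTTP_REFERER=")
        ]
        stuck.append({
            "pid": worker.get("pid"),
            "rss": worker.get("rss"),
            "status": worker.get("status"),
            "referers": referers,
        })
    return stuck


# Hypothetical sample data, shaped like the assumed stats format.
SAMPLE_STATS = {
    "workers": [
        {"pid": 101, "rss": 50 * 1024 ** 2, "status": "idle", "cores": []},
        {
            "pid": 202,
            "rss": 3 * 1024 ** 3,
            "status": "busy",
            "cores": [
                {"vars": [
                    "REQUEST_METHOD=POST",
                    "HTTP_REFERER=https://grafana.example/d/board?panelId=2&edit",
                ]},
            ],
        },
    ]
}

for entry in find_stuck_workers(SAMPLE_STATS):
    print(entry["pid"], entry["rss"], entry["status"], entry["referers"])
```

With a real dump, the text read from the socket would first be parsed with json.loads and the resulting dict passed to find_stuck_workers.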
[17:53:52] (03CR) 10jerkins-bot: [V: 04-1] toolforge: remove obsolete mailrelay manifests [puppet] - 10https://gerrit.wikimedia.org/r/494516 (https://phabricator.wikimedia.org/T208843) (owner: 10BryanDavis) [17:57:14] ok I'll do a bit of task grooming, there's at least a couple of things we could do to mitigate similar scenarios [17:57:57] (03CR) 10BryanDavis: "TODO: check the webgrid nodes for a competing service used by dynamicproxy" [puppet] - 10https://gerrit.wikimedia.org/r/493767 (https://phabricator.wikimedia.org/T151704) (owner: 10BryanDavis) [17:59:09] should I try to open the board again? (without editing for now) [17:59:19] I hope the existing queries weren’t problematic [18:00:04] cscott, arlolra, subbu, halfak, and Amir1: Your horoscope predicts another unfortunate Services – Graphoid / Parsoid / Citoid / ORES deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190305T1800). [18:00:16] Lucas_WMDE: yes please [18:00:28] (03CR) 10CRusnov: "Updates pending." (0314 comments) [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/492007 (https://phabricator.wikimedia.org/T215229) (owner: 10CRusnov) [18:00:34] okay, loading it seems to be working [18:00:39] I’ll comment on Phabricator with some more info [18:01:16] godog: herron: how do you feel about setting some uwsgi options about ... probably set evil-reload-on-rss to like a gig or so? [18:01:40] the usual worker RSS is like 40-60MB [18:01:53] cdanis: definitely +1 [18:02:59] yeah makes sense [18:02:59] Lucas_WMDE: thanks! 
sorry about the disruption/unreliability, will followup on task and T116767 [18:03:00] T116767: limit the impact of heavy/large graphite queries - https://phabricator.wikimedia.org/T116767 [18:04:57] gonna add a 60 seconds max duration as well [18:08:12] looking at the logs on graphite1004 the two longest queries are one 43 second query and one 21 second query, and then lots of <=20 seconds (mostly clustered around 12-13s), in the past two weeks [18:09:07] 10Operations, 10Graphite: Graphite returning server errors (out of memory?) - https://phabricator.wikimedia.org/T217679 (10Lucas_Werkmeister_WMDE) It seems this was actually caused by me – I was editing the [wikidata-edits](https://grafana.wikimedia.org/dashboard/db/wikidata-edits) board, specifically the [OAu... [18:16:17] PROBLEM - puppet last run on mc1023 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:16:27] (03CR) 10Elukey: Sync /srv/published-datasets from SWAP hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/494501 (https://phabricator.wikimedia.org/T217619) (owner: 10Ottomata) [18:22:17] (03PS1) 10GTirloni: wmcs::nfs - Refactor snapshot_manager [puppet] - 10https://gerrit.wikimedia.org/r/494519 (https://phabricator.wikimedia.org/T209527) [18:24:23] (03PS1) 10Smalyshev: Enable WikibaseCirrusSearch loading on testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494524 (https://phabricator.wikimedia.org/T217276) [18:25:53] (03CR) 10Ottomata: Sync /srv/published-datasets from SWAP hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/494501 (https://phabricator.wikimedia.org/T217619) (owner: 10Ottomata) [18:26:34] (03PS3) 10Ottomata: Sync /srv/published-datasets from SWAP hosts [puppet] - 10https://gerrit.wikimedia.org/r/494501 (https://phabricator.wikimedia.org/T217619) [18:26:37] ottomata: I am going to finish the review sorry, my brain is a little melted at this time of the day :D [18:27:35] (03CR) 10jerkins-bot: 
[V: 04-1] Sync /srv/published-datasets from SWAP hosts [puppet] - 10https://gerrit.wikimedia.org/r/494501 (https://phabricator.wikimedia.org/T217619) (owner: 10Ottomata) [18:28:07] godog/herron/cdanis: sorry, I think I just sent another evil request :( [18:28:17] I thought I’d fixed my query but apparently not [18:28:22] haha, looks like yes [18:28:24] one is fine [18:28:32] I should really just stop doing things with this board until that 60s limit is in place [18:28:40] yeah I'm working on that now [18:29:00] (03PS2) 10GTirloni: wmcs::nfs - Refactor snapshot_manager [puppet] - 10https://gerrit.wikimedia.org/r/494519 (https://phabricator.wikimedia.org/T209527) [18:29:08] okay, then I’ll just go home and check again tomorrow ^^ [18:29:11] thank you [18:31:30] (03CR) 10Elukey: Sync /srv/published-datasets from SWAP hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/494501 (https://phabricator.wikimedia.org/T217619) (owner: 10Ottomata) [18:31:56] (03PS4) 10Ottomata: Sync /srv/published-datasets from SWAP hosts [puppet] - 10https://gerrit.wikimedia.org/r/494501 (https://phabricator.wikimedia.org/T217619) [18:33:30] (03CR) 10jerkins-bot: [V: 04-1] Sync /srv/published-datasets from SWAP hosts [puppet] - 10https://gerrit.wikimedia.org/r/494501 (https://phabricator.wikimedia.org/T217619) (owner: 10Ottomata) [18:35:51] (03PS5) 10Ottomata: Sync /srv/published-datasets from SWAP hosts [puppet] - 10https://gerrit.wikimedia.org/r/494501 (https://phabricator.wikimedia.org/T217619) [18:38:49] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:39:37] PROBLEM - Request latencies on acrux is CRITICAL: instance=10.192.0.93:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:40:01] RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:40:49] RECOVERY - Request latencies on acrux is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:42:17] RECOVERY - puppet last run on mc1023 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [18:43:20] (03PS3) 10GTirloni: wmcs::nfs - Refactor snapshot_manager [puppet] - 10https://gerrit.wikimedia.org/r/494519 (https://phabricator.wikimedia.org/T209527) [18:45:07] (03CR) 10GTirloni: [C: 03+2] wmcs::nfs - Refactor snapshot_manager [puppet] - 10https://gerrit.wikimedia.org/r/494519 (https://phabricator.wikimedia.org/T209527) (owner: 10GTirloni) [18:47:11] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:48:31] PROBLEM - Request latencies on acrab is CRITICAL: instance=10.192.16.26:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:49:37] RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:49:43] RECOVERY - Request latencies on acrab is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190305T1900) [19:03:56] (03PS1) 10GTirloni: wmcs::nfs::misc - Allow incoming SSH from backup server [puppet] - 10https://gerrit.wikimedia.org/r/494528 (https://phabricator.wikimedia.org/T209527) [19:04:11] (03PS2) 10BryanDavis: toolforge: remove obsolete mailrelay manifests [puppet] - 10https://gerrit.wikimedia.org/r/494516 (https://phabricator.wikimedia.org/T208843) [19:04:38] (03CR) 10jerkins-bot: [V: 04-1] wmcs::nfs::misc - Allow incoming SSH from backup server [puppet] - 10https://gerrit.wikimedia.org/r/494528 (https://phabricator.wikimedia.org/T209527) (owner: 10GTirloni) [19:07:29] (03CR) 10Bstorm: "Since people who know much more than I do about this setup approved of it in the old location, I'm happy to just merge it." [puppet] - 10https://gerrit.wikimedia.org/r/494515 (https://phabricator.wikimedia.org/T213416) (owner: 10BryanDavis) [19:07:34] (03PS2) 10GTirloni: wmcs::nfs::misc - Allow incoming SSH from backup server [puppet] - 10https://gerrit.wikimedia.org/r/494528 (https://phabricator.wikimedia.org/T209527) [19:07:41] (03PS2) 10Bstorm: toolforge: Rewrite envelope From headers when relaying [puppet] - 10https://gerrit.wikimedia.org/r/494515 (https://phabricator.wikimedia.org/T213416) (owner: 10BryanDavis) [19:08:46] (03CR) 10Bstorm: [C: 03+2] toolforge: Rewrite envelope From headers when relaying [puppet] - 10https://gerrit.wikimedia.org/r/494515 (https://phabricator.wikimedia.org/T213416) (owner: 10BryanDavis) [19:10:10] (03CR) 10Bstorm: "Since this is theoretically more dangerous (cleaning up that is), I'll wait until the patch fixing mail problems is applied and then merge" [puppet] - 10https://gerrit.wikimedia.org/r/494516 (https://phabricator.wikimedia.org/T208843) (owner: 10BryanDavis) [19:10:23] 10Operations, 10MediaWiki-Cache, 10MW-1.33-notes (1.33.0-wmf.21; 
2019-03-12), 10Patch-For-Review, and 3 others: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10aaron) It reduces set()/setInterimKey() calls to... [19:12:28] (03PS3) 10Bstorm: toolforge: remove obsolete mailrelay manifests [puppet] - 10https://gerrit.wikimedia.org/r/494516 (https://phabricator.wikimedia.org/T208843) (owner: 10BryanDavis) [19:15:50] (03PS3) 10GTirloni: wmcs::nfs::misc - Allow incoming SSH from backup server [puppet] - 10https://gerrit.wikimedia.org/r/494528 (https://phabricator.wikimedia.org/T209527) [19:23:24] (03CR) 10GTirloni: [C: 03+2] wmcs::nfs::misc - Allow incoming SSH from backup server [puppet] - 10https://gerrit.wikimedia.org/r/494528 (https://phabricator.wikimedia.org/T209527) (owner: 10GTirloni) [19:32:29] (03PS1) 10GTirloni: wmcs::nfs::misc - Fix ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/494539 (https://phabricator.wikimedia.org/T209527) [19:34:43] (03CR) 10GTirloni: [C: 03+2] wmcs::nfs::misc - Fix ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/494539 (https://phabricator.wikimedia.org/T209527) (owner: 10GTirloni) [19:42:48] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:55:58] James_F: thank you for the LdapAuthentication hotfix ;) [19:56:05] !log hashar@deploy1001 Synchronized php-1.33.0-wmf.20/extensions/LdapAuthentication/: Stop referring to the now-killed AuthPlugin class - T217692 (duration: 00m 57s) [19:56:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:08] T217692: labtestweb2001: Fatal error: unknown class AuthPlugin in /srv/mediawiki/php-1.33.0-wmf.20/extensions/LdapAuthentication/LdapAuthenticationPlugin.php on line 21 - https://phabricator.wikimedia.org/T217692 [19:58:30] (03PS1) 10Paladox: Fix that an 
endpoint cannot be used by two plugins anymore [software/gerrit] (wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/494540 [19:58:56] (03CR) 10Paladox: [V: 03+2 C: 03+2] Fix that an endpoint cannot be used by two plugins anymore [software/gerrit] (wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/494540 (owner: 10Paladox) [20:00:04] Deploy window MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190305T2000) [20:31:06] (03PS3) 10Bmansurov: Disable reader demographics survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493236 (https://phabricator.wikimedia.org/T217080) [20:31:17] (03CR) 10jerkins-bot: [V: 04-1] Disable reader demographics survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493236 (https://phabricator.wikimedia.org/T217080) (owner: 10Bmansurov) [20:32:11] (03PS4) 10Bmansurov: Disable reader demographics survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493236 (https://phabricator.wikimedia.org/T217080) [20:38:28] PROBLEM - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - CRITICAL - prometheus_80: Servers prometheus2004.codfw.wmnet are marked down but pooled [20:38:50] PROBLEM - PyBal backends health check on lvs2006 is CRITICAL: PYBAL CRITICAL - CRITICAL - prometheus_80: Servers prometheus2004.codfw.wmnet are marked down but pooled [20:39:38] PROBLEM - LVS HTTP IPv4 on prometheus.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:39:52] PROBLEM - grafana.wikimedia.org on grafana1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:39:58] hmm [20:40:20] codfw only? [20:40:30] oh, no, grafana1001 too [20:40:37] hi [20:40:52] taking a look, only prometheus2004 was pooled [20:40:55] yeah, it is unavailable from my home [20:41:05] the public facing url [20:41:24] prometheus or grafana? 
[20:41:52] grafana is not working for me either [20:41:53] grafana.wm.o not loading for me in the browser, fwiw [20:41:54] grafana is hanging for me as well [20:41:58] PROBLEM - PyBal IPVS diff check on lvs2006 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([prometheus2004.codfw.wmnet]) [20:42:05] yike [20:42:08] I will check prometheus [20:42:14] the web server is hanging when replying [20:42:18] but it could be related to prometheus maybe [20:42:31] I would not be surprised at grafana hanging if prometheus is down [20:42:40] PROBLEM - PyBal IPVS diff check on lvs2003 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([prometheus2004.codfw.wmnet]) [20:43:00] prometheus consuming all cpu on 2004 [20:43:15] honestly, I don't know if prometheus is in a normal state because I see lots of threads but that may be normal [20:43:44] !log restarted apache on grafana1001 [20:43:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:59] wasn't there an issue before related to load? [20:44:06] yeah looks like the top user is the k8s instance [20:44:06] did that because apachectl status was hanging, fwiw [20:44:13] you mean with graphite jynus? [20:44:26] something something metrics, sorry I cannot remember [20:44:31] herron: grafana runs its own webserver too, you might need to restart it [20:44:36] herron: thanks, grafana is back for me [20:44:37] Alert Rule Result Error" logger=alerting.evalContext ruleId=61 name="TCP retransmits > 1% per [20:44:38] site:cluster alert" error="tsdb.HandleRequest() error Get http://prometheus.svc.codfw.wmnet/global/api/v1/query_range: context deadline exceeded" changing state to... etc [20:44:38] seems to be reachable now [20:44:38] RECOVERY - grafana.wikimedia.org on grafana1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49666 bytes in 0.166 second response time [20:44:46] cool, thanks!
[20:44:47] weird [20:45:04] RECOVERY - PyBal backends health check on lvs2006 is OK: PYBAL OK - All pools are healthy [20:45:05] and then some stuff and 'Rendering timed out" logger=rendering' blah blah [20:45:07] grafana1001 [20:45:08] syslog [20:45:22] so prometheus was ok? [20:45:26] no [20:45:34] [Tue Mar 05 20:43:32.258684 2019] [mpm_event:error] [pid 7146:tid 139760082445504] AH00484: server reached MaxRequestWorkers setting, consider raising the MaxRequestWorkers setting [20:45:42] RECOVERY - LVS HTTP IPv4 on prometheus.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 10959 bytes in 0.001 second response time [20:45:53] now it makes requests of prometheus and gets them ok (syslog) [20:46:02] RECOVERY - PyBal backends health check on lvs2003 is OK: PYBAL OK - All pools are healthy [20:46:15] herron: ahh, I'm going to guess that grafana wedged itself because all workers were hanging on requests going to prometheus2004 that don't have timeouts set [20:46:58] there's some sort of timeout on grafana's end I think but it's probably rather too long [20:47:08] RECOVERY - PyBal IPVS diff check on lvs2006 is OK: OK: no difference between hosts in IPVS/PyBal [20:47:35] https://grafana.wikimedia.org/d/000000377/host-overview?refresh=5m&orgId=1&var-server=prometheus2004&var-datasource=codfw%20prometheus%2Fops&var-cluster=prometheus&from=now-30m&to=now [20:47:48] RECOVERY - PyBal IPVS diff check on lvs2003 is OK: OK: no difference between hosts in IPVS/PyBal [20:48:18] what was going on on prometheus2004 then? [20:48:34] it is not clear to me [20:48:35] is 2003 depooled or something? [20:48:40] it seems much more idle [20:48:56] yes 2003 is depooled atm [20:49:19] ah, so not the root cause there, but that would explain the consequences [20:49:23] 10Operations, 10SRE-Access-Requests: Requesting access to stat1007 for sukhe - https://phabricator.wikimedia.org/T217438 (10Nuria) @ssingh do you have an NDA on file that lists the expire date for access? 
@Slaporte might know the answer to this. [20:49:31] I'm looking at the requests that hit 2004 to see if there's any obvious outlier [20:49:34] 2004 must have gotten a bunch of traffic [20:49:35] (reduced redundancy) [20:50:45] not much of an increase in network traffic [20:50:54] these queries might be really cheap though [20:51:00] packet-wise [20:51:13] yeah that's very easy to write in promql apergos [20:51:24] you can ask for the average value of a given metric over the past 3 months for instance :) [20:51:30] ugh [20:53:31] grafana1001:/var/log/grafana/grafana.log seems to have an increase in request errors starting 2019-03-05T20:35:11 [20:53:42] the main problem is "who watches the watchers" (monitoring monitoring software is hard :-D) [20:53:44] I couldn't find any smoking gun as to which prometheus instance it was btw [20:54:29] in the meantime why was there a pile of [20:55:24] msg="Database table locked, sleeping then retrying" logger=sqlstore retry=3 in grafana's syslog from grafana-server at around 19:50 UTC? [20:55:34] I know that's off topic but it sure doesn't excite me [20:55:39] filing it for later... [20:56:08] godog: I see a bunch of requests that returned ... 30MB of results for queries against the k8s prom [20:57:01] I think it is the citoid dashboard [20:57:51] cdanis: ow, from apache logs I take it?
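The point made above, that a PromQL query can be tiny on the wire yet costly to evaluate, can be made concrete with rough arithmetic. The sketch below is only a back-of-the-envelope estimate and not how Prometheus accounts for work internally; the 60-second scrape interval, the 90-day reading of "3 months", and the series count of 500 are all assumptions picked for illustration.

```python
def samples_scanned(window_seconds, scrape_interval_seconds, series_count=1):
    """Rough number of raw samples a range selector has to read.

    Assumes one sample per series per scrape interval, with no gaps.
    """
    return (window_seconds // scrape_interval_seconds) * series_count


# "3 months" taken as 90 days (assumption).
THREE_MONTHS = 90 * 24 * 3600

# Hypothetical 60 s scrape interval, a single series.
per_series = samples_scanned(THREE_MONTHS, 60)

# Hypothetical selector that matches 500 series.
total = samples_scanned(THREE_MONTHS, 60, series_count=500)

print(per_series)  # 129600
print(total)       # 64800000
```

A request like this is a few hundred bytes of HTTP, which fits the observation that network traffic barely moved while the host was busy.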
[20:57:57] but yeah easy to believe [20:58:52] https://phabricator.wikimedia.org/P8161 [20:58:58] NDAd because again IP addresses [20:59:08] (probably all internal but not sure) [20:59:21] PROBLEM - LVS HTTP IPv4 on prometheus.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:59:30] PROBLEM - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - CRITICAL - prometheus_80: Servers prometheus2004.codfw.wmnet are marked down but pooled [20:59:31] grrrrr [20:59:43] I'm silencing that for now [20:59:50] PROBLEM - PyBal backends health check on lvs2006 is CRITICAL: PYBAL CRITICAL - CRITICAL - prometheus_80: Servers prometheus2004.codfw.wmnet are marked down but pooled [21:00:04] looks like apache again [21:00:06] PROBLEM - PyBal IPVS diff check on lvs2006 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([prometheus2004.codfw.wmnet]) [21:00:42] !log restarted apache on grafana1001 [21:00:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:44] PROBLEM - grafana.wikimedia.org on grafana1001 is CRITICAL: HTTP CRITICAL - No data received from host [21:00:50] PROBLEM - PyBal IPVS diff check on lvs2003 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([prometheus2004.codfw.wmnet]) [21:00:54] got a page about lvs [21:00:56] everything ok? [21:01:12] chaomodus: instability on prometheus [21:01:53] something causing requests to back up and saturate the apache that fronts grafana [21:02:19] sorry, I meant grafana [21:04:05] ok I'm temporarily banning requests for the k8s prometheus instance via apache since clearly it isn't working [21:04:13] 10Operations, 10SRE-Access-Requests: Requesting access to stat1007 for sukhe - https://phabricator.wikimedia.org/T217438 (10ssingh) >>! In T217438#5003026, @Nuria wrote: > @ssingh do you have an NDA on file that lists the expire date for access? @Slaporte might know the answer to this. June 30, 2019. Thanks.
[21:04:22] sounds good [21:04:28] RECOVERY - grafana.wikimedia.org on grafana1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49666 bytes in 0.157 second response time [21:05:28] RECOVERY - PyBal IPVS diff check on lvs2006 is OK: OK: no difference between hosts in IPVS/PyBal [21:05:29] !log temporarily stop requests to k8s instance on prometheus2004 [21:05:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:33] RECOVERY - LVS HTTP IPv4 on prometheus.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 10959 bytes in 0.014 second response time [21:05:54] RECOVERY - PyBal backends health check on lvs2003 is OK: PYBAL OK - All pools are healthy [21:06:12] RECOVERY - PyBal IPVS diff check on lvs2003 is OK: OK: no difference between hosts in IPVS/PyBal [21:06:14] RECOVERY - PyBal backends health check on lvs2006 is OK: PYBAL OK - All pools are healthy [21:07:32] (03PS1) 10Bmansurov: Enable reader trust survey v2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494551 (https://phabricator.wikimedia.org/T217576) [21:08:05] ok I commented the proxypass in apache, a bit brutal but seems to have worked [21:08:29] at least the write path is still there so metrics are collected [21:08:43] (03PS1) 10Bmansurov: Disable reader trust survey v2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494552 (https://phabricator.wikimedia.org/T217576) [21:09:41] I have a rant building inside of me about systems that don't log anything when they begin executing requests [21:10:34] more logging is good logging [21:12:10] indeed [21:13:05] did we change anything today about the cardinality of k8s metrics? [21:13:18] I've disabled puppet on prometheus2004 btw so the k8s instance ban remains in place for now [21:13:35] it is possible that ramping up services in k8s have increased the cardinality (?) 
[21:13:53] yeah it is possible [21:17:04] ok so next steps are for sure enforcing stricter timeouts all the way down, apache, prometheus and grafana at least [21:17:20] anyway executing some queries like http://prometheus.svc.codfw.wmnet/k8s/api/v1/query_range?query=sum(rate(service_runner_request_duration_seconds_count%7Bservice%3D%22citoid%22%7D%5B5m%5D))&start=1551776520&end=1551819780&step=60 seems ~200x more expensive now than it was 12 hours ago [21:17:50] maybe some of those could also be pre-aggregated ? [21:18:09] excuse me -- that was fine as of 2019-03-05T15:02:59 so just about six hours ago [21:18:24] (which I suspect was the last time anyone looked at the citoid dashboard) [21:18:45] jynus: yeah, having recording rules for some of these is probably a good idea [21:18:49] cdanis: https://grafana.wikimedia.org/d/000000445/kubernetes-pods?panelId=7&fullscreen&orgId=1&from=now-2d&to=now [21:18:56] probably not a coincidence [21:19:21] talk with service ops, pods are probably not but increasing soon [21:19:31] (more, I mean) [21:21:34] ok so we have a somewhat lame mitigation in place but no definitive root cause yet, as a status update [21:21:58] yeah [21:24:05] I don't have any particularly good ideas ATM on how to further mitigate this, especially at this time of day, although if someone has input it is welcome of course [21:24:34] I don't know what to do aside from figure out what exploded in the tsdb [21:24:43] why is 2004 depooled? for data migration? [21:25:14] 2003 is depooled, yeah because of prometheus 2 migration [21:26:27] also begs the question of why eqiad seems to be ok [21:26:31] yeah [21:26:36] (03PS1) 10Volans: wikitech-static: add TXT records for Mailgun [dns] - 10https://gerrit.wikimedia.org/r/494561 (https://phabricator.wikimedia.org/T217599) [21:26:40] ...
I just built and tried to use https://www.robustperception.io/using-tsdb-analyze-to-investigate-churn-and-cardinality [21:26:52] it bugs me that a prometheus issue in codfw can impact anything that's 'production'... I guess there's no way to tell grafana that honestly just ignore problems over there [21:26:52] but of course it won't work because it is tsdb v2 [21:27:20] ah [21:27:22] figures [21:27:29] indeed, v2 only [21:28:31] and no one is writing tools for v1 any more I guess? [21:29:46] (03CR) 10Herron: [C: 03+1] wikitech-static: add TXT records for Mailgun [dns] - 10https://gerrit.wikimedia.org/r/494561 (https://phabricator.wikimedia.org/T217599) (owner: 10Volans) [21:30:44] godog: I'm going to temporarily reenable k8s queries to look at prometheus's internal metrics about itself for k8s [21:32:01] cdanis: sure, it is going to fall over though I think [21:32:21] yeah already shut back off [21:32:32] an "alternative" would be to ssh-forward prometheus port and then use prometheus native ui [21:32:35] ah ok [21:33:03] https://i.imgur.com/zoW1fBf.png not exactly enlightening. [21:33:24] eyeroll [21:33:28] /14/3 [21:36:50] (03CR) 10Volans: [C: 03+2] wikitech-static: add TXT records for Mailgun [dns] - 10https://gerrit.wikimedia.org/r/494561 (https://phabricator.wikimedia.org/T217599) (owner: 10Volans) [21:36:53] for the service IP to be down, apache2 on that machine has to be unhealthy as well, right? [21:38:47] that's correct yes [21:39:54] looks like one problem amongst many is too much concurrency grafana -> apache on prometheus for sure [21:47:25] there's supposed to be a concurrency limit configurable in prometheus (says google), along with the total eval timer [22:00:51] (03PS3) 10Volans: icinga: add check_icinga script [software/external-monitoring] - 10https://gerrit.wikimedia.org/r/493298 (https://phabricator.wikimedia.org/T217599) [22:24:25] 15:56:08 go.dog: I see a bunch of requests that returned ... 
30MB of results for queries against the k8s prom [22:24:28] this was wrong btw [22:24:34] that wasn't bytes, that was duration in microseconds [22:24:38] so, 30 seconds [22:25:19] actual response size fairly small (order of a dozen kB or so) [22:36:05] 10Operations, 10monitoring: INCIDENT: k8s@codfw prometheus queries disabled -- very slow to execute some queries - https://phabricator.wikimedia.org/T217715 (10CDanis) [22:36:15] 10Operations, 10monitoring: INCIDENT: k8s@codfw prometheus queries disabled -- very slow to execute some queries - https://phabricator.wikimedia.org/T217715 (10CDanis) p:05Triage→03High [22:37:05] 10Operations, 10monitoring: INCIDENT: k8s@codfw prometheus queries disabled -- very slow to execute some queries - https://phabricator.wikimedia.org/T217715 (10CDanis) [22:39:08] 10Operations, 10SRE-Access-Requests: Requesting access to stat1007 for sukhe - https://phabricator.wikimedia.org/T217438 (10Nuria) Can @slaporte confirm NDA was signed? [22:44:37] 10Operations, 10Mail, 10Phabricator: DomainKeys Identified Mail (DKIM) for phabricator.wikimedia.org - https://phabricator.wikimedia.org/T116805 (10Niedzielski) @aklapper, {icon thumbs-up}, good to know. Thanks! [23:02:24] 10Operations, 10ops-eqiad, 10Traffic, 10decommission: Decommission old eqiad caches - https://phabricator.wikimedia.org/T208584 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for cp1058.eqiad.wmnet and performed the following actions: - Revoked Puppet certificate - Removed from PuppetDB... [23:02:36] 10Operations, 10ops-eqiad, 10Traffic, 10decommission: Decommission old eqiad caches - https://phabricator.wikimedia.org/T208584 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for cp1059.eqiad.wmnet and performed the following actions: - Revoked Puppet certificate - Removed from PuppetDB... 
[23:02:50] 10Operations, 10ops-eqiad, 10Traffic, 10decommission: Decommission old eqiad caches - https://phabricator.wikimedia.org/T208584 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for cp1060.eqiad.wmnet and performed the following actions: - Revoked Puppet certificate - Removed from PuppetDB... [23:03:04] 10Operations, 10ops-eqiad, 10Traffic, 10decommission: Decommission old eqiad caches - https://phabricator.wikimedia.org/T208584 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for cp1061.eqiad.wmnet and performed the following actions: - Revoked Puppet certificate - Removed from PuppetDB... [23:03:15] 10Operations, 10ops-eqiad, 10Traffic, 10decommission: Decommission old eqiad caches - https://phabricator.wikimedia.org/T208584 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for cp1062.eqiad.wmnet and performed the following actions: - Revoked Puppet certificate - Removed from PuppetDB... [23:03:32] 10Operations, 10ops-eqiad, 10Traffic, 10decommission: Decommission old eqiad caches - https://phabricator.wikimedia.org/T208584 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for cp1063.eqiad.wmnet and performed the following actions: - Revoked Puppet certificate - Removed from PuppetDB... [23:03:45] 10Operations, 10ops-eqiad, 10Traffic, 10decommission: Decommission old eqiad caches - https://phabricator.wikimedia.org/T208584 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for cp1064.eqiad.wmnet and performed the following actions: - Revoked Puppet certificate - Removed from PuppetDB... [23:03:57] 10Operations, 10ops-eqiad, 10Traffic, 10decommission: Decommission old eqiad caches - https://phabricator.wikimedia.org/T208584 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for cp1065.eqiad.wmnet and performed the following actions: - Revoked Puppet certificate - Removed from PuppetDB... 
[23:04:10] 10Operations, 10ops-eqiad, 10Traffic, 10decommission: Decommission old eqiad caches - https://phabricator.wikimedia.org/T208584 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for cp1066.eqiad.wmnet and performed the following actions: - Revoked Puppet certificate - Removed from PuppetDB... [23:04:22] 10Operations, 10ops-eqiad, 10Traffic, 10decommission: Decommission old eqiad caches - https://phabricator.wikimedia.org/T208584 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for cp1067.eqiad.wmnet and performed the following actions: - Revoked Puppet certificate - Removed from PuppetDB... [23:04:35] 10Operations, 10ops-eqiad, 10Traffic, 10decommission: Decommission old eqiad caches - https://phabricator.wikimedia.org/T208584 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for cp1068.eqiad.wmnet and performed the following actions: - Revoked Puppet certificate - Removed from PuppetDB... [23:11:05] (03PS1) 10RobH: decom cp10[58-68] prod dns [dns] - 10https://gerrit.wikimedia.org/r/494617 (https://phabricator.wikimedia.org/T208584) [23:12:01] (03CR) 10RobH: [C: 03+2] decom cp10[58-68] prod dns [dns] - 10https://gerrit.wikimedia.org/r/494617 (https://phabricator.wikimedia.org/T208584) (owner: 10RobH) [23:16:26] (03PS1) 10RobH: decom cp10[58-68] repo entries [puppet] - 10https://gerrit.wikimedia.org/r/494618 (https://phabricator.wikimedia.org/T208584) [23:17:15] (03CR) 10RobH: [C: 03+2] decom cp10[58-68] repo entries [puppet] - 10https://gerrit.wikimedia.org/r/494618 (https://phabricator.wikimedia.org/T208584) (owner: 10RobH) [23:19:48] 10Operations, 10ops-eqiad, 10Traffic, 10decommission, 10Patch-For-Review: Decommission old eqiad caches - https://phabricator.wikimedia.org/T208584 (10RobH) [23:20:08] (03PS1) 10CDanis: add uwsgi worker timeouts + max RSS for graphite [puppet] - 10https://gerrit.wikimedia.org/r/494620 (https://phabricator.wikimedia.org/T116767) [23:20:29] 10Operations, 10ops-eqiad, 
10Traffic, 10decommission, 10Patch-For-Review: Decommission old eqiad caches - https://phabricator.wikimedia.org/T208584 (10RobH) a:03Cmjohnson [23:20:52] (03CR) 10jerkins-bot: [V: 04-1] add uwsgi worker timeouts + max RSS for graphite [puppet] - 10https://gerrit.wikimedia.org/r/494620 (https://phabricator.wikimedia.org/T116767) (owner: 10CDanis) [23:21:55] 18:20:47 Line 3: Bug: value must be a single phabricator task ID [23:21:59] lol, but providing multiple values works [23:22:23] (03PS2) 10CDanis: add uwsgi worker timeouts + max RSS for graphite [puppet] - 10https://gerrit.wikimedia.org/r/494620 (https://phabricator.wikimedia.org/T116767) [23:23:01] (03CR) 10jerkins-bot: [V: 04-1] add uwsgi worker timeouts + max RSS for graphite [puppet] - 10https://gerrit.wikimedia.org/r/494620 (https://phabricator.wikimedia.org/T116767) (owner: 10CDanis) [23:25:17] (03PS3) 10CDanis: add uwsgi worker timeouts + max RSS for graphite [puppet] - 10https://gerrit.wikimedia.org/r/494620 (https://phabricator.wikimedia.org/T116767) [23:28:10] (03PS4) 10CDanis: add uwsgi worker timeouts + max RSS for graphite [puppet] - 10https://gerrit.wikimedia.org/r/494620 (https://phabricator.wikimedia.org/T116767) [23:29:03] (03CR) 10jerkins-bot: [V: 04-1] add uwsgi worker timeouts + max RSS for graphite [puppet] - 10https://gerrit.wikimedia.org/r/494620 (https://phabricator.wikimedia.org/T116767) (owner: 10CDanis) [23:29:34] 10Operations, 10Cloud-VPS, 10Toolforge, 10LDAP: groups: cannot find name for group ID - https://phabricator.wikimedia.org/T217280 (10bd808) Adding the #operations tag here to see if anyone from that group has time to investigate the LDAP server side. 
[23:30:32] cdanis the plugin only allows one Bug: line (so you can have multiple Bug: but not multiple values in Bug:) [23:31:39] 10Operations, 10DC-Ops, 10Parsoid, 10decommission, 10Patch-For-Review: decom ruthenium - https://phabricator.wikimedia.org/T216062 (10RobH) [23:31:54] yeah ty paladox :) [23:32:04] you're welcome :) [23:32:05] I saw some old commits with multiple values in a single Bug: line [23:32:11] but they were likely _very_ old [23:32:13] by plugin I mean its-phabricator [23:33:57] (03PS5) 10CDanis: add uwsgi worker timeouts + max RSS for graphite [puppet] - 10https://gerrit.wikimedia.org/r/494620 (https://phabricator.wikimedia.org/T116767) [23:34:47] (03CR) 10jerkins-bot: [V: 04-1] add uwsgi worker timeouts + max RSS for graphite [puppet] - 10https://gerrit.wikimedia.org/r/494620 (https://phabricator.wikimedia.org/T116767) (owner: 10CDanis) [23:34:48] cdanis which year were they from? 2014? [23:35:10] dunno, didn't look [23:35:23] literally just did: git log | grep 'Bug.*,' [23:35:25] on a hunch [23:35:29] prior to git transition, we used to put the bug in the subject line [23:35:54] So there's probably a pretty short time period after that but before the bug line became super standardized [23:35:57] (03PS1) 10RobH: decom ruthenium [puppet] - 10https://gerrit.wikimedia.org/r/494626 (https://phabricator.wikimedia.org/T216062) [23:36:31] (03PS6) 10CDanis: add uwsgi worker timeouts + max RSS for graphite [puppet] - 10https://gerrit.wikimedia.org/r/494620 (https://phabricator.wikimedia.org/T116767) [23:36:40] (03PS1) 10RobH: decom ruthenium prod dns entries [dns] - 10https://gerrit.wikimedia.org/r/494628 (https://phabricator.wikimedia.org/T216062) [23:37:08] (03CR) 10RobH: [C: 03+2] decom ruthenium [puppet] - 10https://gerrit.wikimedia.org/r/494626 (https://phabricator.wikimedia.org/T216062) (owner: 10RobH) [23:37:28] (03CR) 10RobH: [C: 03+2] decom ruthenium prod dns entries [dns] - 10https://gerrit.wikimedia.org/r/494628
(https://phabricator.wikimedia.org/T216062) (owner: 10RobH) [23:38:16] (03CR) 10jerkins-bot: [V: 04-1] add uwsgi worker timeouts + max RSS for graphite [puppet] - 10https://gerrit.wikimedia.org/r/494620 (https://phabricator.wikimedia.org/T116767) (owner: 10CDanis) [23:40:51] 10Operations, 10DC-Ops, 10Parsoid, 10decommission, 10Patch-For-Review: decom ruthenium - https://phabricator.wikimedia.org/T216062 (10RobH) a:05RobH→03Cmjohnson [23:41:13] 10Operations, 10ops-eqiad, 10DC-Ops, 10Parsoid, 10decommission: decom ruthenium - https://phabricator.wikimedia.org/T216062 (10RobH) [23:42:02] (03PS7) 10CDanis: add uwsgi worker timeouts + max RSS for graphite [puppet] - 10https://gerrit.wikimedia.org/r/494620 (https://phabricator.wikimedia.org/T116767) [23:43:34] (03CR) 10jerkins-bot: [V: 04-1] add uwsgi worker timeouts + max RSS for graphite [puppet] - 10https://gerrit.wikimedia.org/r/494620 (https://phabricator.wikimedia.org/T116767) (owner: 10CDanis) [23:50:02] (03PS8) 10CDanis: add uwsgi worker timeouts + max RSS for graphite [puppet] - 10https://gerrit.wikimedia.org/r/494620 (https://phabricator.wikimedia.org/T116767) [23:55:58] (03PS1) 10Andrew Bogott: bootstrap-vz: use a custom build of bootstrap-vz on Buster [puppet] - 10https://gerrit.wikimedia.org/r/494629 (https://phabricator.wikimedia.org/T216781) [23:56:17] (03PS9) 10CDanis: add uwsgi worker timeouts + max RSS for graphite [puppet] - 10https://gerrit.wikimedia.org/r/494620 (https://phabricator.wikimedia.org/T116767) [23:56:31] (03CR) 10jerkins-bot: [V: 04-1] bootstrap-vz: use a custom build of bootstrap-vz on Buster [puppet] - 10https://gerrit.wikimedia.org/r/494629 (https://phabricator.wikimedia.org/T216781) (owner: 10Andrew Bogott) [23:58:02] (03PS2) 10Andrew Bogott: bootstrap-vz: use a custom build of bootstrap-vz on Buster [puppet] - 10https://gerrit.wikimedia.org/r/494629 (https://phabricator.wikimedia.org/T216781) [23:59:27] (03CR) 10Andrew Bogott: [C: 03+2] bootstrap-vz: use a custom build of 
bootstrap-vz on Buster [puppet] - 10https://gerrit.wikimedia.org/r/494629 (https://phabricator.wikimedia.org/T216781) (owner: 10Andrew Bogott) [23:59:59] (03PS10) 10CDanis: add uwsgi worker timeouts + max RSS for graphite [puppet] - 10https://gerrit.wikimedia.org/r/494620 (https://phabricator.wikimedia.org/T116767)
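The Bug: footer rule discussed at 23:21 to 23:35 (one task ID per Bug: line, several Bug: lines allowed) can be sketched as a small check. This is not the actual CI linter or the its-phabricator plugin, only an illustration: the function name `check_bug_footers`, the `T` plus digits task-ID pattern, and the sample commit messages are assumptions, and the error text is simply the wording quoted in the log.

```python
import re

# Assumed task-ID shape: "T" followed by digits, as in T116767.
TASK_ID = re.compile(r"T\d+")

ERROR = "Bug: value must be a single phabricator task ID"


def check_bug_footers(commit_message):
    """Return (line_number, problem) pairs for malformed Bug: lines.

    Hypothetical helper: every "Bug:" line must carry exactly one
    task ID; any number of separate "Bug:" lines is accepted.
    """
    problems = []
    for number, line in enumerate(commit_message.splitlines(), start=1):
        if not line.startswith("Bug:"):
            continue
        value = line[len("Bug:"):].strip()
        if not TASK_ID.fullmatch(value):
            problems.append((number, ERROR))
    return problems


# Two values on one Bug: line (made-up message) is flagged on line 3.
bad = "add uwsgi worker timeouts\n\nBug: T116767, T217715\n"
# One value per line, repeated, passes.
good = "add uwsgi worker timeouts\n\nBug: T116767\nBug: T217715\n"

print(check_bug_footers(bad))
print(check_bug_footers(good))
```

The same idea is behind the quick history search quoted above, `git log | grep 'Bug.*,'`, which looks for a comma after `Bug` as a hint of several values on one line.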