[00:04:23] PROBLEM - Wikitech-static main page has content on labweb1002 is CRITICAL: connect to address wikitech-static.wikimedia.org and port 443: Connection refused https://wikitech.wikimedia.org/wiki/Wikitech-static
[00:04:23] PROBLEM - HTTPS-status-wikimedia-org on wikitech-static.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/project/view/2773/
[00:04:25] PROBLEM - Wikitech-static main page has content on labweb1001 is CRITICAL: connect to address wikitech-static.wikimedia.org and port 443: Connection refused https://wikitech.wikimedia.org/wiki/Wikitech-static
[00:05:17] PROBLEM - HTTPS-wikitech-static on wikitech-static.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/project/view/2773/
[00:10:33] (PS1) CDanis: sync_icinga_state: kludge nsca stuckness [puppet] - https://gerrit.wikimedia.org/r/499028 (https://phabricator.wikimedia.org/T196336)
[00:13:45] loooooooks like certbot didn't restart apache right?
[00:14:33] RECOVERY - HTTPS-status-wikimedia-org on wikitech-static.wikimedia.org is OK: SSL OK - Certificate wikitech-static.wikimedia.org valid until 2019-06-23 23:01:36 +0000 (expires in 89 days) https://phabricator.wikimedia.org/project/view/2773/
[00:14:37] RECOVERY - Wikitech-static main page has content on labweb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 33525 bytes in 3.493 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[00:14:37] RECOVERY - Wikitech-static main page has content on labweb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 33525 bytes in 4.808 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[00:15:27] RECOVERY - HTTPS-wikitech-static on wikitech-static.wikimedia.org is OK: SSL OK - Certificate wikitech-static.wikimedia.org valid until 2019-06-23 23:01:36 +0000 (expires in 89 days) https://phabricator.wikimedia.org/project/view/2773/
[00:16:04] (CR) Jforrester: "And, tah-dah, the i18n is now live on the Beta Cluster: https://commons.wikimedia.beta.wmflabs.org/wiki/MediaWiki:Wikibaselexeme-desc etc." [mediawiki-config] - https://gerrit.wikimedia.org/r/497989 (https://phabricator.wikimedia.org/T216206) (owner: Smalyshev)
[00:16:18] (CR) Jforrester: [C: +2] [BETA] Enable WikibaseLexemeCirrusSearch on beta wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/497990 (https://phabricator.wikimedia.org/T216206) (owner: Smalyshev)
[00:17:37] (Merged) jenkins-bot: [BETA] Enable WikibaseLexemeCirrusSearch on beta wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/497990 (https://phabricator.wikimedia.org/T216206) (owner: Smalyshev)
[00:21:42] (PS2) Alex Monk: scap: Make wmflabs php7 behaviour match prod's [puppet] - https://gerrit.wikimedia.org/r/499025 (https://phabricator.wikimedia.org/T219242)
[00:23:01] (CR) jenkins-bot: [BETA] Enable WikibaseLexemeCirrusSearch on beta wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/497990 (https://phabricator.wikimedia.org/T216206) (owner: Smalyshev)
[00:26:08] (CR) Volans: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/499028 (https://phabricator.wikimedia.org/T196336) (owner: CDanis)
[00:28:14] Operations, wikitech.wikimedia.org: wikitech-static cert about to expire - https://phabricator.wikimedia.org/T214640 (CDanis) Resolved→Open Looks like certbot renews the cert but doesn't restart apache correctly? 2019-03-26 00:04:23 <+icinga-wm> PROBLEM - Wikitech-static main page has content on...
[00:37:15] (PS1) BryanDavis: toolforge: add python-bs4 package [puppet] - https://gerrit.wikimedia.org/r/499029 (https://phabricator.wikimedia.org/T162570)
[00:56:29] Operations, Toolforge, Traffic, HTTPS, Patch-For-Review: Migrate tools.wmflabs.org to https only (and set HSTS) - https://phabricator.wikimedia.org/T102367 (bd808)
[01:09:11] (PS3) Alex Monk: scap: Make wmflabs php7 behaviour match prod's [puppet] - https://gerrit.wikimedia.org/r/499025 (https://phabricator.wikimedia.org/T219242)
[01:24:23] (PS1) BryanDavis: toolforge: remove deleted Trusty host aliases [puppet] - https://gerrit.wikimedia.org/r/499031 (https://phabricator.wikimedia.org/T109485)
[02:20:31] (PS1) CRusnov: Add basic Ganeti RAPI module and tests [software/spicerack] - https://gerrit.wikimedia.org/r/499032
[02:21:47] (CR) jerkins-bot: [V: -1] Add basic Ganeti RAPI module and tests [software/spicerack] - https://gerrit.wikimedia.org/r/499032 (owner: CRusnov)
[02:29:47] (PS2) CRusnov: Add basic Ganeti RAPI module and tests [software/spicerack] - https://gerrit.wikimedia.org/r/499032
[03:10:45] (PS8) CRusnov: Port MakeVM to cookbook. [cookbooks] - https://gerrit.wikimedia.org/r/496527 (https://phabricator.wikimedia.org/T203963)
[03:12:55] (PS9) CRusnov: Port MakeVM to cookbook. [cookbooks] - https://gerrit.wikimedia.org/r/496527 (https://phabricator.wikimedia.org/T203963)
[03:13:55] (PS10) CRusnov: Port MakeVM to cookbook. [cookbooks] - https://gerrit.wikimedia.org/r/496527 (https://phabricator.wikimedia.org/T203963)
[03:17:08] (CR) CRusnov: "As we discussed, moved the dynamic functionality to a small Spicerack module, which can obtain the master node from the API. Also use the " (10 comments) [cookbooks] - https://gerrit.wikimedia.org/r/496527 (https://phabricator.wikimedia.org/T203963) (owner: CRusnov)
[03:18:15] (CR) CRusnov: [C: +2] Add report which checks against puppetdb and compares serial numbers [software/netbox-reports] - https://gerrit.wikimedia.org/r/495267 (https://phabricator.wikimedia.org/T212526) (owner: CRusnov)
[03:22:02] (Abandoned) DannyS712: Change '/r/p/' to '/r/' for gerrit links [puppet] - https://gerrit.wikimedia.org/r/498167 (owner: DannyS712)
[03:30:06] (PS3) CRusnov: Add basic Ganeti RAPI module and tests [software/spicerack] - https://gerrit.wikimedia.org/r/499032
[03:39:41] (PS1) DannyS712: Remove the ability of non-administrators to move category pages on the english wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/499042
[03:47:58] (PS2) DannyS712: Remove the ability of non-administrators to move category pages on the english wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/499042
[03:51:49] PROBLEM - puppet last run on ms-be1037 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:06:02] (CR) Pppery: "There was some talk about giving the user right to bots and page movers (called "extendedmover" on gerrit) during the actual RfC. I sugges" [mediawiki-config] - https://gerrit.wikimedia.org/r/499042 (owner: DannyS712)
[04:10:29] PROBLEM - puppet last run on conf1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:18:11] RECOVERY - puppet last run on ms-be1037 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[04:36:47] RECOVERY - puppet last run on conf1005 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[04:51:52] (CR) JJMC89: "See https://www.mediawiki.org/wiki/Gerrit/Commit_message_guidelines" (3 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/499042 (owner: DannyS712)
[05:06:03] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 53, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:06:33] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 66, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:24:27] PROBLEM - Check for gridmaster host resolution UDP on cloudservices1004 is CRITICAL: DNS CRITICAL - 0.017 seconds response time (No ANSWER SECTION found) https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[05:30:10] (PS2) Smalyshev: [BETA] Enable WikibaseLexemeCirrusSearch on beta commons [mediawiki-config] - https://gerrit.wikimedia.org/r/497991 (https://phabricator.wikimedia.org/T216206)
[05:30:25] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 55, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:30:57] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:31:20] (Abandoned) Smalyshev: [BETA] Enable WikibaseLexemeCirrusSearch on beta commons [mediawiki-config] - https://gerrit.wikimedia.org/r/497991 (https://phabricator.wikimedia.org/T216206) (owner: Smalyshev)
[05:53:53] (PS8) Marostegui: install_server: Add db1139,db1140 [puppet] - https://gerrit.wikimedia.org/r/498768 (https://phabricator.wikimedia.org/T218985)
[05:55:12] (CR) Marostegui: [C: +2] install_server: Add db1139,db1140 [puppet] - https://gerrit.wikimedia.org/r/498768 (https://phabricator.wikimedia.org/T218985) (owner: Marostegui)
[05:55:17] (PS1) Marostegui: db-eqiad.php: Depool db1106 [mediawiki-config] - https://gerrit.wikimedia.org/r/499047
[05:59:53] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1106 [mediawiki-config] - https://gerrit.wikimedia.org/r/499047 (owner: Marostegui)
[06:01:12] (Merged) jenkins-bot: db-eqiad.php: Depool db1106 [mediawiki-config] - https://gerrit.wikimedia.org/r/499047 (owner: Marostegui)
[06:01:29] (CR) Giuseppe Lavagetto: [C: -1] "See inline, a bit more work is needed." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/499025 (https://phabricator.wikimedia.org/T219242) (owner: Alex Monk)
[06:01:53] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 66, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:02:11] !log Deploy schema change on db1106, this will generate lag on s1 on labs hosts
[06:02:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:02:17] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1106 (duration: 00m 51s)
[06:02:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:02:37] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 53, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:07:04] (CR) jenkins-bot: db-eqiad.php: Depool db1106 [mediawiki-config] - https://gerrit.wikimedia.org/r/499047 (owner: Marostegui)
[06:08:12] (PS1) Marostegui: site.pp: Add db1139 and db1140 as spares. [puppet] - https://gerrit.wikimedia.org/r/499048 (https://phabricator.wikimedia.org/T218985)
[06:09:54] (CR) Giuseppe Lavagetto: [C: -2] "I would prefer to keep the puppet compiler for prod hosts and labs VMs separated, for various reasons." [puppet] - https://gerrit.wikimedia.org/r/499026 (owner: Andrew Bogott)
[06:19:07] (PS2) Marostegui: site.pp: Add db1139 and db1140 as spares. [puppet] - https://gerrit.wikimedia.org/r/499048 (https://phabricator.wikimedia.org/T218985)
[06:19:51] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:20:24] Operations, Toolforge, Traffic, HTTPS, Patch-For-Review: Migrate tools.wmflabs.org to https only (and set HSTS) - https://phabricator.wikimedia.org/T102367 (Vgutierrez) Currently tools.wmflabs.org is violating [[ https://tools.ietf.org/html/rfc6797#section-7.2 | RFC 6797 section 7.2 ]] by sen...
[06:20:35] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 55, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:27:34] Operations, Toolforge, Traffic, HTTPS, Patch-For-Review: Migrate tools.wmflabs.org to https only (and set HSTS) - https://phabricator.wikimedia.org/T102367 (Vgutierrez) It's also violating [[ https://tools.ietf.org/html/rfc6797#section-7.1 | RFC 6797 section 7.1 ]] by sending the HSTS header...
[06:32:04] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1106" [mediawiki-config] - https://gerrit.wikimedia.org/r/499050
[06:33:50] (CR) Marostegui: [C: +2] Revert "db-eqiad.php: Depool db1106" [mediawiki-config] - https://gerrit.wikimedia.org/r/499050 (owner: Marostegui)
[06:34:50] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1106" [mediawiki-config] - https://gerrit.wikimedia.org/r/499050 (owner: Marostegui)
[06:35:59] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1106 (duration: 00m 52s)
[06:36:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:39:51] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1106" [mediawiki-config] - https://gerrit.wikimedia.org/r/499050 (owner: Marostegui)
[06:42:23] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 53, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:43:01] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 66, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:46:17] (PS1) Marostegui: db-eqiad.php: Depool db1119 [mediawiki-config] - https://gerrit.wikimedia.org/r/499051
[06:47:31] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 55, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:48:09] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:49:20] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1119 [mediawiki-config] - https://gerrit.wikimedia.org/r/499051 (owner: Marostegui)
[06:50:19] (Merged) jenkins-bot: db-eqiad.php: Depool db1119 [mediawiki-config] - https://gerrit.wikimedia.org/r/499051 (owner: Marostegui)
[06:50:51] (CR) jenkins-bot: db-eqiad.php: Depool db1119 [mediawiki-config] - https://gerrit.wikimedia.org/r/499051 (owner: Marostegui)
[06:51:22] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1119 (duration: 00m 50s)
[06:51:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:56:31] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 53, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:57:11] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 66, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:10:27] (PS9) Giuseppe Lavagetto: profile::mediawiki::maintenance: systemd-timer based periodic jobs [puppet] - https://gerrit.wikimedia.org/r/482792 (https://phabricator.wikimedia.org/T211250)
[07:10:58] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1119" [mediawiki-config] - https://gerrit.wikimedia.org/r/499054
[07:12:27] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:13:07] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 55, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:14:26] (CR) Marostegui: [C: +2] Revert "db-eqiad.php: Depool db1119" [mediawiki-config] - https://gerrit.wikimedia.org/r/499054 (owner: Marostegui)
[07:15:23] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1119" [mediawiki-config] - https://gerrit.wikimedia.org/r/499054 (owner: Marostegui)
[07:16:28] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1119 (duration: 00m 49s)
[07:16:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:23:17] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 53, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:23:55] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1119" [mediawiki-config] - https://gerrit.wikimedia.org/r/499054 (owner: Marostegui)
[07:23:59] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 66, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:38:31] (PS4) Jcrespo: BackupStatistics: Narrow the search for the treated backup [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/498782 (https://phabricator.wikimedia.org/T206203)
[07:38:51] (CR) Filippo Giunchedi: [C: +2] Add Icinga alert to ping-offload dashboard alerts [puppet] - https://gerrit.wikimedia.org/r/498264 (https://phabricator.wikimedia.org/T190090) (owner: Ayounsi)
[07:38:58] (CR) jerkins-bot: [V: -1] BackupStatistics: Narrow the search for the treated backup [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/498782 (https://phabricator.wikimedia.org/T206203) (owner: Jcrespo)
[07:43:53] (PS5) Jcrespo: BackupStatistics: Narrow the search for the treated backup [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/498782 (https://phabricator.wikimedia.org/T206203)
[07:44:14] (CR) jerkins-bot: [V: -1] BackupStatistics: Narrow the search for the treated backup [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/498782 (https://phabricator.wikimedia.org/T206203) (owner: Jcrespo)
[07:44:27] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:45:01] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 55, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:52:17] (PS1) Jcrespo: mariadb-backups: Update data gathering code to the latest version [puppet] - https://gerrit.wikimedia.org/r/499128 (https://phabricator.wikimedia.org/T206203)
[07:53:36] (CR) jerkins-bot: [V: -1] mariadb-backups: Update data gathering code to the latest version [puppet] - https://gerrit.wikimedia.org/r/499128 (https://phabricator.wikimedia.org/T206203) (owner: Jcrespo)
[07:58:01] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 53, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:58:27] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 66, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:02:29] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 55, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:02:59] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:07:53] (PS2) Jcrespo: mariadb-backups: Update data gathering code to the latest version [puppet] - https://gerrit.wikimedia.org/r/499128 (https://phabricator.wikimedia.org/T206203)
[08:08:59] (CR) jerkins-bot: [V: -1] mariadb-backups: Update data gathering code to the latest version [puppet] - https://gerrit.wikimedia.org/r/499128 (https://phabricator.wikimedia.org/T206203) (owner: Jcrespo)
[08:09:26] (PS6) Jcrespo: BackupStatistics: Narrow the search for the treated backup [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/498782 (https://phabricator.wikimedia.org/T206203)
[08:09:55] (CR) jerkins-bot: [V: -1] BackupStatistics: Narrow the search for the treated backup [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/498782 (https://phabricator.wikimedia.org/T206203) (owner: Jcrespo)
[08:09:58] !log Deploy schema change on s2 codfw master, this will generate lag on codfw s2
[08:10:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:21] (PS7) Jcrespo: BackupStatistics: Narrow the search for the treated backup [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/498782 (https://phabricator.wikimedia.org/T206203)
[08:11:43] (CR) jerkins-bot: [V: -1] BackupStatistics: Narrow the search for the treated backup [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/498782 (https://phabricator.wikimedia.org/T206203) (owner: Jcrespo)
[08:12:00] (PS3) Jcrespo: mariadb-backups: Update data gathering code to the latest version [puppet] - https://gerrit.wikimedia.org/r/499128 (https://phabricator.wikimedia.org/T206203)
[08:19:40] (CR) Marostegui: [C: +2] site.pp: Add db1139 and db1140 as spares. [puppet] - https://gerrit.wikimedia.org/r/499048 (https://phabricator.wikimedia.org/T218985) (owner: Marostegui)
[08:27:27] (CR) Dzahn: [C: +2] add vikipedia.com as parked domain [dns] - https://gerrit.wikimedia.org/r/498905 (owner: Dzahn)
[08:27:34] (PS2) Dzahn: add vikipedia.com as parked domain [dns] - https://gerrit.wikimedia.org/r/498905
[08:32:23] (PS6) Dzahn: mediawiki: add data types to parameters [puppet] - https://gerrit.wikimedia.org/r/485104
[08:38:41] (PS7) Dzahn: mediawiki: add data types to parameters [puppet] - https://gerrit.wikimedia.org/r/485104
[08:43:09] (PS3) Ema: varnish: set /w/load.php Age to 0 [puppet] - https://gerrit.wikimedia.org/r/496497 (https://phabricator.wikimedia.org/T105657)
[08:44:11] (PS8) Dzahn: mediawiki: add data types to parameters [puppet] - https://gerrit.wikimedia.org/r/485104
[08:44:54] (CR) Ema: [C: +2] varnish: set /w/load.php Age to 0 [puppet] - https://gerrit.wikimedia.org/r/496497 (https://phabricator.wikimedia.org/T105657) (owner: Ema)
[08:45:17] (PS9) Dzahn: mediawiki: add data types to parameters [puppet] - https://gerrit.wikimedia.org/r/485104
[08:45:45] (CR) Dzahn: [V: +1] "https://puppet-compiler.wmflabs.org/compiler1002/15335/" [puppet] - https://gerrit.wikimedia.org/r/485104 (owner: Dzahn)
[08:46:41] (CR) Dzahn: [V: +1 C: +2] mediawiki: add data types to parameters [puppet] - https://gerrit.wikimedia.org/r/485104 (owner: Dzahn)
[08:47:15] (CR) Alexandros Kosiaris: [C: +1] "oh, nice find!" [puppet] - https://gerrit.wikimedia.org/r/499028 (https://phabricator.wikimedia.org/T196336) (owner: CDanis)
[08:50:50] (CR) Dzahn: "noop wherever i checked (mwdebug, mw1261, mwmaint).. compiler..etc" [puppet] - https://gerrit.wikimedia.org/r/485104 (owner: Dzahn)
[08:57:56] (PS4) Dzahn: apertium: base::service_unit -> systemd::service [puppet] - https://gerrit.wikimedia.org/r/456316 (https://phabricator.wikimedia.org/T194724)
[08:58:36] (CR) Dzahn: "needed before https://gerrit.wikimedia.org/r/c/operations/puppet/+/498429" [puppet] - https://gerrit.wikimedia.org/r/497788 (owner: Dzahn)
[08:58:48] (CR) Dzahn: "after: https://gerrit.wikimedia.org/r/c/operations/puppet/+/497788" [puppet] - https://gerrit.wikimedia.org/r/498429 (owner: Dzahn)
[08:59:53] (PS2) Dzahn: add Icinga notes_url to various NRPE monitor checks, pt 1 [puppet] - https://gerrit.wikimedia.org/r/498784
[09:01:06] (CR) Dzahn: [C: +2] "mostly the same ones already used/reviewed with monitoring::service class" [puppet] - https://gerrit.wikimedia.org/r/498784 (owner: Dzahn)
[09:03:44] (PS1) Giuseppe Lavagetto: profile::mediawiki::php: raise mysql.connect_timeout to 3 seconds [puppet] - https://gerrit.wikimedia.org/r/499143 (https://phabricator.wikimedia.org/T211488)
[09:03:46] (PS1) Giuseppe Lavagetto: profile::mediawiki::php: remove references to tideways [puppet] - https://gerrit.wikimedia.org/r/499144
[09:05:17] I will be pooling prometheus1003 running v2 shortly, let me know if you see anything odd with dashboards
[09:05:28] v2 to serve all eqiad queries that is
[09:05:54] !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=prometheus1003.eqiad.wmnet
[09:05:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:06] (PS1) Dzahn: osm: add Icinga notes URLs [puppet] - https://gerrit.wikimedia.org/r/499145
[09:06:18] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916
[09:07:35] (PS1) Gilles: Define asoranking scap target [puppet] - https://gerrit.wikimedia.org/r/499146 (https://phabricator.wikimedia.org/T209857)
[09:08:24] (CR) Giuseppe Lavagetto: "Please amend the two things I indicated here." (2 comments) [puppet] - https://gerrit.wikimedia.org/r/485104 (owner: Dzahn)
[09:08:45] (CR) jerkins-bot: [V: -1] Define asoranking scap target [puppet] - https://gerrit.wikimedia.org/r/499146 (https://phabricator.wikimedia.org/T209857) (owner: Gilles)
[09:09:26] !log filippo@puppetmaster1001 conftool action : set/pooled=no; selector: name=prometheus1004.eqiad.wmnet
[09:09:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:38] (CR) Gilles: "Not sure what's failing? https://integration.wikimedia.org/ci/job/operations-puppet-tests-stretch-docker/8779/consoleText" [puppet] - https://gerrit.wikimedia.org/r/499146 (https://phabricator.wikimedia.org/T209857) (owner: Gilles)
[09:13:20] (CR) Dzahn: "modules/profile/manifests/analytics/asoranking.pp:6 ERROR tab character found (hard_tabs)" [puppet] - https://gerrit.wikimedia.org/r/499146 (https://phabricator.wikimedia.org/T209857) (owner: Gilles)
[09:15:25] !log Restarting pdfrender on scb1001
[09:15:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:30] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.010 second response time https://phabricator.wikimedia.org/T174916
[09:17:44] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/pro
[09:18:40] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton
[09:19:36] (PS1) Dzahn: add Icinga notes_url to various NRPE monitor checks, pt 2 [puppet] - https://gerrit.wikimedia.org/r/499148
[09:20:21] (CR) Giuseppe Lavagetto: "I am unsure if it wouldn't be worth to incorporate this behaviour into puppet-facts-export.py, and let me explain why:" [puppet] - https://gerrit.wikimedia.org/r/499007 (owner: Andrew Bogott)
[09:22:17] (PS9) Dzahn: create a new role 'hmmp' to replace role(simplelamp) [puppet] - https://gerrit.wikimedia.org/r/489339 (https://phabricator.wikimedia.org/T215662)
[09:22:56] Operations, Traffic, monitoring: prometheus-based graph significantly slower than statsd equivalent - https://phabricator.wikimedia.org/T212312 (ema) >>! In T212312#4866085, @CDanis wrote: > Anyway I'm making all the 'slow prometheus query' tasks sub-tasks of the prometheus 2.x upgrade T187987 as th...
[09:28:21] (CR) Dzahn: mediawiki: add data types to parameters (2 comments) [puppet] - https://gerrit.wikimedia.org/r/485104 (owner: Dzahn)
[09:29:12] (PS2) Gilles: Define asoranking scap target [puppet] - https://gerrit.wikimedia.org/r/499146 (https://phabricator.wikimedia.org/T209857)
[09:30:20] (CR) jerkins-bot: [V: -1] Define asoranking scap target [puppet] - https://gerrit.wikimedia.org/r/499146 (https://phabricator.wikimedia.org/T209857) (owner: Gilles)
[09:31:44] (PS1) Dzahn: mediawiki: revert scap dirs to params, forward_syslog data type [puppet] - https://gerrit.wikimedia.org/r/499149
[09:32:12] (CR) Dzahn: "https://gerrit.wikimedia.org/r/c/operations/puppet/+/499149" [puppet] - https://gerrit.wikimedia.org/r/485104 (owner: Dzahn)
[09:32:47] (CR) jerkins-bot: [V: -1] mediawiki: revert scap dirs to params, forward_syslog data type [puppet] - https://gerrit.wikimedia.org/r/499149 (owner: Dzahn)
[09:33:44] (CR) Dzahn: [V: +1] "https://puppet-compiler.wmflabs.org/compiler1002/15336/phab1002.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/489339 (https://phabricator.wikimedia.org/T215662) (owner: Dzahn)
[09:35:30] (PS10) Dzahn: create a new role 'hmmp' to replace role(simplelamp) [puppet] - https://gerrit.wikimedia.org/r/489339 (https://phabricator.wikimedia.org/T215662)
[09:37:16] (PS2) Dzahn: mediawiki: revert scap dirs to params, forward_syslog data type [puppet] - https://gerrit.wikimedia.org/r/499149
[09:38:47] (PS3) Gilles: Define asoranking scap target [puppet] - https://gerrit.wikimedia.org/r/499146 (https://phabricator.wikimedia.org/T209857)
[09:38:59] (CR) Giuseppe Lavagetto: [C: -1] "This should be a profile, not a role. LGTM otherwise" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/489339 (https://phabricator.wikimedia.org/T215662) (owner: Dzahn)
[09:39:26] Operations, CirrusSearch, Discovery-Search, Patch-For-Review: Deprecation warning on elasticsearch 6 - https://phabricator.wikimedia.org/T218994 (dcausse)
[09:39:28] (CR) Marostegui: [C: +1] mariadb-backups: Update data gathering code to the latest version [puppet] - https://gerrit.wikimedia.org/r/499128 (https://phabricator.wikimedia.org/T206203) (owner: Jcrespo)
[09:40:25] (CR) Giuseppe Lavagetto: [C: +1] "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/499149 (owner: Dzahn)
[09:42:50] !log Upgrade db2070
[09:42:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:43:19] (PS3) Gehel: don't change old cluster name [puppet] - https://gerrit.wikimedia.org/r/498937 (https://phabricator.wikimedia.org/T213940) (owner: Mathew.onipe)
[09:43:36] (PS8) Jcrespo: BackupStatistics: Narrow the search for the treated backup [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/498782 (https://phabricator.wikimedia.org/T206203)
[09:44:00] Operations, CirrusSearch, Discovery-Search: Elasticsearch 6: silence deprecation warnings to avoid logspam - https://phabricator.wikimedia.org/T219269 (dcausse)
[09:44:05] (CR) jerkins-bot: [V: -1] BackupStatistics: Narrow the search for the treated backup [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/498782 (https://phabricator.wikimedia.org/T206203) (owner: Jcrespo)
[09:44:12] (CR) Vgutierrez: redirects.dat - split non-canonical to separate section (1 comment) [puppet] - https://gerrit.wikimedia.org/r/292785 (https://phabricator.wikimedia.org/T133548) (owner: BBlack)
[09:44:18] (PS4) Jcrespo: mariadb-backups: Update data gathering code to the latest version [puppet] - https://gerrit.wikimedia.org/r/499128 (https://phabricator.wikimedia.org/T206203)
[09:45:23] (CR) jerkins-bot: [V: -1] mariadb-backups: Update data gathering code to the latest version [puppet] - https://gerrit.wikimedia.org/r/499128 (https://phabricator.wikimedia.org/T206203) (owner: Jcrespo)
[09:45:31] (PS5) Jcrespo: mariadb-backups: Update data gathering code to the latest version [puppet] - https://gerrit.wikimedia.org/r/499128 (https://phabricator.wikimedia.org/T206203)
[09:45:46] (CR) Gehel: [C: +2] don't change old cluster name [puppet] - https://gerrit.wikimedia.org/r/498937 (https://phabricator.wikimedia.org/T213940) (owner: Mathew.onipe)
[09:46:34] (CR) jerkins-bot: [V: -1] mariadb-backups: Update data gathering code to the latest version [puppet] - https://gerrit.wikimedia.org/r/499128 (https://phabricator.wikimedia.org/T206203) (owner: Jcrespo)
[09:47:27] Operations, Operations-Software-Development, Continuous-Integration-Config, Patch-For-Review: Puppet tox: properly lint both Py2 and Py3 files - https://phabricator.wikimedia.org/T184435 (fgiunchedi) >>! In T184435#5051826, @Volans wrote: > I think we could do some test of the real impact of migr...
[09:47:33] Operations, CirrusSearch, Discovery-Search, Epic: Epic: Deprecation warning on elasticsearch 6 - https://phabricator.wikimedia.org/T218994 (dcausse)
[09:49:04] (PS1) Mathew.onipe: icinga: increase check_interval for bulk update failure [puppet] - https://gerrit.wikimedia.org/r/499150 (https://phabricator.wikimedia.org/T214494)
[09:49:40] (CR) Arturo Borrero Gonzalez: "some minor comments inline" (4 comments) [puppet] - https://gerrit.wikimedia.org/r/499007 (owner: Andrew Bogott)
[09:49:58] (PS1) Marostegui: db-eqiad.php: Depool db1105:3312 [mediawiki-config] - https://gerrit.wikimedia.org/r/499151
[09:50:04] Operations, CirrusSearch, Discovery-Search, Epic: Epic: Deprecation warning on elasticsearch 6 - https://phabricator.wikimedia.org/T218994 (dcausse)
[09:50:06] (CR) Vgutierrez: redirects.dat - split non-canonical to separate section (1 comment) [puppet] - https://gerrit.wikimedia.org/r/292785 (https://phabricator.wikimedia.org/T133548) (owner: BBlack)
[09:53:12] (PS2) Arturo Borrero Gonzalez: toolforge: remove deleted Trusty host aliases [puppet] - https://gerrit.wikimedia.org/r/499031 (https://phabricator.wikimedia.org/T109485) (owner: BryanDavis)
[09:53:31] (PS1) Gilles: Element Timing for Images and Layout Stability on ruwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/499152 (https://phabricator.wikimedia.org/T216598)
[09:53:33] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1105:3312 [mediawiki-config] - https://gerrit.wikimedia.org/r/499151 (owner: Marostegui)
[09:53:35] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/15339/" [puppet] - https://gerrit.wikimedia.org/r/499149 (owner: Dzahn)
[09:53:59] (PS3) Dzahn: mediawiki: revert scap dirs to params, forward_syslog data type [puppet] - https://gerrit.wikimedia.org/r/499149
[09:54:03] Operations, MediaWiki-Cache, MW-1.33-notes (1.33.0-wmf.21; 2019-03-12), Patch-For-Review, and 3 others: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (elukey) Tried with ifstat to monitor the interfac...
[09:54:11] (CR) Filippo Giunchedi: profile: kafkatee instance for udp2log compat (1 comment) [puppet] - https://gerrit.wikimedia.org/r/498386 (https://phabricator.wikimedia.org/T126989) (owner: Filippo Giunchedi)
[09:54:34] !log Upgrade db2071
[09:54:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:54:57] (Merged) jenkins-bot: db-eqiad.php: Depool db1105:3312 [mediawiki-config] - https://gerrit.wikimedia.org/r/499151 (owner: Marostegui)
[09:55:56] (CR) Volans: "> Patch Set 4:" [puppet] - https://gerrit.wikimedia.org/r/498429 (owner: Dzahn)
[09:56:09] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1105:3312 (duration: 00m 50s)
[09:56:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:57:31] (CR) Gehel: [C: -1] "minor comments inline" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/499150 (https://phabricator.wikimedia.org/T214494) (owner: Mathew.onipe)
[09:57:35] (CR) jenkins-bot: db-eqiad.php: Depool db1105:3312 [mediawiki-config] - https://gerrit.wikimedia.org/r/499151 (owner: Marostegui)
[09:57:44] (PS3) Arturo Borrero Gonzalez: toolforge: remove deleted Trusty host aliases
[puppet] - 10https://gerrit.wikimedia.org/r/499031 (https://phabricator.wikimedia.org/T109485) (owner: 10BryanDavis) [09:58:25] (03CR) 10Filippo Giunchedi: [C: 03+1] mtail: add hostname to systemd metric [puppet] - 10https://gerrit.wikimedia.org/r/498911 (owner: 10Jbond) [09:58:54] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: remove deleted Trusty host aliases [puppet] - 10https://gerrit.wikimedia.org/r/499031 (https://phabricator.wikimedia.org/T109485) (owner: 10BryanDavis) [09:59:00] (03CR) 10Dzahn: "it didn't seem to me that way, i would like an OK to install this package separate from the python code nitpicks" [puppet] - 10https://gerrit.wikimedia.org/r/498429 (owner: 10Dzahn) [10:06:26] (03PS6) 10Jcrespo: mariadb-backups: Update data gathering code to the latest version [puppet] - 10https://gerrit.wikimedia.org/r/499128 (https://phabricator.wikimedia.org/T206203) [10:06:41] (03PS1) 10Vgutierrez: Allow LE issue the global unified wildcard certificate [dns] - 10https://gerrit.wikimedia.org/r/499154 (https://phabricator.wikimedia.org/T213705) [10:07:01] (03PS9) 10Jcrespo: BackupStatistics: Narrow the search for the treated backup [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/498782 (https://phabricator.wikimedia.org/T206203) [10:07:26] (03CR) 10jerkins-bot: [V: 04-1] BackupStatistics: Narrow the search for the treated backup [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/498782 (https://phabricator.wikimedia.org/T206203) (owner: 10Jcrespo) [10:08:22] (03PS6) 10Jcrespo: mariadb: Create socket dir also on puppet run, & for multiinstance [puppet] - 10https://gerrit.wikimedia.org/r/467317 (https://phabricator.wikimedia.org/T207013) [10:10:51] (03PS1) 10Vgutierrez: Add CAA records for wikiba.se [dns] - 10https://gerrit.wikimedia.org/r/499155 (https://phabricator.wikimedia.org/T213705) [10:10:54] (03PS1) 10Vgutierrez: Allow LE issue the non-canonical redirects service certficate [dns] - 10https://gerrit.wikimedia.org/r/499156 
(https://phabricator.wikimedia.org/T213705) [10:11:01] (03CR) 10Giuseppe Lavagetto: [C: 03+1] network::constants: Move hiera calls to the parameters [puppet] - 10https://gerrit.wikimedia.org/r/498796 (owner: 10Alex Monk) [10:12:42] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] BackupStatistics: Narrow the search for the treated backup [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/498782 (https://phabricator.wikimedia.org/T206203) (owner: 10Jcrespo) [10:13:27] (03PS7) 10Jcrespo: mariadb-backups: Update data gathering code to the latest version [puppet] - 10https://gerrit.wikimedia.org/r/499128 (https://phabricator.wikimedia.org/T206203) [10:17:53] (03CR) 10Alexandros Kosiaris: [C: 03+2] gitignore: Ignore all *.iml files [puppet] - 10https://gerrit.wikimedia.org/r/497306 (owner: 10Ladsgroup) [10:18:03] 10Operations, 10Toolforge, 10Traffic, 10HTTPS, 10Patch-For-Review: Migrate tools.wmflabs.org to https only (and set HSTS) - https://phabricator.wikimedia.org/T102367 (10aborrero) I assume this is something in our nginx proxy, right? @Vgutierrez Could you please help us review the config? [10:18:17] (03CR) 10Alexandros Kosiaris: [C: 03+2] "/me notes to self to never use IntelliJ" [puppet] - 10https://gerrit.wikimedia.org/r/497306 (owner: 10Ladsgroup) [10:18:26] (03PS1) 10Addshore: wikibase.php, define sharedCacheKeyGroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499158 [10:18:28] (03PS3) 10Alexandros Kosiaris: gitignore: Ignore all *.iml files [puppet] - 10https://gerrit.wikimedia.org/r/497306 (owner: 10Ladsgroup) [10:18:30] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] gitignore: Ignore all *.iml files [puppet] - 10https://gerrit.wikimedia.org/r/497306 (owner: 10Ladsgroup) [10:19:23] (03PS5) 10Elukey: Define asoranking scap target [puppet] - 10https://gerrit.wikimedia.org/r/499146 (https://phabricator.wikimedia.org/T209857) (owner: 10Gilles) [10:24:26] gilles: o/ [10:24:30] are you around? 
[10:27:23] PROBLEM - puppet last run on stat1007 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[performance/asoranking] [10:28:13] this is me, and deploy1001 is next [10:28:14] :) [10:28:26] that sounded like a threat [10:28:37] gilles: I'd need https://gerrit.wikimedia.org/r/#/c/performance/asoranking/+/498022/ merged, I mistakenly thought to do it by myself but I don't have perms [10:29:02] jijiki: nono a simple public self blame :D [10:29:17] (03PS4) 10Gehel: elasticsearch: upgrade to elastic 6.5.4 for cirrus / codfw [puppet] - 10https://gerrit.wikimedia.org/r/498079 (https://phabricator.wikimedia.org/T218878) [10:29:19] (03PS3) 10Gehel: elasticsearch: deploy elasticsearch config for ES6 cirrus / codfw [puppet] - 10https://gerrit.wikimedia.org/r/498080 (https://phabricator.wikimedia.org/T218878) [10:30:07] (03CR) 10Arturo Borrero Gonzalez: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/498796 (owner: 10Alex Monk) [10:30:09] PROBLEM - puppet last run on deploy1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Scap_source[performance/asoranking] [10:30:58] ah no wait jenkins might sae me [10:31:00] *save [10:31:53] 10Operations, 10Toolforge, 10Traffic, 10HTTPS, 10Patch-For-Review: Migrate tools.wmflabs.org to https only (and set HSTS) - https://phabricator.wikimedia.org/T102367 (10Vgutierrez) >>! In T102367#5057208, @aborrero wrote: > I assume this is something in our nginx proxy, right? @Vgutierrez Could you pleas... [10:33:09] PROBLEM - puppet last run on deploy2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Scap_source[performance/asoranking] [10:33:53] (03CR) 10Mathew.onipe: [C: 03+1] elasticsearch: deploy elasticsearch config for ES6 cirrus / codfw [puppet] - 10https://gerrit.wikimedia.org/r/498080 (https://phabricator.wikimedia.org/T218878) (owner: 10Gehel) [10:35:23] RECOVERY - puppet last run on deploy1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:35:28] (03CR) 10DCausse: [C: 03+1] elasticsearch: upgrade to elastic 6.5.4 for cirrus / codfw [puppet] - 10https://gerrit.wikimedia.org/r/498079 (https://phabricator.wikimedia.org/T218878) (owner: 10Gehel) [10:35:44] (03PS8) 10Jcrespo: mariadb-backups: Update data gathering code to the latest version [puppet] - 10https://gerrit.wikimedia.org/r/499128 (https://phabricator.wikimedia.org/T206203) [10:36:00] (03CR) 10DCausse: [C: 03+1] elasticsearch: deploy elasticsearch config for ES6 cirrus / codfw [puppet] - 10https://gerrit.wikimedia.org/r/498080 (https://phabricator.wikimedia.org/T218878) (owner: 10Gehel) [10:36:07] PROBLEM - puppet last run on es1012 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [10:37:15] adding a new scap repo is always a joy [10:39:44] (03CR) 10Jcrespo: [C: 03+2] mariadb-backups: Update data gathering code to the latest version [puppet] - 10https://gerrit.wikimedia.org/r/499128 (https://phabricator.wikimedia.org/T206203) (owner: 10Jcrespo) [10:42:02] (03CR) 10Mathew.onipe: [C: 03+1] elasticsearch: upgrade to elastic 6.5.4 for cirrus / codfw [puppet] - 10https://gerrit.wikimedia.org/r/498079 (https://phabricator.wikimedia.org/T218878) (owner: 10Gehel) [10:42:18] (03CR) 10Jcrespo: [C: 03+1] db-codfw.php: Change parsercache key [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498322 (https://phabricator.wikimedia.org/T210725) (owner: 10Marostegui) [10:42:41] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "LGTM https://puppet-compiler.wmflabs.org/compiler1002/15341/" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/498516 (https://phabricator.wikimedia.org/T217932) (owner: 10BryanDavis) [10:46:01] (03PS3) 10Volans: Add Python type hints and mypy check [software/spicerack] - 10https://gerrit.wikimedia.org/r/497290 [10:46:13] (03CR) 10Volans: "replies inline" (034 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/497290 (owner: 10Volans) [10:46:46] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1105:3312" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499159 [10:47:51] (03CR) 10Volans: [C: 04-1] "I agree with Giuseppe here. How would this even work? How the puppet compiler will know what roles to apply to a given FQDN?" [puppet] - 10https://gerrit.wikimedia.org/r/499007 (owner: 10Andrew Bogott) [10:48:12] hashar: o/ - I'd need your help with deploy2001 if you are around [10:48:25] RECOVERY - puppet last run on stat1007 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:48:37] (03CR) 10Volans: "Ignore my previous comment, wrong CR, sorry." 
[puppet] - 10https://gerrit.wikimedia.org/r/499007 (owner: 10Andrew Bogott) [10:49:18] (03CR) 10Volans: [C: 04-1] "I agree with Giuseppe here. How would this even work? How the puppet compiler will know what roles to apply to a given FQDN?" [puppet] - 10https://gerrit.wikimedia.org/r/499026 (owner: 10Andrew Bogott) [10:49:41] (03CR) 10Giuseppe Lavagetto: [C: 03+2] arclamp: Document YAML config file structure and logging logic [puppet] - 10https://gerrit.wikimedia.org/r/498932 (https://phabricator.wikimedia.org/T176916) (owner: 10Krinkle) [10:49:48] gilles: I was able to deploy asoranking on stat1007, lemme know if everything is ok (need to fix a problem on deploy2001 to complete the work but I'd need releng) [10:49:53] (03PS2) 10Giuseppe Lavagetto: arclamp: Document YAML config file structure and logging logic [puppet] - 10https://gerrit.wikimedia.org/r/498932 (https://phabricator.wikimedia.org/T176916) (owner: 10Krinkle) [10:50:09] (03CR) 10Dzahn: create a new role 'hmmp' to replace role(simplelamp) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/489339 (https://phabricator.wikimedia.org/T215662) (owner: 10Dzahn) [10:50:53] (03PS5) 10Gehel: elasticsearch: upgrade to elastic 6.5.4 for cirrus / codfw [puppet] - 10https://gerrit.wikimedia.org/r/498079 (https://phabricator.wikimedia.org/T218878) [10:51:24] (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad.php: Depool db1105:3312" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499159 (owner: 10Marostegui) [10:52:08] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] arclamp: Document YAML config file structure and logging logic [puppet] - 10https://gerrit.wikimedia.org/r/498932 (https://phabricator.wikimedia.org/T176916) (owner: 10Krinkle) [10:52:39] (03PS3) 10Elukey: role::memcached: apply interface::rps to all the hosts [puppet] - 10https://gerrit.wikimedia.org/r/472099 (https://phabricator.wikimedia.org/T203786) [10:53:07] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1105:3312" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/499159 (owner: 10Marostegui) [10:53:19] (03CR) 10Marostegui: "Would be nice to check with the PCC one more time to be sure it is fine" [puppet] - 10https://gerrit.wikimedia.org/r/467317 (https://phabricator.wikimedia.org/T207013) (owner: 10Jcrespo) [10:53:24] (03CR) 10Gehel: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/497290 (owner: 10Volans) [10:53:32] (03CR) 10Gehel: [C: 03+2] elasticsearch: upgrade to elastic 6.5.4 for cirrus / codfw [puppet] - 10https://gerrit.wikimedia.org/r/498079 (https://phabricator.wikimedia.org/T218878) (owner: 10Gehel) [10:53:39] (03PS1) 10Filippo Giunchedi: gerrit: send apache access/error logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/499161 (https://phabricator.wikimedia.org/T219271) [10:54:09] (03PS6) 10Gehel: elasticsearch: upgrade to elastic 6.5.4 for cirrus / codfw [puppet] - 10https://gerrit.wikimedia.org/r/498079 (https://phabricator.wikimedia.org/T218878) [10:54:16] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1105:3312 (duration: 00m 49s) [10:54:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:47] (03PS2) 10Jbond: mtail: add hostname to systemd metric [puppet] - 10https://gerrit.wikimedia.org/r/498911 [10:55:20] (03PS2) 10Mathew.onipe: icinga: increase mjolnir bulk update check frequency [puppet] - 10https://gerrit.wikimedia.org/r/499150 (https://phabricator.wikimedia.org/T214494) [10:56:49] (03PS1) 10Jcrespo: mariadb: Prepare dbprov2001/2 and future dbprov1001/2 for production [puppet] - 10https://gerrit.wikimedia.org/r/499163 (https://phabricator.wikimedia.org/T218336) [10:57:04] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1002/15342/cobalt.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/499161 (https://phabricator.wikimedia.org/T219271) (owner: 10Filippo Giunchedi) [10:57:10] (03CR) 10Mathew.onipe: icinga: increase 
mjolnir bulk update check frequency (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/499150 (https://phabricator.wikimedia.org/T214494) (owner: 10Mathew.onipe) [10:57:33] (03CR) 10Volans: [C: 03+2] Add Python type hints and mypy check [software/spicerack] - 10https://gerrit.wikimedia.org/r/497290 (owner: 10Volans) [10:58:06] (03PS9) 10MarcoAurelio: Initial configuration for hyw.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481943 (https://phabricator.wikimedia.org/T212597) [10:58:43] (03PS3) 10Gehel: icinga: increase mjolnir bulk update check frequency [puppet] - 10https://gerrit.wikimedia.org/r/499150 (https://phabricator.wikimedia.org/T214494) (owner: 10Mathew.onipe) [10:58:47] folks interested in reviewing https://gerrit.wikimedia.org/r/c/operations/puppet/+/499161 ? should be straightforward [10:59:26] (03CR) 10Jbond: [C: 03+2] mtail: add hostname to systemd metric [puppet] - 10https://gerrit.wikimedia.org/r/498911 (owner: 10Jbond) [11:00:01] (03PS3) 10Jbond: mtail: add hostname to systemd metric [puppet] - 10https://gerrit.wikimedia.org/r/498911 [11:00:04] 10Operations, 10Analytics, 10Product-Analytics, 10Patch-For-Review, 10User-Elukey: notebook/stat server(s) running out of memory - https://phabricator.wikimedia.org/T212824 (10aborrero) >>! In T212824#4967798, @elukey wrote: > Today I checked notebook1003 using the command `systemd-cgls memory`, that sho... [11:00:05] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: It is that lovely time of the day again! You are hereby commanded to deploy European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190326T1100). [11:00:05] Urbanecm and Amir1: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[11:00:09] (03CR) 10Gehel: [C: 03+2] icinga: increase mjolnir bulk update check frequency [puppet] - 10https://gerrit.wikimedia.org/r/499150 (https://phabricator.wikimedia.org/T214494) (owner: 10Mathew.onipe) [11:00:14] here [11:00:27] (03CR) 10Jcrespo: "@Marostegui: Please help me double check the regexes, not only for dbprov, but for other hosts/groups we may be missing" [puppet] - 10https://gerrit.wikimedia.org/r/499163 (https://phabricator.wikimedia.org/T218336) (owner: 10Jcrespo) [11:00:52] godog: does it not need a regex ? [11:01:12] mutante: no, shell glob [11:01:17] o/ [11:01:44] godog: i meant the startmsg_regex parameter [11:02:23] (03Merged) 10jenkins-bot: Add Python type hints and mypy check [software/spicerack] - 10https://gerrit.wikimedia.org/r/497290 (owner: 10Volans) [11:02:33] ah, that is just when you have multi-line logs. is that right? [11:02:35] mutante: ah, no not needed for apache logs, gerrit logs can be multiline though [11:02:38] yeah that's right [11:02:51] gotcha, thanks. lgtm [11:02:54] (03CR) 10Dzahn: [C: 03+1] gerrit: send apache access/error logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/499161 (https://phabricator.wikimedia.org/T219271) (owner: 10Filippo Giunchedi) [11:02:56] (03PS4) 10Jbond: mtail: add hostname to systemd metric [puppet] - 10https://gerrit.wikimedia.org/r/498911 [11:03:25] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1105:3312" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499159 (owner: 10Marostegui) [11:04:07] (03PS2) 10Filippo Giunchedi: gerrit: send apache access/error logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/499161 (https://phabricator.wikimedia.org/T219271) [11:04:36] Amir1: can you deploy Urbanecm's patch too? 
[11:04:58] (03CR) 10Filippo Giunchedi: [C: 03+2] gerrit: send apache access/error logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/499161 (https://phabricator.wikimedia.org/T219271) (owner: 10Filippo Giunchedi) [11:05:01] zeljkof: sure [11:05:03] (03CR) 10Marostegui: "Last time I had to do something with the regex (can't remember what it was) I double checked all of them. Specially as we added new parser" [puppet] - 10https://gerrit.wikimedia.org/r/499163 (https://phabricator.wikimedia.org/T218336) (owner: 10Jcrespo) [11:05:09] Amir1: thanks! [11:05:23] Amir1: since, there's just two patches, swat is yours :) [11:05:45] (03PS3) 10Filippo Giunchedi: gerrit: send apache access/error logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/499161 (https://phabricator.wikimedia.org/T219271) [11:06:11] (03CR) 10Jcrespo: "I don't plan to work on this anytime soon." [puppet] - 10https://gerrit.wikimedia.org/r/467317 (https://phabricator.wikimedia.org/T207013) (owner: 10Jcrespo) [11:06:26] Let me know if you need something from me [11:07:17] (03CR) 10Alex Monk: scap: Make wmflabs php7 behaviour match prod's (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/499025 (https://phabricator.wikimedia.org/T219242) (owner: 10Alex Monk) [11:07:36] Urbanecm: it doesn't look testable [11:07:43] RECOVERY - puppet last run on es1012 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:07:52] (03CR) 10Ladsgroup: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498949 (https://phabricator.wikimedia.org/T213869) (owner: 10Urbanecm) [11:07:59] Amir1, indeed, it's just a throttle rule :) [11:08:02] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] gerrit: send apache access/error logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/499161 (https://phabricator.wikimedia.org/T219271) (owner: 10Filippo Giunchedi) [11:08:21] I love easy patches, they are full of surprises [11:08:36] :D [11:09:16] (03Merged) 
10jenkins-bot: Throttle rule for Wikimedia Hackathon 2019 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498949 (https://phabricator.wikimedia.org/T213869) (owner: 10Urbanecm) [11:09:57] RECOVERY - puppet last run on deploy2001 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [11:10:26] (03PS11) 10Dzahn: create a new role/profile 'hmmp' to replace role(simplelamp) [puppet] - 10https://gerrit.wikimedia.org/r/489339 (https://phabricator.wikimedia.org/T215662) [11:10:33] (03PS2) 10Giuseppe Lavagetto: arclamp: Rename YAML file from xenon-log to arclamp-log-xenon [puppet] - 10https://gerrit.wikimedia.org/r/498933 (https://phabricator.wikimedia.org/T176916) (owner: 10Krinkle) [11:10:53] Urbanecm: going live [11:10:56] thx [11:11:06] Amir1: if the string 'trivial' shows up it should raise the warning level *g*:) [11:11:37] !log ladsgroup@deploy1001 Synchronized wmf-config/throttle.php: SWAT: [[gerrit:498949|Throttle rule for Wikimedia Hackathon 2019 (T213869)]] (duration: 00m 51s) [11:11:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:41] T213869: Add throttle exception for Wikimedia hackathon 2019 in Prague - https://phabricator.wikimedia.org/T213869 [11:11:45] (03CR) 10jerkins-bot: [V: 04-1] create a new role/profile 'hmmp' to replace role(simplelamp) [puppet] - 10https://gerrit.wikimedia.org/r/489339 (https://phabricator.wikimedia.org/T215662) (owner: 10Dzahn) [11:12:07] (03CR) 10Giuseppe Lavagetto: [C: 03+2] arclamp: Rename YAML file from xenon-log to arclamp-log-xenon [puppet] - 10https://gerrit.wikimedia.org/r/498933 (https://phabricator.wikimedia.org/T176916) (owner: 10Krinkle) [11:13:03] (03CR) 10Dzahn: "ok... 
fixing " Parameter 'sqldata_dir' of class 'profile::hmmp' has no call to hiera"" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/489339 (https://phabricator.wikimedia.org/T215662) (owner: 10Dzahn) [11:13:38] <_joe_> mutante: maybe it shouldn't be a parameter :) [11:13:45] (03CR) 10jenkins-bot: Add Python type hints and mypy check [software/spicerack] - 10https://gerrit.wikimedia.org/r/497290 (owner: 10Volans) [11:13:58] _joe_: i was just thinking about that. since you also say that.. removing :) [11:14:36] (03PS10) 10Arturo Borrero Gonzalez: WMCS: introduce sssd, replacing nscd/nslcd [puppet] - 10https://gerrit.wikimedia.org/r/498359 (https://phabricator.wikimedia.org/T218126) [11:14:47] but ..it's because i copied simplelamp and theoretically some users could have changed it.. anyways.. [11:15:22] (03CR) 10jenkins-bot: Throttle rule for Wikimedia Hackathon 2019 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498949 (https://phabricator.wikimedia.org/T213869) (owner: 10Urbanecm) [11:15:38] i will have to write a mail to all the users of that class and encourage them to convert [11:15:55] last users of apache module [11:15:58] mutante: indeed :P [11:16:34] (03PS2) 10Ladsgroup: Set $wmgWikibaseSiteGroup for wikimaniawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498440 (https://phabricator.wikimedia.org/T217730) [11:16:55] (03PS1) 10Filippo Giunchedi: rsyslog: ship gerrit apache access logs [puppet] - 10https://gerrit.wikimedia.org/r/499171 (https://phabricator.wikimedia.org/T219271) [11:17:35] (03CR) 10Ladsgroup: [C: 03+2] Set $wmgWikibaseSiteGroup for wikimaniawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498440 (https://phabricator.wikimedia.org/T217730) (owner: 10Ladsgroup) [11:17:37] (03PS1) 10Alex Monk: network::constants: remove seemingly unused bastion_hosts extra hiera key [puppet] - 10https://gerrit.wikimedia.org/r/499172 [11:17:40] (03PS6) 10Ppchelko: Create node-specific logstash filters for syslog. 
[puppet] - 10https://gerrit.wikimedia.org/r/498417 (https://phabricator.wikimedia.org/T211125) [11:17:59] (03PS12) 10Dzahn: create a new role/profile 'hmmp' to replace role(simplelamp) [puppet] - 10https://gerrit.wikimedia.org/r/489339 (https://phabricator.wikimedia.org/T215662) [11:18:58] (03Merged) 10jenkins-bot: Set $wmgWikibaseSiteGroup for wikimaniawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498440 (https://phabricator.wikimedia.org/T217730) (owner: 10Ladsgroup) [11:19:08] (03CR) 10Filippo Giunchedi: [C: 03+2] rsyslog: ship gerrit apache access logs [puppet] - 10https://gerrit.wikimedia.org/r/499171 (https://phabricator.wikimedia.org/T219271) (owner: 10Filippo Giunchedi) [11:19:20] (03PS2) 10Filippo Giunchedi: rsyslog: ship gerrit apache access logs [puppet] - 10https://gerrit.wikimedia.org/r/499171 (https://phabricator.wikimedia.org/T219271) [11:19:30] (03PS2) 10Dzahn: osm: add Icinga notes URLs [puppet] - 10https://gerrit.wikimedia.org/r/499145 [11:20:19] I forgot to rebase the throttle patch, I'm syncing again [11:20:26] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/499172 (owner: 10Alex Monk) [11:20:30] (03CR) 10Dzahn: [C: 03+2] osm: add Icinga notes URLs [puppet] - 10https://gerrit.wikimedia.org/r/499145 (owner: 10Dzahn) [11:20:37] !log ladsgroup@deploy1001 Synchronized wmf-config/throttle.php: SWAT: [[gerrit:498949|Throttle rule for Wikimedia Hackathon 2019 (T213869)]], try II (duration: 00m 49s) [11:20:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:41] T213869: Add throttle exception for Wikimedia hackathon 2019 in Prague - https://phabricator.wikimedia.org/T213869 [11:21:18] (03PS3) 10Dzahn: osm: add Icinga notes URLs [puppet] - 10https://gerrit.wikimedia.org/r/499145 [11:21:57] !log temporary install ifstat on mc1022 + tmux session to log in/out bandwidth usage every 1s for T203786 [11:22:02] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:03] T203786: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 [11:22:15] (03PS2) 10Giuseppe Lavagetto: arclamp: Make Redis subscription channel configurable [puppet] - 10https://gerrit.wikimedia.org/r/498934 (https://phabricator.wikimedia.org/T176916) (owner: 10Krinkle) [11:25:15] (03CR) 10Giuseppe Lavagetto: [C: 03+2] arclamp: Make Redis subscription channel configurable [puppet] - 10https://gerrit.wikimedia.org/r/498934 (https://phabricator.wikimedia.org/T176916) (owner: 10Krinkle) [11:25:41] (03CR) 10jenkins-bot: Set $wmgWikibaseSiteGroup for wikimaniawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498440 (https://phabricator.wikimedia.org/T217730) (owner: 10Ladsgroup) [11:26:01] (03PS3) 10Giuseppe Lavagetto: arclamp: Make Redis subscription channel configurable [puppet] - 10https://gerrit.wikimedia.org/r/498934 (https://phabricator.wikimedia.org/T176916) (owner: 10Krinkle) [11:26:16] (03CR) 10Arturo Borrero Gonzalez: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/498359 (https://phabricator.wikimedia.org/T218126) (owner: 10Arturo Borrero Gonzalez) [11:28:30] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:498440|Set $wmgWikibaseSiteGroup for wikimaniawiki (T217730)]] (duration: 00m 49s) [11:28:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:33] T217730: Connect wikimaniawiki to Wikidata - https://phabricator.wikimedia.org/T217730 [11:32:59] (03CR) 10Giuseppe Lavagetto: [C: 03+2] arclamp: Rename xenon-log script to arclamp-log in Puppet [puppet] - 10https://gerrit.wikimedia.org/r/498942 (https://phabricator.wikimedia.org/T176916) (owner: 10Krinkle) [11:34:51] jouncebot: next [11:34:51] In 0 hour(s) and 25 minute(s): Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190326T1200) [11:35:01] (03CR) 10Dzahn: "compiler still reports failures at https://puppet-compiler.wmflabs.org/compiler1002/15344/" [puppet] - 10https://gerrit.wikimedia.org/r/495735 (https://phabricator.wikimedia.org/T215523) (owner: 10MSantos) [11:37:31] (03PS3) 10Giuseppe Lavagetto: arclamp: Rename xenon-log script to arclamp-log in Puppet [puppet] - 10https://gerrit.wikimedia.org/r/498942 (https://phabricator.wikimedia.org/T176916) (owner: 10Krinkle) [11:39:30] !log wikiadmin@db1078.eqiad.wmnet(wikimaniawiki)> DELETE FROM sites; and site_identifiers [11:39:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:55] PROBLEM - puppet last run on grafana1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [11:40:19] !log ladsgroup@mwmaint1002:~$ mwscript extensions/Wikibase/lib/maintenance/populateSitesTable.php --wiki=wikimaniawiki --force-protocol https (T217730) [11:40:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:23] T217730: Connect wikimaniawiki to Wikidata - https://phabricator.wikimedia.org/T217730 [11:41:55] (03CR) 10Dzahn: [C: 03+1] "converted to profile. taking the "LGTM otherwise" from PS10 as a +1 :) not used yet and will be on cloud VPSes" [puppet] - 10https://gerrit.wikimedia.org/r/489339 (https://phabricator.wikimedia.org/T215662) (owner: 10Dzahn) [11:42:01] !log EU SWAT is done [11:42:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:11] PROBLEM - Check Varnish expiry mailbox lag on cp3036 is CRITICAL: CRITICAL: expiry mailbox lag is 2041364 [11:49:20] (03CR) 10Dzahn: [C: 03+2] create a new role/profile 'hmmp' to replace role(simplelamp) [puppet] - 10https://gerrit.wikimedia.org/r/489339 (https://phabricator.wikimedia.org/T215662) (owner: 10Dzahn) [11:49:58] (03PS13) 10Dzahn: create a new role/profile 'hmmp' to replace role(simplelamp) [puppet] - 10https://gerrit.wikimedia.org/r/489339 (https://phabricator.wikimedia.org/T215662) [11:56:07] 10Operations, 10Discovery, 10Discovery-Search (Current work), 10Patch-For-Review: Create extra elasticsearch clusters in beta cluster - https://phabricator.wikimedia.org/T213940 (10Mathew.onipe) Beta cluster is running three clusters now. Cluster state is fine for all them. chi: ` onimisionipe@deployment-... 
[11:56:42] (03PS11) 10Arturo Borrero Gonzalez: WMCS: introduce sssd, replacing nscd/nslcd [puppet] - 10https://gerrit.wikimedia.org/r/498359 (https://phabricator.wikimedia.org/T218126) [11:59:38] 10Operations, 10MediaWiki-Cache, 10MW-1.33-notes (1.33.0-wmf.21; 2019-03-12), 10Patch-For-Review, and 3 others: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10elukey) Some TKOs happened from 11:38 to 11:44 UT... [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190326T1200) [12:05:16] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "The program still only searches in /srv/log/xenon, which is not configurable. Shouldn't it change between xenon and excimer logs?" [puppet] - 10https://gerrit.wikimedia.org/r/498948 (https://phabricator.wikimedia.org/T176916) (owner: 10Krinkle) [12:08:20] (03PS4) 10Giuseppe Lavagetto: arclamp: Rename xenon-grep to arclamp-grep [puppet] - 10https://gerrit.wikimedia.org/r/498948 (https://phabricator.wikimedia.org/T176916) (owner: 10Krinkle) [12:10:49] RECOVERY - puppet last run on grafana1001 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [12:15:44] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "On second thoughts, let's merge like this now and I'll prepare a patch to allow using either channel later." 
[puppet] - 10https://gerrit.wikimedia.org/r/498948 (https://phabricator.wikimedia.org/T176916) (owner: 10Krinkle) [12:20:09] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/pro [12:39:00] (03CR) 10Giuseppe Lavagetto: [C: 03+2] arclamp: Rename provisioning of xenon-log to arclamp-log [puppet] - 10https://gerrit.wikimedia.org/r/498950 (https://phabricator.wikimedia.org/T176916) (owner: 10Krinkle) [12:39:12] (03PS4) 10Giuseppe Lavagetto: arclamp: Rename provisioning of xenon-log to arclamp-log [puppet] - 10https://gerrit.wikimedia.org/r/498950 (https://phabricator.wikimedia.org/T176916) (owner: 10Krinkle) [12:54:56] (03PS4) 10Gehel: elasticsearch: deploy elasticsearch config for ES6 cirrus / codfw [puppet] - 10https://gerrit.wikimedia.org/r/498080 (https://phabricator.wikimedia.org/T218878) [12:56:01] (03CR) 10Gehel: [C: 03+2] elasticsearch: deploy elasticsearch config for ES6 cirrus / codfw [puppet] - 10https://gerrit.wikimedia.org/r/498080 (https://phabricator.wikimedia.org/T218878) (owner: 10Gehel) [12:58:21] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [12:58:37] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.e6-upgrade [12:58:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:59] PROBLEM - puppet last run on db1077 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:00:04] Deploy window MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190326T1300) [13:00:18] checking [13:00:43] 503 Proxy Error [13:01:17] seen more of those lately? or just me? [13:04:41] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.force-shard-allocation [13:04:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:13] RECOVERY - puppet last run on db1077 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [13:06:23] !log gehel@cumin2001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0) [13:06:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:56] 10Operations: cronspam: cross-validate-accounts - https://phabricator.wikimedia.org/T219274 (10GTirloni) [13:11:06] (03PS1) 10Vgutierrez: acme_chief: Issue the global unified wildcard certificate [puppet] - 10https://gerrit.wikimedia.org/r/499185 (https://phabricator.wikimedia.org/T213705) [13:19:20] (03PS2) 10CDanis: sync_icinga_state: kludge nsca stuckness [puppet] - 10https://gerrit.wikimedia.org/r/499028 (https://phabricator.wikimedia.org/T196336) [13:21:04] (03CR) 10CDanis: [C: 03+2] sync_icinga_state: kludge nsca stuckness [puppet] - 10https://gerrit.wikimedia.org/r/499028 (https://phabricator.wikimedia.org/T196336) (owner: 10CDanis) [13:24:46] (03PS1) 10Filippo Giunchedi: phabricator: send apache access logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/499188 (https://phabricator.wikimedia.org/T219278) [13:26:40] (03PS1) 10Vgutierrez: acme_chief: Issue wikiba.se certificate [puppet] - 10https://gerrit.wikimedia.org/r/499189 (https://phabricator.wikimedia.org/T213705) [13:28:05] (03CR) 10Filippo Giunchedi: [C: 03+2] phabricator: send apache access logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/499188 (https://phabricator.wikimedia.org/T219278) (owner: 
10Filippo Giunchedi) [13:33:21] (03PS1) 10Mholloway: Add WikimediaEditorTasks tables to the private tables list [puppet] - 10https://gerrit.wikimedia.org/r/499191 (https://phabricator.wikimedia.org/T218302) [13:35:27] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.force-shard-allocation [13:35:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:49] !log gehel@cumin2001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0) [13:35:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:25] (03PS2) 10Gehel: add Icinga notes_url to various NRPE monitor checks, pt 2 [puppet] - 10https://gerrit.wikimedia.org/r/499148 (owner: 10Dzahn) [13:39:45] (03CR) 10Gehel: [C: 03+1] "LGTM for the wdqs part (I've updated the links)" [puppet] - 10https://gerrit.wikimedia.org/r/499148 (owner: 10Dzahn) [13:40:04] 10Operations, 10Icinga, 10monitoring, 10Patch-For-Review: Icinga passive checks go awol and downtime stops working - https://phabricator.wikimedia.org/T196336 (10CDanis) Yesterday I had a fun adventure through UNIX internals diagnosing why the secondary icinga host winds up with lots of stuck `nsca` proces... 
[13:40:15] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install bast2002.wikimedia.org - https://phabricator.wikimedia.org/T196665 (10Dzahn) a:03Dzahn [13:43:04] (03Abandoned) 10Mathew.onipe: Revert "multi-instance for elastic deployment-prep" [puppet] - 10https://gerrit.wikimedia.org/r/498916 (owner: 10Mathew.onipe) [13:47:45] (03PS11) 10Urbanecm: Initial configuration for hiwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498317 (https://phabricator.wikimedia.org/T218155) [13:48:52] (03CR) 10jerkins-bot: [V: 04-1] Initial configuration for hiwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498317 (https://phabricator.wikimedia.org/T218155) (owner: 10Urbanecm) [13:49:16] (03CR) 10Gergő Tisza: [C: 03+1] Add WikimediaEditorTasks tables to the private tables list [puppet] - 10https://gerrit.wikimedia.org/r/499191 (https://phabricator.wikimedia.org/T218302) (owner: 10Mholloway) [13:49:34] (03PS12) 10Urbanecm: Initial configuration for hiwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498317 (https://phabricator.wikimedia.org/T218155) [13:50:23] (03PS1) 10Dzahn: add mapped IPv6 address to bast2002 [puppet] - 10https://gerrit.wikimedia.org/r/499197 (https://phabricator.wikimedia.org/T196665) [13:50:27] (03CR) 10jerkins-bot: [V: 04-1] Initial configuration for hiwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498317 (https://phabricator.wikimedia.org/T218155) (owner: 10Urbanecm) [13:52:31] (03CR) 10Dzahn: [C: 03+2] add mapped IPv6 address to bast2002 [puppet] - 10https://gerrit.wikimedia.org/r/499197 (https://phabricator.wikimedia.org/T196665) (owner: 10Dzahn) [13:53:27] (03PS2) 10Dzahn: add mapped IPv6 address to bast2002 [puppet] - 10https://gerrit.wikimedia.org/r/499197 (https://phabricator.wikimedia.org/T196665) [13:53:58] (03PS13) 10Urbanecm: Initial configuration for hiwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498317 (https://phabricator.wikimedia.org/T218155) 
[13:54:53] (03CR) 10jerkins-bot: [V: 04-1] Initial configuration for hiwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498317 (https://phabricator.wikimedia.org/T218155) (owner: 10Urbanecm) [13:57:44] ACKNOWLEDGEMENT - Check Varnish expiry mailbox lag on cp3036 is CRITICAL: CRITICAL: expiry mailbox lag is 3356765 Ema Due to be cron-restarted at 17:23 today. Ema staring at things. [13:58:03] (03CR) 10Alex Monk: [C: 03+1] acme_chief: Clean old file based certificate files (1/2) [puppet] - 10https://gerrit.wikimedia.org/r/498920 (https://phabricator.wikimedia.org/T207295) (owner: 10Vgutierrez) [14:00:45] (03PS1) 10Dzahn: add IPv6 records for bast2002 [dns] - 10https://gerrit.wikimedia.org/r/499200 (https://phabricator.wikimedia.org/T196665) [14:01:41] !log rolling update of passenger on puppet masters [14:01:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:08] (03PS1) 10Vgutierrez: acme_chief: Issue the non-canonical redirect certificates [puppet] - 10https://gerrit.wikimedia.org/r/499201 (https://phabricator.wikimedia.org/T213705) [14:02:10] (03PS2) 10Dzahn: add IPv6 records for bast2002 [dns] - 10https://gerrit.wikimedia.org/r/499200 (https://phabricator.wikimedia.org/T196665) [14:03:23] (03CR) 10Dzahn: [C: 03+2] "inet6 2620:0:860:2:208:80:153:54/64" [dns] - 10https://gerrit.wikimedia.org/r/499200 (https://phabricator.wikimedia.org/T196665) (owner: 10Dzahn) [14:03:25] (03PS1) 10ArielGlenn: add wikidata entity dumps settings to its config file, make labs version [puppet] - 10https://gerrit.wikimedia.org/r/499203 (https://phabricator.wikimedia.org/T205825) [14:03:53] (03PS1) 10Ottomata: eventgate-analytics - use different batch settings for kafka producers [deployment-charts] - 10https://gerrit.wikimedia.org/r/499204 (https://phabricator.wikimedia.org/T219032) [14:04:22] (03CR) 10jerkins-bot: [V: 04-1] add wikidata entity dumps settings to its config file, make labs version [puppet] - 
10https://gerrit.wikimedia.org/r/499203 (https://phabricator.wikimedia.org/T205825) (owner: 10ArielGlenn) [14:06:35] PROBLEM - puppet last run on mw2220 is CRITICAL: CRITICAL: Puppet has 16 failures. Last run 3 minutes ago with 16 failures. Failed resources (up to 3 shown): File[/etc/profile.d/mysql-ps1.sh],File[/etc/profile.d/bash_autologout.sh],File[/etc/profile.d/field.sh],File[/usr/local/bin/gen_fingerprints] [14:07:33] (03CR) 10Dzahn: "host 2620:0:860:2:208:80:153:54" [dns] - 10https://gerrit.wikimedia.org/r/499200 (https://phabricator.wikimedia.org/T196665) (owner: 10Dzahn) [14:07:43] PROBLEM - puppet last run on mw1245 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:07:49] (03CR) 10Alex Monk: " I'm a bit worried about what effect that one may have on our account limits" [puppet] - 10https://gerrit.wikimedia.org/r/499201 (https://phabricator.wikimedia.org/T213705) (owner: 10Vgutierrez) [14:07:51] PROBLEM - puppet last run on db1124 is CRITICAL: CRITICAL: Puppet has 19 failures. Last run 3 minutes ago with 19 failures. Failed resources (up to 3 shown): File[/usr/lib/nagios/plugins/check_established_connections],File[/usr/lib/nagios/plugins/check-fresh-files-in-dir.py],File[/usr/local/lib/nagios/plugins/check_systemd_state],File[/etc/smartmontools/run.d/20logger] [14:07:52] runs puppet on mw2220 [14:07:59] PROBLEM - puppet last run on cp5006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/home/robh] [14:08:07] eh.. [14:08:11] PROBLEM - puppet last run on mw2214 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:08:33] PROBLEM - puppet last run on analytics1074 is CRITICAL: CRITICAL: Puppet has 15 failures. Last run 3 minutes ago with 15 failures. 
Failed resources (up to 3 shown): File[/etc/profile.d/mysql-ps1.sh],File[/etc/profile.d/bash_autologout.sh],File[/etc/profile.d/field.sh],File[/usr/local/bin/gen_fingerprints] [14:08:40] i just rolled out an updated version of passenger had been running on rhodium since yesterday but could be that [14:08:47] ooh.. ah [14:08:52] (03CR) 10Jcrespo: "Question- is it possible that in the future this could be enabled on other wikibase installations (the ones I am thinking right now is Wik" [puppet] - 10https://gerrit.wikimedia.org/r/499191 (https://phabricator.wikimedia.org/T218302) (owner: 10Mholloway) [14:08:54] (03CR) 10Alexandros Kosiaris: [C: 03+2] network::constants: remove seemingly unused bastion_hosts extra hiera key [puppet] - 10https://gerrit.wikimedia.org/r/499172 (owner: 10Alex Monk) [14:09:05] PROBLEM - puppet last run on restbase-dev1004 is CRITICAL: CRITICAL: Puppet has 17 failures. Last run 4 minutes ago with 17 failures. Failed resources (up to 3 shown): File[/usr/local/bin/drain],File[/usr/local/bin/decommission],File[/etc/sysctl.d],File[/etc/rsyslog.d/10-puppet-agent.conf] [14:09:05] (03PS2) 10Alexandros Kosiaris: network::constants: remove seemingly unused bastion_hosts extra hiera key [puppet] - 10https://gerrit.wikimedia.org/r/499172 (owner: 10Alex Monk) [14:09:11] PROBLEM - puppet last run on wtp1029 is CRITICAL: CRITICAL: Puppet has 40 failures. Last run 4 minutes ago with 40 failures. Failed resources (up to 3 shown): File[/home/bblack],File[/home/andrew],File[/home/faidon],File[/home/rush] [14:09:13] PROBLEM - puppet last run on cumin1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:09:25] PROBLEM - puppet last run on mw1279 is CRITICAL: CRITICAL: Puppet has 4 failures. Last run 4 minutes ago with 4 failures. 
Failed resources (up to 3 shown): File[/usr/local/bin/apt2xml],File[/usr/local/bin/puppet-enabled],File[/usr/local/bin/prometheus-puppet-agent-stats],File[/etc/rsyslog.d] [14:09:26] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] network::constants: remove seemingly unused bastion_hosts extra hiera key [puppet] - 10https://gerrit.wikimedia.org/r/499172 (owner: 10Alex Monk) [14:09:37] PROBLEM - puppet last run on an-master1002 is CRITICAL: CRITICAL: Puppet has 32 failures. Last run 4 minutes ago with 32 failures. Failed resources (up to 3 shown): File[/home/jdrewniak],File[/home/ejegg],File[/home/jdcc],File[/home/bmansurov] [14:09:37] PROBLEM - puppet last run on lvs2006 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 6 minutes ago with 3 failures. Failed resources (up to 3 shown): File[/etc/logrotate.d/pybal],File[/etc/rsyslog.d/10-puppet-agent.conf],File[/etc/sysctl.d] [14:09:39] PROBLEM - puppet last run on dns2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:09:44] jbond42: i see no issue when running it manually on one of them [14:09:45] PROBLEM - puppet last run on labtestcontrol2003 is CRITICAL: CRITICAL: Puppet has 5 failures. Last run 4 minutes ago with 5 failures. Failed resources (up to 3 shown): File[/usr/local/sbin/wmcs-wikireplica-dns],File[/usr/local/sbin/wmcs-makedomain],File[/usr/local/sbin/wmcs-webproxy],File[/usr/local/sbin/wmcs-updateproxies] [14:09:45] PROBLEM - puppet last run on cp3043 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:09:49] PROBLEM - puppet last run on ms-be1029 is CRITICAL: CRITICAL: Puppet has 7 failures. Last run 4 minutes ago with 7 failures. Failed resources (up to 3 shown): File[/etc/profile.d/mysql-ps1.sh],File[/home/jmm],File[/home/jynus],File[/home/ema] [14:10:10] mutante: me neither [14:10:41] PROBLEM - puppet last run on restbase2007 is CRITICAL: CRITICAL: Puppet has 12 failures. 
Last run 7 minutes ago with 12 failures. Failed resources (up to 3 shown): File[/etc/vim/vimrc.local],File[/usr/local/bin/phaste],File[/root/.screenrc],File[/usr/local/lib/nagios/plugins/] [14:10:49] PROBLEM - puppet last run on ms-be1016 is CRITICAL: CRITICAL: Puppet has 13 failures. Last run 5 minutes ago with 13 failures. Failed resources (up to 3 shown): File[/usr/local/bin/apt-upgrade-activity],File[/usr/local/lib/nagios/plugins/check_ipmi_sensor],File[/usr/local/lib/nagios/plugins/check_hpssacli],File[/usr/local/lib/nagios/plugins/get-raid-status-hpssacli] [14:11:43] PROBLEM - puppet last run on ms-be1014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:11:46] (03CR) 10Alex Monk: [C: 03+1] acme_chief: Clean old file based certificate files (2/2) [puppet] - 10https://gerrit.wikimedia.org/r/498921 (https://phabricator.wikimedia.org/T207295) (owner: 10Vgutierrez) [14:11:51] RECOVERY - puppet last run on mw2220 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [14:11:57] (03PS2) 10ArielGlenn: add wikidata entity dumps settings to its config file, make labs version [puppet] - 10https://gerrit.wikimedia.org/r/499203 (https://phabricator.wikimedia.org/T205825) [14:12:09] PROBLEM - puppet last run on cloudvirt1014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:12:15] run puppet lint locally; fix the whines; amend the commit; push it; forget to have git added before the amend :-/ [14:12:21] PROBLEM - puppet last run on druid1001 is CRITICAL: CRITICAL: Puppet has 17 failures. Last run 7 minutes ago with 17 failures. 
Failed resources (up to 3 shown): File[/home/ayounsi],File[/home/herron],File[/home/aborrero],File[/home/bstorm] [14:13:09] (03CR) 10jerkins-bot: [V: 04-1] add wikidata entity dumps settings to its config file, make labs version [puppet] - 10https://gerrit.wikimedia.org/r/499203 (https://phabricator.wikimedia.org/T205825) (owner: 10ArielGlenn) [14:13:27] RECOVERY - puppet last run on mw2214 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [14:14:11] PROBLEM - puppet last run on bast5001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/field.sh] [14:14:27] were these runs in flight when the a puppetmaster was restarted? [14:14:36] a puppetmaster* [14:14:38] looks like it, yes [14:14:43] Failed to open TCP connection to puppet:8140 (Connection refused - connect(2) for "puppet" port 8140) [14:14:46] it seems so yes [14:14:53] RECOVERY - puppet last run on an-master1002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [14:14:59] RECOVERY - puppet last run on ms-be1029 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:15:11] (03PS3) 10ArielGlenn: add wikidata entity dumps settings to its config file, make labs version [puppet] - 10https://gerrit.wikimedia.org/r/499203 (https://phabricator.wikimedia.org/T205825) [14:15:29] ack, cloudvirt1014 is fine [14:15:59] (03CR) 10Ottomata: [V: 03+2 C: 03+2] eventgate-analytics - use different batch settings for kafka producers [deployment-charts] - 10https://gerrit.wikimedia.org/r/499204 (https://phabricator.wikimedia.org/T219032) (owner: 10Ottomata) [14:16:03] RECOVERY - puppet last run on ms-be1016 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [14:16:07] (03CR) 10jerkins-bot: [V: 04-1] add wikidata entity dumps settings to its config file, make labs version [puppet] - 10https://gerrit.wikimedia.org/r/499203 
(https://phabricator.wikimedia.org/T205825) (owner: 10ArielGlenn) [14:16:57] PROBLEM - Check systemd state on puppetmaster1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:17:12] (03PS1) 10Ottomata: eventgate-analytics - 0.0.19 [deployment-charts] - 10https://gerrit.wikimedia.org/r/499205 [14:17:25] RECOVERY - puppet last run on cloudvirt1014 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:17:35] RECOVERY - puppet last run on druid1001 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [14:17:37] i ran it on a couple,. they were all fine [14:18:21] RECOVERY - puppet last run on db1124 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [14:18:31] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 35.71% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [14:19:22] fwiw, if there are more restarts to do, temporarily disabling puppet agents during the master restarts via cumin is one way to avoid a flurry of alerts [14:19:23] RECOVERY - puppet last run on bast5001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:19:42] (03PS4) 10ArielGlenn: add wikidata entity dumps settings to its config file, make labs version [puppet] - 10https://gerrit.wikimedia.org/r/499203 (https://phabricator.wikimedia.org/T205825) [14:20:26] (03CR) 10Ottomata: [V: 03+2 C: 03+2] eventgate-analytics - 0.0.19 [deployment-charts] - 10https://gerrit.wikimedia.org/r/499205 (owner: 10Ottomata) [14:20:28] (03CR) 10jerkins-bot: [V: 04-1] add wikidata entity dumps settings to its config file, make labs version [puppet] - 10https://gerrit.wikimedia.org/r/499203 (https://phabricator.wikimedia.org/T205825) (owner: 10ArielGlenn) [14:20:31] herron: they are all done now thanks but yes will definetly keep in mind in 
future, sorry for the noise [14:20:54] (03PS1) 10Dzahn: add bastionhost role to bast2002 [puppet] - 10https://gerrit.wikimedia.org/r/499207 (https://phabricator.wikimedia.org/T196665) [14:21:21] PROBLEM - puppet last run on lvs5001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:21:26] (03CR) 10Volans: [C: 04-1] "I don't think it works on failover, see inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/493774 (https://phabricator.wikimedia.org/T215229) (owner: 10CRusnov) [14:21:38] 👍 [14:21:45] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is CRITICAL: 1.859e+05 gt 1e+05 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [14:23:01] PROBLEM - puppet last run on wdqs2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:23:07] (03PS1) 10Andrew Bogott: test patch -- do not merge [puppet] - 10https://gerrit.wikimedia.org/r/499208 [14:23:34] jbond42: puppetmaster1001 says systemd degraded, fwiw [14:23:40] sorry, 1002 [14:23:52] mutante: yes thanks looking now [14:24:12] (03CR) 10jerkins-bot: [V: 04-1] test patch -- do not merge [puppet] - 10https://gerrit.wikimedia.org/r/499208 (owner: 10Andrew Bogott) [14:24:14] (03CR) 10Mholloway: "Hmm, that's an excellent question. 
The next step planned for this extension is to suggest Commons images missing captions; building that i" [puppet] - 10https://gerrit.wikimedia.org/r/499191 (https://phabricator.wikimedia.org/T218302) (owner: 10Mholloway) [14:26:05] (03PS2) 10Vgutierrez: acme_chief: Issue the global unified wildcard certificate [puppet] - 10https://gerrit.wikimedia.org/r/499185 (https://phabricator.wikimedia.org/T213705) [14:26:35] (03PS5) 10ArielGlenn: add wikidata entity dumps settings to its config file, make labs version [puppet] - 10https://gerrit.wikimedia.org/r/499203 (https://phabricator.wikimedia.org/T205825) [14:26:38] some days it does not pay to get out of bed and type anything, at least not for commit [14:27:07] RECOVERY - Check systemd state on puppetmaster1002 is OK: OK - running: The system is fully operational [14:28:59] (03CR) 10Andrew Bogott: "I'm not sure I understand the objections. It's true that we cannot make compiler runs for mixed groups of cloud VMs and production hosts," [puppet] - 10https://gerrit.wikimedia.org/r/499026 (owner: 10Andrew Bogott) [14:33:26] (03PS1) 10Odder: Correct logos for the Gujarati Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499210 [14:34:19] RECOVERY - puppet last run on cp5006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:35:21] RECOVERY - puppet last run on restbase-dev1004 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [14:35:27] RECOVERY - puppet last run on wtp1029 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [14:35:29] RECOVERY - puppet last run on cumin1001 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [14:35:41] RECOVERY - puppet last run on mw1279 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:35:55] RECOVERY - puppet last run on lvs2006 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:35:57] RECOVERY 
- puppet last run on dns2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:36:00] (03CR) 10GTirloni: [C: 03+1] WMCS: introduce sssd, replacing nscd/nslcd (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/498359 (https://phabricator.wikimedia.org/T218126) (owner: 10Arturo Borrero Gonzalez) [14:36:01] RECOVERY - puppet last run on labtestcontrol2003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:36:03] RECOVERY - puppet last run on cp3043 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [14:36:23] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [14:36:59] RECOVERY - puppet last run on restbase2007 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [14:37:20] (03CR) 10Alexandros Kosiaris: [C: 04-2] "This violates our rule of" [puppet] - 10https://gerrit.wikimedia.org/r/498796 (owner: 10Alex Monk) [14:38:03] RECOVERY - puppet last run on ms-be1014 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [14:39:17] RECOVERY - puppet last run on mw1245 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [14:40:07] RECOVERY - puppet last run on analytics1074 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [14:44:12] (03PS5) 10GTirloni: wmcs: Add .py extension to various scripts [puppet] - 10https://gerrit.wikimedia.org/r/498379 (https://phabricator.wikimedia.org/T144169) [14:44:52] (03CR) 10Zfilipin: [C: 03+1] "Is this still relevant? Should it be merged or abandoned?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/482640 (https://phabricator.wikimedia.org/T167432) (owner: 10Hashar) [14:45:16] (03CR) 10jerkins-bot: [V: 04-1] wmcs: Add .py extension to various scripts [puppet] - 10https://gerrit.wikimedia.org/r/498379 (https://phabricator.wikimedia.org/T144169) (owner: 10GTirloni) [14:46:37] (03PS32) 10CRusnov: Add system timer for running ganeti->netbox sync. [puppet] - 10https://gerrit.wikimedia.org/r/493774 (https://phabricator.wikimedia.org/T215229) [14:46:50] 10Operations, 10wikitech.wikimedia.org: wikitech-static cert renewal seems to stop apache2 - https://phabricator.wikimedia.org/T214640 (10CDanis) [14:47:22] (03PS2) 10Vgutierrez: acme_chief: Issue wikiba.se certificate [puppet] - 10https://gerrit.wikimedia.org/r/499189 (https://phabricator.wikimedia.org/T213705) [14:47:24] (03PS2) 10Vgutierrez: acme_chief: Issue the non-canonical redirect certificates [puppet] - 10https://gerrit.wikimedia.org/r/499201 (https://phabricator.wikimedia.org/T213705) [14:47:43] RECOVERY - puppet last run on lvs5001 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [14:49:04] (03PS1) 10Andrew Bogott: Cloud monitoring: use the new sge grid master as a testing canary [puppet] - 10https://gerrit.wikimedia.org/r/499212 [14:49:18] (03PS6) 10GTirloni: wmcs: Add .py extension to various scripts [puppet] - 10https://gerrit.wikimedia.org/r/498379 (https://phabricator.wikimedia.org/T144169) [14:49:19] RECOVERY - puppet last run on wdqs2004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [14:49:20] (03PS33) 10CRusnov: Add system timer for running ganeti->netbox sync. 
[puppet] - 10https://gerrit.wikimedia.org/r/493774 (https://phabricator.wikimedia.org/T215229) [14:50:55] (03PS2) 10Andrew Bogott: Cloud monitoring: use the new sge grid master as a testing canary [puppet] - 10https://gerrit.wikimedia.org/r/499212 [14:51:15] 10Operations, 10Analytics, 10Analytics-Kanban, 10EventBus, and 3 others: eventgate-analytics k8s pods occasionally can't produce to kafka - https://phabricator.wikimedia.org/T218268 (10akosiaris) >>! In T218268#5049090, @Ottomata wrote: > > `apt-get update` couldn't connect to the apt source IPv6 addresse... [14:52:08] (03CR) 10Alex Monk: "So what you're saying is, let's abolish any concept of a special_hosts variable for multiple different types of host, and have all uses of" [puppet] - 10https://gerrit.wikimedia.org/r/498796 (owner: 10Alex Monk) [14:52:59] (03CR) 10Gergő Tisza: [C: 03+1] "Yeah, there are plans to use other databases (Wikibase tables on Commons, but possibly non-wikibase tables too). That won't change the ext" [puppet] - 10https://gerrit.wikimedia.org/r/499191 (https://phabricator.wikimedia.org/T218302) (owner: 10Mholloway) [14:53:43] 10Operations, 10Analytics, 10Analytics-Kanban, 10EventBus, and 3 others: eventgate-analytics k8s pods occasionally can't produce to kafka - https://phabricator.wikimedia.org/T218268 (10Ottomata) OF COURSE! Thank you Alex. Since (IIUC), the preferred default is that all servers get IPv6 addies, we should... 
[14:55:10] (03PS6) 10ArielGlenn: add wikidata entity dumps settings to its config file, make labs version [puppet] - 10https://gerrit.wikimedia.org/r/499203 (https://phabricator.wikimedia.org/T205825) [14:56:00] (03CR) 10Andrew Bogott: [C: 03+2] Cloud monitoring: use the new sge grid master as a testing canary [puppet] - 10https://gerrit.wikimedia.org/r/499212 (owner: 10Andrew Bogott) [14:56:45] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 57.14% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [14:56:46] (03PS1) 10CDanis: nsca-fail: add debug task note [puppet] - 10https://gerrit.wikimedia.org/r/499215 (https://phabricator.wikimedia.org/T196336) [14:56:57] (03CR) 10Alex Monk: "If so, it sounds like a fairly simple way around the problem. If I started working on doing that, would people be willing to review patche" [puppet] - 10https://gerrit.wikimedia.org/r/498796 (owner: 10Alex Monk) [14:57:31] (03CR) 10BBlack: [C: 03+1] Allow LE issue the global unified wildcard certificate [dns] - 10https://gerrit.wikimedia.org/r/499154 (https://phabricator.wikimedia.org/T213705) (owner: 10Vgutierrez) [14:58:44] (03CR) 10BBlack: [C: 04-1] Add CAA records for wikiba.se (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/499155 (https://phabricator.wikimedia.org/T213705) (owner: 10Vgutierrez) [14:58:45] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/pro [14:58:57] (03CR) 10BBlack: [C: 03+1] Allow LE issue the non-canonical redirects service certficate [dns] - 
10https://gerrit.wikimedia.org/r/499156 (https://phabricator.wikimedia.org/T213705) (owner: 10Vgutierrez) [14:59:39] (03PS2) 10Dzahn: add bastionhost role to bast2002 [puppet] - 10https://gerrit.wikimedia.org/r/499207 (https://phabricator.wikimedia.org/T196665) [15:00:01] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is OK: (C)1e+05 gt (W)1e+04 gt 18 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [15:00:21] (03CR) 10Dzahn: [C: 03+2] add bastionhost role to bast2002 [puppet] - 10https://gerrit.wikimedia.org/r/499207 (https://phabricator.wikimedia.org/T196665) (owner: 10Dzahn) [15:00:48] (03PS7) 10GTirloni: wmcs: Add .py extension to various scripts [puppet] - 10https://gerrit.wikimedia.org/r/498379 (https://phabricator.wikimedia.org/T144169) [15:01:11] (03PS2) 10CDanis: nsca-fail: add debug task note [puppet] - 10https://gerrit.wikimedia.org/r/499215 (https://phabricator.wikimedia.org/T196336) [15:01:17] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [15:02:04] RECOVERY - Check for gridmaster host resolution TCP on labservices1001 is OK: DNS OK - 0.012 seconds response time (tools-sgegrid-master.tools.eqiad.wmflabs. 60 IN A 172.16.4.197) https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:02:18] RECOVERY - Check for gridmaster host resolution UDP on labservices1002 is OK: DNS OK - 0.016 seconds response time (tools-sgegrid-master.tools.eqiad.wmflabs. 
60 IN A 172.16.4.197) https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:02:23] (03CR) 10CDanis: [C: 03+2] nsca-fail: add debug task note [puppet] - 10https://gerrit.wikimedia.org/r/499215 (https://phabricator.wikimedia.org/T196336) (owner: 10CDanis) [15:02:30] (03CR) 10CRusnov: "Thanks, i have checked systemd::timer::job, and it appears to support absent, and thus the latest PS addresses this." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/493774 (https://phabricator.wikimedia.org/T215229) (owner: 10CRusnov) [15:04:38] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/pro [15:07:00] (03CR) 10Mholloway: "Yes, I'd agree that disabling temporarily to move tables later is acceptable in the event we need to do so." [puppet] - 10https://gerrit.wikimedia.org/r/499191 (https://phabricator.wikimedia.org/T218302) (owner: 10Mholloway) [15:07:53] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.force-shard-allocation [15:07:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:24] !log gehel@cumin2001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0) [15:08:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:22] RECOVERY - Check for gridmaster host resolution TCP on labservices1002 is OK: DNS OK - 0.015 seconds response time (tools-sgegrid-master.tools.eqiad.wmflabs. 
60 IN A 172.16.4.197) https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:10:27] (03PS1) 10Giuseppe Lavagetto: arclamp: make arclamp-grep work with excimer logs as well [puppet] - 10https://gerrit.wikimedia.org/r/499219 (https://phabricator.wikimedia.org/T176916) [15:10:35] (03PS1) 10Giuseppe Lavagetto: arclamp: abstract arclamp::instance out of arclamp [puppet] - 10https://gerrit.wikimedia.org/r/499220 (https://phabricator.wikimedia.org/T176916) [15:10:38] (03PS1) 10Giuseppe Lavagetto: arclamp: remove previously absented files [puppet] - 10https://gerrit.wikimedia.org/r/499221 [15:10:40] (03PS1) 10Giuseppe Lavagetto: arclamp: add a second instance for excimer logs [puppet] - 10https://gerrit.wikimedia.org/r/499222 (https://phabricator.wikimedia.org/T176916) [15:11:50] (03CR) 10jerkins-bot: [V: 04-1] arclamp: abstract arclamp::instance out of arclamp [puppet] - 10https://gerrit.wikimedia.org/r/499220 (https://phabricator.wikimedia.org/T176916) (owner: 10Giuseppe Lavagetto) [15:12:51] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM! 
I'd like Keith's opinion too" [puppet] - 10https://gerrit.wikimedia.org/r/498417 (https://phabricator.wikimedia.org/T211125) (owner: 10Ppchelko) [15:13:28] (03CR) 10Arturo Borrero Gonzalez: "LGTM, but:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/498379 (https://phabricator.wikimedia.org/T144169) (owner: 10GTirloni) [15:14:13] (03PS1) 10Alexandros Kosiaris: Kubernetes default policy: Add Kafka IPv6 addresses [puppet] - 10https://gerrit.wikimedia.org/r/499223 (https://phabricator.wikimedia.org/T218268) [15:14:42] (03PS1) 10Dzahn: turn bast2001 into a spare, replaced by bast2002 [puppet] - 10https://gerrit.wikimedia.org/r/499224 (https://phabricator.wikimedia.org/T196665) [15:16:06] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [15:17:03] (03PS2) 10Giuseppe Lavagetto: arclamp: abstract arclamp::instance out of arclamp [puppet] - 10https://gerrit.wikimedia.org/r/499220 (https://phabricator.wikimedia.org/T176916) [15:17:05] (03PS2) 10Giuseppe Lavagetto: arclamp: remove previously absented files [puppet] - 10https://gerrit.wikimedia.org/r/499221 [15:17:07] (03PS2) 10Giuseppe Lavagetto: arclamp: add a second instance for excimer logs [puppet] - 10https://gerrit.wikimedia.org/r/499222 (https://phabricator.wikimedia.org/T176916) [15:19:10] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [15:19:31] (03CR) 10MSantos: "> Patch Set 6:" [puppet] - 10https://gerrit.wikimedia.org/r/495735 (https://phabricator.wikimedia.org/T215523) (owner: 10MSantos) [15:19:46] (03PS7) 10ArielGlenn: add wikidata entity dumps settings to its config file, make labs version [puppet] - 10https://gerrit.wikimedia.org/r/499203 (https://phabricator.wikimedia.org/T205825) [15:20:15] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.force-shard-allocation [15:20:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:36] !log gehel@cumin2001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0) [15:20:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:44] RECOVERY - Check for gridmaster host resolution UDP on labservices1001 is OK: DNS OK - 0.013 seconds response time (tools-sgegrid-master.tools.eqiad.wmflabs. 60 IN A 172.16.4.197) https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:25:34] (03PS7) 10MSantos: Pass flag use_nodejs10 for maps services [puppet] - 10https://gerrit.wikimedia.org/r/495735 (https://phabricator.wikimedia.org/T215523) [15:26:53] (03PS1) 10Mholloway: Enable WikimediaEditorTasks on testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499227 (https://phabricator.wikimedia.org/T218136) [15:27:01] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [15:27:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:03] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [15:27:03] !log otto@deploy1001 scap-helm eventgate-analytics finished [15:27:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:06] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:29] (03CR) 10Mholloway: [C: 04-1] "hold until deploy time" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499227 (https://phabricator.wikimedia.org/T218136) (owner: 10Mholloway) [15:27:41] (03CR) 10Alexandros Kosiaris: [C: 04-2] "> So what you're saying is, let's abolish any concept of a special_hosts variable for multiple different types of host, and have all uses " [puppet] - 10https://gerrit.wikimedia.org/r/498796 (owner: 10Alex Monk) [15:31:12] (03PS2) 10Vgutierrez: Add CAA records for wikiba.se [dns] - 10https://gerrit.wikimedia.org/r/499155 (https://phabricator.wikimedia.org/T213705) [15:31:14] (03PS2) 10Vgutierrez: Allow LE issue the non-canonical redirects service certificate [dns] - 10https://gerrit.wikimedia.org/r/499156 (https://phabricator.wikimedia.org/T213705) [15:31:16] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:31:27] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [15:31:28] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [15:31:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:29] !log otto@deploy1001 scap-helm eventgate-analytics finished [15:31:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:40] !log gehel@cumin2001 END (FAIL) - Cookbook sre.elasticsearch.e6-upgrade (exit_code=99) [15:31:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:46] (03PS8) 10GTirloni: wmcs: Add .py extension to various scripts [puppet] - 10https://gerrit.wikimedia.org/r/498379 (https://phabricator.wikimedia.org/T144169) 
[15:32:10] (03PS8) 10ArielGlenn: add wikidata entity dumps settings to its config file, make labs version [puppet] - 10https://gerrit.wikimedia.org/r/499203 (https://phabricator.wikimedia.org/T205825) [15:32:35] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [15:32:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:37] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [15:32:37] !log otto@deploy1001 scap-helm eventgate-analytics finished [15:32:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:54] (03CR) 10Alex Monk: "Well, we'll see how I get on. I think I'm going to start with cumin_masters which has triggered this latest round of discussion about netw" [puppet] - 10https://gerrit.wikimedia.org/r/498796 (owner: 10Alex Monk) [15:33:38] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [15:34:43] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml --dry-run --debug stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [15:34:44] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [15:34:44] !log otto@deploy1001 scap-helm eventgate-analytics finished [15:34:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:04] (03CR) 10Arturo Borrero Gonzalez: puppet-merge: merge to wmcs puppetmasters as well (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/497069 (owner: 10Andrew Bogott) [15:36:33] (03CR) 10Vgutierrez: Add CAA records for wikiba.se (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/499155 (https://phabricator.wikimedia.org/T213705) (owner: 10Vgutierrez) [15:36:54] 10Operations, 10Traffic, 10serviceops, 10User-jijiki: Allow directing a percentage of API traffic to PHP7 - https://phabricator.wikimedia.org/T219129 (10jijiki) [15:36:56] (03CR) 10Vgutierrez: [C: 03+2] Allow LE issue the global unified wildcard certificate [dns] - 10https://gerrit.wikimedia.org/r/499154 (https://phabricator.wikimedia.org/T213705) (owner: 10Vgutierrez) [15:37:12] (03PS2) 10Vgutierrez: Allow LE issue the global unified wildcard certificate [dns] - 10https://gerrit.wikimedia.org/r/499154 (https://phabricator.wikimedia.org/T213705) [15:38:44] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/pro [15:39:32] (03PS9) 10ArielGlenn: add wikidata entity dumps settings to its config file, make labs version [puppet] - 10https://gerrit.wikimedia.org/r/499203 (https://phabricator.wikimedia.org/T205825) [15:39:34] I should really just give up on doing any changesets today given how it's going, but I'm digging in my heels now >_< [15:40:48] !log otto@deploy1001 scap-helm eventgate-analytics upgrade --help [namespace: eventgate-analytics, clusters: staging] [15:40:48] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [15:40:48] !log otto@deploy1001 scap-helm eventgate-analytics finished [15:40:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:50] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:24] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.force-shard-allocation [15:43:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:42] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging --version 52 -f eventgate-analytics-staging-values.yaml stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [15:43:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:53] !log gehel@cumin2001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0) [15:43:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:01] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging --version 0.0.16 -f eventgate-analytics-staging-values.yaml stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [15:44:02] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [15:44:03] !log otto@deploy1001 scap-helm eventgate-analytics finished [15:44:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:34] (03CR) 10Bstorm: "Created T219287 to circle back and test/fix details in maintain-kubeusers. 
I've found scripts that break on some versions of python aroun" [puppet] - 10https://gerrit.wikimedia.org/r/498379 (https://phabricator.wikimedia.org/T144169) (owner: 10GTirloni) [15:45:27] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging --version 0.0.16 -f eventgate-analytics-staging-values.yaml stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [15:45:28] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [15:45:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:29] !log otto@deploy1001 scap-helm eventgate-analytics finished [15:45:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:06] RECOVERY - Check for gridmaster host resolution TCP on cloudservices1003 is OK: DNS OK - 0.020 seconds response time (tools-sgegrid-master.tools.eqiad.wmflabs. 60 IN A 172.16.4.197) https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:47:14] RECOVERY - Check for gridmaster host resolution UDP on cloudservices1004 is OK: DNS OK - 0.016 seconds response time (tools-sgegrid-master.tools.eqiad.wmflabs. 60 IN A 172.16.4.197) https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:47:28] RECOVERY - Check for gridmaster host resolution UDP on cloudservices1003 is OK: DNS OK - 0.014 seconds response time (tools-sgegrid-master.tools.eqiad.wmflabs. 
60 IN A 172.16.4.197) https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:47:34] (03PS1) 10Lucas Werkmeister (WMDE): Update Daniel Kinzler’s email address [puppet] - 10https://gerrit.wikimedia.org/r/499230 [15:47:36] PROBLEM - Host cloudvirt1015.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:48:04] 10Operations, 10Analytics, 10Analytics-Kanban, 10EventBus, and 3 others: eventgate-analytics k8s pods occasionally can't produce to kafka - https://phabricator.wikimedia.org/T218268 (10Ottomata) After ^: ` kafka-jumbo1003.eqiad.wmnet:9092/bootstrap: Connected to ipv6#[2620:0:861:102:10:64:16:99]:9092 ` L... [15:50:09] (03CR) 10Alexandros Kosiaris: [C: 03+2] Kubernetes default policy: Add Kafka IPv6 addresses [puppet] - 10https://gerrit.wikimedia.org/r/499223 (https://phabricator.wikimedia.org/T218268) (owner: 10Alexandros Kosiaris) [15:50:30] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Tested in staging, worked fine" [puppet] - 10https://gerrit.wikimedia.org/r/499223 (https://phabricator.wikimedia.org/T218268) (owner: 10Alexandros Kosiaris) [15:51:46] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:52:32] (03CR) 10Krinkle: [C: 04-1] "The directory isn't going to be configurable. arclamp-log doesn't do that either." [puppet] - 10https://gerrit.wikimedia.org/r/499219 (https://phabricator.wikimedia.org/T176916) (owner: 10Giuseppe Lavagetto) [15:52:59] (03CR) 10Krinkle: [C: 04-1] "Once that actually exists, we can pick up this patch and work it in. 
Given it's only for manual use, it's okay to be following the change " [puppet] - 10https://gerrit.wikimedia.org/r/499219 (https://phabricator.wikimedia.org/T176916) (owner: 10Giuseppe Lavagetto) [15:54:24] (03CR) 10BryanDavis: service::uwsgi: Allow instances to disable logging config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/498516 (https://phabricator.wikimedia.org/T217932) (owner: 10BryanDavis) [15:55:11] (03CR) 10Krinkle: arclamp: abstract arclamp::instance out of arclamp (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/499220 (https://phabricator.wikimedia.org/T176916) (owner: 10Giuseppe Lavagetto) [15:55:37] (03CR) 10Bstorm: wmcs: Add .py extension to various scripts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/498379 (https://phabricator.wikimedia.org/T144169) (owner: 10GTirloni) [15:56:07] hi ops :] I'd like to close a mailing list (wikimetrics). I've searched the docs on wiki, but could not find anything related. Can you help me please? [15:56:26] (03CR) 10Krinkle: [C: 04-1] "Per CR elsewhere, I'd rather not introduce an entirely separate HTTP serving and user-interface on performance.wikimedia.org. 
They'll just" [puppet] - 10https://gerrit.wikimedia.org/r/499222 (https://phabricator.wikimedia.org/T176916) (owner: 10Giuseppe Lavagetto) [15:56:54] (03CR) 10Ayounsi: [C: 03+2] Logstash: add Icinga notifications parsing [puppet] - 10https://gerrit.wikimedia.org/r/498443 (owner: 10Ayounsi) [15:57:07] (03PS3) 10Ayounsi: Logstash: add Icinga notifications parsing [puppet] - 10https://gerrit.wikimedia.org/r/498443 [15:57:47] (03CR) 10Alexandros Kosiaris: Cron to run script to purge old CX drafts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/486454 (https://phabricator.wikimedia.org/T189091) (owner: 10KartikMistry) [15:57:53] (03CR) 10Alexandros Kosiaris: [C: 04-1] Cron to run script to purge old CX drafts [puppet] - 10https://gerrit.wikimedia.org/r/486454 (https://phabricator.wikimedia.org/T189091) (owner: 10KartikMistry) [15:58:01] (03PS1) 10Urbanecm: Add throttle rule for Czech editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499231 (https://phabricator.wikimedia.org/T219291) [15:58:04] RECOVERY - Host cloudvirt1015.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.82 ms [16:00:04] godog and _joe_: That opportune time is upon us again. Time for a Puppet SWAT(Max 6 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190326T1600). [16:00:04] No GERRIT patches in the queue for this window AFAICS. [16:00:04] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Kanban): cloudvirt1015: apparent hardware errors in CPU/Memory - https://phabricator.wikimedia.org/T215012 (10Cmjohnson) 05Open→03Resolved The cpu has been swapped out, logged cleared return tracking USPS 9202 3945 5301... 
[16:00:46] 10Operations, 10ops-eqiad, 10DC-Ops: cloudvirt1015: update raid config and move to 10Gb - https://phabricator.wikimedia.org/T217140 (10Cmjohnson) This server is already in a 10G rack, I went to try and connect it today and while there is a 10G network card, I could figure out how to enable it. Can someone e... [16:00:56] RECOVERY - Check for gridmaster host resolution TCP on cloudservices1004 is OK: DNS OK - 0.018 seconds response time (tools-sgegrid-master.tools.eqiad.wmflabs. 60 IN A 172.16.4.197) https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:01:02] (03CR) 10BBlack: [C: 03+1] Add CAA records for wikiba.se [dns] - 10https://gerrit.wikimedia.org/r/499155 (https://phabricator.wikimedia.org/T213705) (owner: 10Vgutierrez) [16:03:02] (03CR) 10GTirloni: wmcs: Add .py extension to various scripts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/498379 (https://phabricator.wikimedia.org/T144169) (owner: 10GTirloni) [16:05:34] !log decom of labtestvirt200[12] started via T218023 [16:05:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:38] T218023: decommission: labtestvirt200[12].codfw.wmnet - https://phabricator.wikimedia.org/T218023 [16:07:04] (03CR) 10Krinkle: profile::mediawiki::maintenance: systemd-timer based periodic jobs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/482792 (https://phabricator.wikimedia.org/T211250) (owner: 10Giuseppe Lavagetto) [16:07:13] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review, 10cloud-services-team (Kanban): decommission: labtestvirt200[12].codfw.wmnet - https://phabricator.wikimedia.org/T218023 (10RobH) asw-b-codfw: ge-5/0/8 up down labtestvirt2002-eth0 ge-5/0/17 up down labtestvirt2001-eth0 ge... 
[16:07:21] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.force-shard-allocation [16:07:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:43] !log gehel@cumin2001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0) [16:07:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:07] (03CR) 10Giuseppe Lavagetto: "> Patch Set 1: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/499219 (https://phabricator.wikimedia.org/T176916) (owner: 10Giuseppe Lavagetto) [16:09:20] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - CRITICAL - logstash-syslog-tcp_10514: Servers logstash1008.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:09:34] (03CR) 10Gergő Tisza: [C: 03+1] Enable WikimediaEditorTasks on testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499227 (https://phabricator.wikimedia.org/T218136) (owner: 10Mholloway) [16:10:34] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:10:56] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review, 10cloud-services-team (Kanban): decommission: labtestvirt200[12].codfw.wmnet - https://phabricator.wikimedia.org/T218023 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for labtestvirt2001.codfw.wmnet and performed the follo... [16:11:10] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review, 10cloud-services-team (Kanban): decommission: labtestvirt200[12].codfw.wmnet - https://phabricator.wikimedia.org/T218023 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for labtestvirt2002.codfw.wmnet and performed the follo... 
[16:11:12] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [16:12:58] (03PS9) 10GTirloni: wmcs: Add .py extension to various scripts [puppet] - 10https://gerrit.wikimedia.org/r/498379 (https://phabricator.wikimedia.org/T144169) [16:13:13] 10Operations, 10ops-codfw, 10decommission, 10cloud-services-team (Kanban): decommission: labtestvirt200[12].codfw.wmnet - https://phabricator.wikimedia.org/T218023 (10RobH) [16:14:10] (03CR) 10Giuseppe Lavagetto: "> Patch Set 2: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/499222 (https://phabricator.wikimedia.org/T176916) (owner: 10Giuseppe Lavagetto) [16:14:48] (03PS10) 10ArielGlenn: add wikidata entity dumps settings to its config file, make labs version [puppet] - 10https://gerrit.wikimedia.org/r/499203 (https://phabricator.wikimedia.org/T205825) [16:15:53] (03CR) 10Jcrespo: "> That won't change the extension schema much since it is already stored in x1" [puppet] - 10https://gerrit.wikimedia.org/r/499191 (https://phabricator.wikimedia.org/T218302) (owner: 10Mholloway) [16:16:35] (03PS1) 10RobH: decom of labtestvirt200[12] [puppet] - 10https://gerrit.wikimedia.org/r/499235 (https://phabricator.wikimedia.org/T218023) [16:16:37] (03PS2) 10Jcrespo: Add WikimediaEditorTasks tables to the private tables list [puppet] - 10https://gerrit.wikimedia.org/r/499191 (https://phabricator.wikimedia.org/T218302) (owner: 10Mholloway) [16:17:37] (03CR) 10RobH: [C: 03+2] decom of labtestvirt200[12] [puppet] - 10https://gerrit.wikimedia.org/r/499235 (https://phabricator.wikimedia.org/T218023) (owner: 10RobH) [16:17:59] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [16:18:55] 10Operations, 10WMF-NDA-Requests: Volunteer NDA for Alex Monk - 
https://phabricator.wikimedia.org/T218448 (10Andrew) a:03Andrew [16:19:33] PROBLEM - MediaWiki memcached error rate on graphite1004 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [16:20:11] (03PS1) 10RobH: decom labtestvirt200[12] prod dns [dns] - 10https://gerrit.wikimedia.org/r/499237 (https://phabricator.wikimedia.org/T218023) [16:20:53] ah! [16:20:53] (03CR) 10RobH: [C: 03+2] decom labtestvirt200[12] prod dns [dns] - 10https://gerrit.wikimedia.org/r/499237 (https://phabricator.wikimedia.org/T218023) (owner: 10RobH) [16:21:15] eww =[ [16:21:27] so the memcached alert might be mcrouter doing the usual tko thing [16:22:05] yeah [16:22:14] https://grafana.wikimedia.org/d/000000549/mcrouter?panelId=9&fullscreen&orgId=1&var-source=eqiad%20prometheus%2Fops&var-cluster=api_appserver&var-instance=All [16:23:09] all for mc1022 [16:25:17] (03CR) 10Krinkle: [C: 03+1] profile::mediawiki::php: remove references to tideways [puppet] - 10https://gerrit.wikimedia.org/r/499144 (owner: 10Giuseppe Lavagetto) [16:25:33] (03CR) 10Krinkle: [C: 03+1] profile::mediawiki::php: raise mysql.connect_timeout to 3 seconds [puppet] - 10https://gerrit.wikimedia.org/r/499143 (https://phabricator.wikimedia.org/T211488) (owner: 10Giuseppe Lavagetto) [16:25:43] (03PS1) 10Vgutierrez: redirects.dat: Get rid of domains non controlled by WMF [puppet] - 10https://gerrit.wikimedia.org/r/499239 (https://phabricator.wikimedia.org/T213705) [16:27:21] (03PS3) 10Jcrespo: Add WikimediaEditorTasks tables to the private tables list [puppet] - 10https://gerrit.wikimedia.org/r/499191 (https://phabricator.wikimedia.org/T218302) (owner: 10Mholloway) [16:28:55] should recover soon, but based on the recent discoveries in the related task it is the interface getting saturated for a couple of seconds causing this [16:30:58] !log gilles@deploy1001 Started deploy [performance/asoranking@9a1e5ef]: 
(no justification provided) [16:31:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:50] !log gilles@deploy1001 Finished deploy [performance/asoranking@9a1e5ef]: (no justification provided) (duration: 00m 52s) [16:31:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:21] (03PS3) 10Vgutierrez: acme_chief: Issue the non-canonical redirect certificates [puppet] - 10https://gerrit.wikimedia.org/r/499201 (https://phabricator.wikimedia.org/T213705) [16:33:40] (03PS11) 10ArielGlenn: add wikidata entity dumps settings to its config file, make labs version [puppet] - 10https://gerrit.wikimedia.org/r/499203 (https://phabricator.wikimedia.org/T205825) [16:33:45] 10Operations, 10serviceops, 10Patch-For-Review: Use PHP7 to run maintenance scripts - https://phabricator.wikimedia.org/T219135 (10Krinkle) >>! In T219135#5054661, @Dzahn wrote: >>>! In T219135#5054561, @Krinkle wrote: >> Duplicate of T195392? > > Yes. Just not sure how to merge them best. Unfortunately Pha... 
[16:34:14] 10Operations, 10Core Platform Team Backlog (Watching / External), 10User-ArielGlenn: Switch cronjobs on maintenance hosts to PHP7 - https://phabricator.wikimedia.org/T195392 (10Krinkle) [16:34:19] 10Operations, 10serviceops, 10Patch-For-Review: Use PHP7 to run maintenance scripts - https://phabricator.wikimedia.org/T219135 (10Krinkle) [16:34:32] 10Operations, 10serviceops: SRE FY2019 Q4 goal: complete the transition to PHP7 - https://phabricator.wikimedia.org/T219127 (10Krinkle) [16:34:34] 10Operations, 10Core Platform Team Backlog (Watching / External), 10User-ArielGlenn: Switch cronjobs on maintenance hosts to PHP7 - https://phabricator.wikimedia.org/T195392 (10Krinkle) [16:35:21] (03CR) 10jerkins-bot: [V: 04-1] add wikidata entity dumps settings to its config file, make labs version [puppet] - 10https://gerrit.wikimedia.org/r/499203 (https://phabricator.wikimedia.org/T205825) (owner: 10ArielGlenn) [16:35:32] 10Operations, 10Core Platform Team Backlog (Watching / External), 10User-ArielGlenn: Switch cronjobs on maintenance hosts to PHP7 - https://phabricator.wikimedia.org/T195392 (10Krinkle) >>! In T219135#5053347, @gerritbot wrote: > Change 498845 **merged** by Dzahn: > [operations/puppet@production] mediawiki::... 
[16:35:57] 10Operations, 10Core Platform Team Backlog (Watching / External), 10User-ArielGlenn: Switch cronjobs on maintenance hosts to PHP7 - https://phabricator.wikimedia.org/T195392 (10Krinkle) [16:36:24] 10Operations, 10Core Platform Team Backlog (Watching / External), 10User-ArielGlenn: Switch cronjobs on maintenance hosts to PHP7 - https://phabricator.wikimedia.org/T195392 (10Krinkle) [16:36:30] 10Operations, 10serviceops, 10Patch-For-Review: Use PHP7 to run maintenance scripts - https://phabricator.wikimedia.org/T219135 (10Krinkle) [16:36:37] 10Operations, 10Services, 10Core Platform Team Backlog (Watching / External), 10User-ArielGlenn: Switch cronjobs on maintenance hosts to PHP7 - https://phabricator.wikimedia.org/T195392 (10Krinkle) [16:37:13] (03CR) 10Nikerabbit: Cron to run script to purge old CX drafts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/486454 (https://phabricator.wikimedia.org/T189091) (owner: 10KartikMistry) [16:37:51] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [16:39:39] 10Operations, 10Core Platform Team (PHP7 (TEC4)), 10Core Platform Team Kanban (Doing), 10HHVM, and 3 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (10Krinkle) [16:40:07] (03PS3) 10Vgutierrez: Allow LE issue the non-canonical redirects service certificate [dns] - 10https://gerrit.wikimedia.org/r/499156 (https://phabricator.wikimedia.org/T213705) [16:40:55] (03CR) 10Vgutierrez: [C: 03+2] Add CAA records for wikiba.se [dns] - 10https://gerrit.wikimedia.org/r/499155 (https://phabricator.wikimedia.org/T213705) (owner: 10Vgutierrez) [16:41:03] (03PS3) 10Vgutierrez: Add CAA records for wikiba.se [dns] - 10https://gerrit.wikimedia.org/r/499155 (https://phabricator.wikimedia.org/T213705) [16:41:20] (03PS12) 10ArielGlenn: add wikidata entity dumps settings to its 
config file, make labs version [puppet] - 10https://gerrit.wikimedia.org/r/499203 (https://phabricator.wikimedia.org/T205825) [16:41:25] 10Operations, 10Core Platform Team (PHP7 (TEC4)), 10Core Platform Team Kanban (Doing), 10HHVM, and 3 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (10Krinkle) [16:42:45] (03CR) 10Volans: Add system timer for running ganeti->netbox sync. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/493774 (https://phabricator.wikimedia.org/T215229) (owner: 10CRusnov) [16:43:28] really I should have just done nothing today and done this changeset in 5 minutes tomorrow when my brain has decided to stop being a jerk... [16:44:02] jerk brains are teh worst :\ [16:44:16] they are [16:44:25] and cussing them out doesn't seem to help either [16:45:26] (03CR) 10Krinkle: [C: 04-1] "It may require 1 or 2 lines to be added to arclamp-grep, but that's not a concern imho. We almost never use it and it's a small price to p" [puppet] - 10https://gerrit.wikimedia.org/r/499219 (https://phabricator.wikimedia.org/T176916) (owner: 10Giuseppe Lavagetto) [16:48:26] (03PS3) 10Cwhite: hiera: upgrade prometheus-node-exporter to 0.17 in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/490690 (https://phabricator.wikimedia.org/T213708) [16:49:24] (03CR) 10KartikMistry: [C: 03+1] "Thanks!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/499210 (owner: 10Odder) [16:50:53] 10Operations, 10Mail: SPF record for canonical domains - https://phabricator.wikimedia.org/T193408 (10Vgutierrez) [16:53:56] (03PS1) 10CRusnov: Adjust the way results are supported for out-of-scope status [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/499245 [16:57:15] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [16:57:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:17] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [16:57:17] !log otto@deploy1001 scap-helm eventgate-analytics finished [16:57:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:57] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.force-shard-allocation [16:57:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:01] !log gehel@cumin2001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0) [16:58:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:30] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.e6-upgrade [16:58:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:04] !log otto@deploy1001 scap-helm eventgate- upgrade staging -f eventgate-analytics-staging-values.yaml --dry-run --debug stable/eventgate-analytics [namespace: eventgate-, clusters: staging] [16:59:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:12] !log otto@deploy1001 scap-helm eventgate- upgrade staging -f eventgate-analytics-staging-values.yaml --dry-run --debug stable/eventgate-analytics [namespace: eventgate-, clusters: staging] 
[16:59:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:17] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml --dry-run --debug stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [16:59:19] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [16:59:19] !log otto@deploy1001 scap-helm eventgate-analytics finished [16:59:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:20] (03CR) 10Ayounsi: [C: 03+1] "Not tested, but logic and code looks good." [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/499245 (owner: 10CRusnov) [16:59:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:22] (03PS13) 10ArielGlenn: add wikidata entity dumps settings to its config file, make labs version [puppet] - 10https://gerrit.wikimedia.org/r/499203 (https://phabricator.wikimedia.org/T205825) [17:00:04] cscott, arlolra, subbu, halfak, and Amir1: #bothumor My software never has bugs. It just develops random features. Rise for Services – Graphoid / Parsoid / Citoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190326T1700). [17:01:40] (03PS1) 10Ottomata: eventgate-analytics comment fix [deployment-charts] - 10https://gerrit.wikimedia.org/r/499252 [17:01:42] 10Operations, 10Data-Services, 10decommission, 10cloud-services-team (Kanban): Reclaim/Decommission labsdb1004.eqiad.wmnet and labsdb1005.eqiad.wmnet as soon as they are ready - https://phabricator.wikimedia.org/T216749 (10Bstorm) 05Stalled→03Open a:03Bstorm I believe this is ready to move forward now. [17:01:56] (03PS34) 10CRusnov: Add system timer for running ganeti->netbox sync. 
[puppet] - 10https://gerrit.wikimedia.org/r/493774 (https://phabricator.wikimedia.org/T215229) [17:03:07] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml --dry-run --debug stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [17:03:08] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [17:03:08] !log otto@deploy1001 scap-helm eventgate-analytics finished [17:03:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:21] (03CR) 10CRusnov: Add system timer for running ganeti->netbox sync. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/493774 (https://phabricator.wikimedia.org/T215229) (owner: 10CRusnov) [17:03:49] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:05:55] (03PS13) 10KartikMistry: Cron to run script to purge old CX drafts [puppet] - 10https://gerrit.wikimedia.org/r/486454 (https://phabricator.wikimedia.org/T189091) [17:06:44] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml --dry-run --debug stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [17:06:45] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [17:06:45] !log otto@deploy1001 scap-helm eventgate-analytics finished [17:06:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:44] (03PS1) 10Thcipriani: gerrit: disable new auth healthcheck [puppet] - 
10https://gerrit.wikimedia.org/r/499254 [17:09:59] (03CR) 10CRusnov: "Compiler output looks good on both active and passive:" [puppet] - 10https://gerrit.wikimedia.org/r/493774 (https://phabricator.wikimedia.org/T215229) (owner: 10CRusnov) [17:10:01] (03PS1) 10Vgutierrez: Add default SPF record for canonical domains [dns] - 10https://gerrit.wikimedia.org/r/499255 (https://phabricator.wikimedia.org/T193408) [17:10:03] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [17:10:04] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [17:10:04] !log otto@deploy1001 scap-helm eventgate-analytics finished [17:10:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:11] (03PS1) 10Thcipriani: Bump submodules for 2.15.12 release [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/499256 [17:12:41] (03CR) 10Paladox: [C: 03+1] gerrit: disable new auth healthcheck [puppet] - 10https://gerrit.wikimedia.org/r/499254 (owner: 10Thcipriani) [17:12:41] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml --dry-run --debug stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [17:12:42] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [17:12:43] !log otto@deploy1001 scap-helm eventgate-analytics finished [17:12:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:04] (03CR) 10Paladox: [C: 03+2] 
Bump submodules for 2.15.12 release [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/499256 (owner: 10Thcipriani) [17:13:32] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.force-shard-allocation [17:13:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:19] !log gehel@cumin2001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0) [17:14:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:54] (03CR) 10CRusnov: "my bad:" [puppet] - 10https://gerrit.wikimedia.org/r/493774 (https://phabricator.wikimedia.org/T215229) (owner: 10CRusnov) [17:17:34] (03PS14) 10ArielGlenn: add wikidata entity dumps settings to its config file, make labs version [puppet] - 10https://gerrit.wikimedia.org/r/499203 (https://phabricator.wikimedia.org/T205825) [17:17:35] 10Operations, 10Mail, 10Patch-For-Review: SPF record for canonical domains - https://phabricator.wikimedia.org/T193408 (10Vgutierrez) p:05Normal→03High After almost one year I think it's time to move forward this task. With https://gerrit.wikimedia.org/r/c/operations/dns/+/499255 I suggest using the SPF... [17:18:16] (03CR) 10Dzahn: [C: 03+2] gerrit: disable new auth healthcheck [puppet] - 10https://gerrit.wikimedia.org/r/499254 (owner: 10Thcipriani) [17:19:03] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:19:50] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Wikidata-Query-Service-Sprint, 10Patch-For-Review: Make the user agent configurable for Wikidata Query Service Updater - https://phabricator.wikimedia.org/T217896 (10Smalyshev) 05Open→03Resolved [17:20:24] 10Operations, 10CirrusSearch, 10Discovery-Search (Current work): Elasticsearch 6: silence deprecation warnings to avoid logspam - https://phabricator.wikimedia.org/T219269 (10dcausse) [17:21:23] (03PS2) 10CRusnov: Adjust the way results are supported for out-of-scope status [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/499245 [17:21:32] !log arlolra@deploy1001 Started deploy [parsoid/deploy@395a214]: Updating Parsoid to f58c3d1 [17:21:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:55] (03CR) 10Krinkle: [C: 03+1] db-codfw.php: Change parsercache key (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498322 (https://phabricator.wikimedia.org/T210725) (owner: 10Marostegui) [17:23:54] (03CR) 10CRusnov: "Fixed the logic based on testing." [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/499245 (owner: 10CRusnov) [17:24:30] 10Operations, 10Office-IT, 10Research, 10Wikimedia-Mailing-lists: Create research-alerts mailing list - https://phabricator.wikimedia.org/T219309 (10bmansurov) [17:24:32] 10Operations, 10WMF-NDA-Requests: Volunteer NDA for Alex Monk - https://phabricator.wikimedia.org/T218448 (10Andrew) I've added @Krenair to the 'cloudinfra' project so that he can start working on puppetmasters. We may add him to the cloud root keys as well, as needed. 
[17:25:31] (03PS1) 10Ottomata: eventgate-analytics - use eventgate.analytics.error.validation for error stream [deployment-charts] - 10https://gerrit.wikimedia.org/r/499263 [17:25:47] (03CR) 10Ottomata: [V: 03+2 C: 03+2] eventgate-analytics comment fix [deployment-charts] - 10https://gerrit.wikimedia.org/r/499252 (owner: 10Ottomata) [17:26:10] (03CR) 10Ottomata: [V: 03+2 C: 03+2] eventgate-analytics - use eventgate.analytics.error.validation for error stream [deployment-charts] - 10https://gerrit.wikimedia.org/r/499263 (owner: 10Ottomata) [17:27:54] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/493774 (https://phabricator.wikimedia.org/T215229) (owner: 10CRusnov) [17:28:23] !log arlolra@deploy1001 Finished deploy [parsoid/deploy@395a214]: Updating Parsoid to f58c3d1 (duration: 06m 51s) [17:28:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:08] 10Operations, 10Discovery-Search, 10Elasticsearch, 10Maps: Review Elastic/maps Grafana dashboards - https://phabricator.wikimedia.org/T209812 (10Mathew.onipe) [17:29:25] (03PS35) 10CRusnov: Add system timer for running ganeti->netbox sync. 
[puppet] - 10https://gerrit.wikimedia.org/r/493774 (https://phabricator.wikimedia.org/T215229) [17:29:28] (03PS1) 10Ottomata: eventgate-analytics - 0.0.20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/499264 [17:29:45] (03PS4) 10Cwhite: hiera: upgrade prometheus-node-exporter to 0.17 in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/490690 (https://phabricator.wikimedia.org/T213708) [17:29:47] (03CR) 10Ottomata: [V: 03+2 C: 03+2] eventgate-analytics - 0.0.20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/499264 (owner: 10Ottomata) [17:29:50] 10Operations, 10Discovery-Search, 10Elasticsearch, 10Traffic, 10Patch-For-Review: Enable nginx prometheus metrics for all elastic nodes - https://phabricator.wikimedia.org/T216681 (10Mathew.onipe) [17:30:09] RECOVERY - Check Varnish expiry mailbox lag on cp3036 is OK: OK: expiry mailbox lag is 0 [17:30:35] (03CR) 10Filippo Giunchedi: [C: 03+1] hiera: upgrade prometheus-node-exporter to 0.17 in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/490690 (https://phabricator.wikimedia.org/T213708) (owner: 10Cwhite) [17:31:11] (03CR) 10Cwhite: [C: 03+2] hiera: upgrade prometheus-node-exporter to 0.17 in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/490690 (https://phabricator.wikimedia.org/T213708) (owner: 10Cwhite) [17:31:26] (03CR) 10CRusnov: [C: 03+2] Add system timer for running ganeti->netbox sync. 
[puppet] - 10https://gerrit.wikimedia.org/r/493774 (https://phabricator.wikimedia.org/T215229) (owner: 10CRusnov) [17:31:45] (03PS15) 10ArielGlenn: add wikidata entity dumps settings to its config file, make labs version [puppet] - 10https://gerrit.wikimedia.org/r/499203 (https://phabricator.wikimedia.org/T205825) [17:31:52] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [17:31:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:54] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [17:31:54] !log otto@deploy1001 scap-helm eventgate-analytics finished [17:31:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:28] (03PS36) 10CRusnov: Add system timer for running ganeti->netbox sync. [puppet] - 10https://gerrit.wikimedia.org/r/493774 (https://phabricator.wikimedia.org/T215229) [17:33:23] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml --dry-run --debug stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [17:33:24] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [17:33:24] !log otto@deploy1001 scap-helm eventgate-analytics finished [17:33:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:48] (03CR) 10Mholloway: [C: 04-1] "Should we use the local DB for testwikidatawiki? We don't want to pollute the "real" tables with testwikidatawiki data..." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/499227 (https://phabricator.wikimedia.org/T218136) (owner: 10Mholloway) [17:35:04] (03CR) 10Herron: [C: 03+1] "Looks reasonable to me. Thanks for this!" [puppet] - 10https://gerrit.wikimedia.org/r/498417 (https://phabricator.wikimedia.org/T211125) (owner: 10Ppchelko) [17:35:19] (03PS8) 10Andrew Bogott: puppet-merge: merge to wmcs puppetmasters as well [puppet] - 10https://gerrit.wikimedia.org/r/497069 [17:35:28] right [17:35:39] a completely 100% wasted day >_< [17:35:50] 110% wasted [17:35:53] so done... [17:36:39] (03CR) 10jerkins-bot: [V: 04-1] puppet-merge: merge to wmcs puppetmasters as well [puppet] - 10https://gerrit.wikimedia.org/r/497069 (owner: 10Andrew Bogott) [17:38:11] (03CR) 10Jcrespo: [C: 03+1] db-codfw.php: Change parsercache key (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498322 (https://phabricator.wikimedia.org/T210725) (owner: 10Marostegui) [17:38:35] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml --dry-run --debug stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [17:38:36] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [17:38:36] !log otto@deploy1001 scap-helm eventgate-analytics finished [17:38:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:36] !log Updated Parsoid to f58c3d1 (T219023) [17:38:37] (03CR) 10Ppchelko: "1+1=2??? Anyone? 
:)" [puppet] - 10https://gerrit.wikimedia.org/r/498417 (https://phabricator.wikimedia.org/T211125) (owner: 10Ppchelko) [17:38:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:42] T219023: Source editor and VisualEditor force piped-link creation for links when using lower-case first character and more than one word - https://phabricator.wikimedia.org/T219023 [17:38:50] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [17:38:51] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [17:38:51] !log otto@deploy1001 scap-helm eventgate-analytics finished [17:38:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:12] (03PS16) 10ArielGlenn: add wikidata entity dumps settings to its config file, make labs version [puppet] - 10https://gerrit.wikimedia.org/r/499203 (https://phabricator.wikimedia.org/T205825) [17:39:20] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml --dry-run --debug stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [17:39:21] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [17:39:22] !log otto@deploy1001 scap-helm eventgate-analytics finished [17:39:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:24] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:57] (03PS12) 10Arturo Borrero Gonzalez: WMCS: introduce sssd, replacing nscd/nslcd [puppet] - 10https://gerrit.wikimedia.org/r/498359 (https://phabricator.wikimedia.org/T218126) [17:41:47] (03CR) 10Arturo Borrero Gonzalez: WMCS: introduce sssd, replacing nscd/nslcd (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/498359 (https://phabricator.wikimedia.org/T218126) (owner: 10Arturo Borrero Gonzalez) [17:41:54] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml --reset-values stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [17:41:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:57] (03PS2) 10Thcipriani: Bump submodules for 2.15.12 release [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/499256 [17:41:57] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [17:41:57] !log otto@deploy1001 scap-helm eventgate-analytics finished [17:41:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:52] (03CR) 10Bstorm: [C: 03+1] "Let's merge and see how it goes." 
[puppet] - 10https://gerrit.wikimedia.org/r/498359 (https://phabricator.wikimedia.org/T218126) (owner: 10Arturo Borrero Gonzalez) [17:43:27] (03PS4) 10Jcrespo: Add WikimediaEditorTasks tables to the private tables list [puppet] - 10https://gerrit.wikimedia.org/r/499191 (https://phabricator.wikimedia.org/T218302) (owner: 10Mholloway) [17:43:31] !log otto@deploy1001 scap-helm eventgate-analytics upgrade production -f eventgate-analytics-codfw-values.yaml --reset-values stable/eventgate-analytics [namespace: eventgate-analytics, clusters: codfw] [17:43:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:34] !log otto@deploy1001 scap-helm eventgate-analytics cluster codfw completed [17:43:34] !log otto@deploy1001 scap-helm eventgate-analytics finished [17:43:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:20] (03CR) 10Jcrespo: "I am trying to deploy this and not to annoy people, but puppet pipeline seems to be overloaded today." [puppet] - 10https://gerrit.wikimedia.org/r/499191 (https://phabricator.wikimedia.org/T218302) (owner: 10Mholloway) [17:44:24] (03CR) 10Jcrespo: [C: 03+2] Add WikimediaEditorTasks tables to the private tables list [puppet] - 10https://gerrit.wikimedia.org/r/499191 (https://phabricator.wikimedia.org/T218302) (owner: 10Mholloway) [17:45:12] (03PS13) 10Arturo Borrero Gonzalez: WMCS: introduce sssd, replacing nscd/nslcd [puppet] - 10https://gerrit.wikimedia.org/r/498359 (https://phabricator.wikimedia.org/T218126) [17:45:16] (03CR) 10Paladox: [C: 03+2] Bump submodules for 2.15.12 release [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/499256 (owner: 10Thcipriani) [17:45:35] (03CR) 10Mholloway: [C: 04-1] "Or does prefixing prevent that from being an issue?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/499227 (https://phabricator.wikimedia.org/T218136) (owner: 10Mholloway) [17:45:37] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:45:37] !log otto@deploy1001 scap-helm eventgate-analytics upgrade production -f eventgate-analytics-eqiad-values.yaml --reset-values stable/eventgate-analytics [namespace: eventgate-analytics, clusters: eqiad] [17:45:38] !log otto@deploy1001 scap-helm eventgate-analytics cluster eqiad completed [17:45:38] !log otto@deploy1001 scap-helm eventgate-analytics finished [17:45:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:09] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] WMCS: introduce sssd, replacing nscd/nslcd (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/498359 (https://phabricator.wikimedia.org/T218126) (owner: 10Arturo Borrero Gonzalez) [17:46:25] 10Operations, 10Puppet, 10Cloud-Services, 10cloud-services-team (Kanban): Move the main WMCS puppetmaster into the Labs realm - https://phabricator.wikimedia.org/T171188 (10Krenair) a:03Krenair I'm planning to have a go at this soon. [17:47:53] (03PS17) 10ArielGlenn: add wikidata entity dumps settings to its config file, make labs version [puppet] - 10https://gerrit.wikimedia.org/r/499203 (https://phabricator.wikimedia.org/T205825) [17:48:02] (03PS14) 10Arturo Borrero Gonzalez: WMCS: introduce sssd, replacing nscd/nslcd [puppet] - 10https://gerrit.wikimedia.org/r/498359 (https://phabricator.wikimedia.org/T218126) [17:48:09] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:48:11] (03PS3) 10Thcipriani: Bump submodules for 2.15.12 release [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/499256 [17:48:13] (03PS9) 10Andrew Bogott: puppet-merge: merge to wmcs puppetmasters as well [puppet] - 10https://gerrit.wikimedia.org/r/497069 [17:48:15] 10Operations, 10ops-codfw, 10decommission, 10cloud-services-team (Kanban): decommission: labtestvirt200[12].codfw.wmnet - https://phabricator.wikimedia.org/T218023 (10RobH) a:05RobH→03Papaul [17:48:18] (03PS1) 10Andrew Bogott: cloud puppetmaster: Duplicate some hiera settings from 'main' to 'eqiad1' [puppet] - 10https://gerrit.wikimedia.org/r/499267 [17:48:33] (03CR) 10Thcipriani: [V: 03+2 C: 03+2] Bump submodules for 2.15.12 release [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/499256 (owner: 10Thcipriani) [17:49:47] (03CR) 10Andrew Bogott: "These aren't consumed anywhere at the moment, but one of them is about to be consumed by upcoming puppet-merge changes." 
[puppet] - 10https://gerrit.wikimedia.org/r/499267 (owner: 10Andrew Bogott) [17:50:24] (03CR) 10jerkins-bot: [V: 04-1] puppet-merge: merge to wmcs puppetmasters as well [puppet] - 10https://gerrit.wikimedia.org/r/497069 (owner: 10Andrew Bogott) [17:51:54] (03CR) 10Volans: "It looks sane to me, but I'd wait that we merge the Ganeti module before moving forward, apart for the clear dependency but also because i" (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/496527 (https://phabricator.wikimedia.org/T203963) (owner: 10CRusnov) [17:54:20] (03PS2) 10Giuseppe Lavagetto: arclamp: make arclamp-grep work with excimer logs as well [puppet] - 10https://gerrit.wikimedia.org/r/499219 (https://phabricator.wikimedia.org/T176916) [17:54:22] (03PS3) 10Giuseppe Lavagetto: arclamp: abstract arclamp::instance out of arclamp [puppet] - 10https://gerrit.wikimedia.org/r/499220 (https://phabricator.wikimedia.org/T176916) [17:54:24] (03PS3) 10Giuseppe Lavagetto: arclamp: remove previously absented files [puppet] - 10https://gerrit.wikimedia.org/r/499221 [17:54:26] (03PS3) 10Giuseppe Lavagetto: arclamp: add a second instance for excimer logs [puppet] - 10https://gerrit.wikimedia.org/r/499222 (https://phabricator.wikimedia.org/T176916) [17:55:03] (03PS1) 10Thcipriani: Gerrit 2.15.12 release [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/499269 [17:57:18] (03CR) 10Paladox: [C: 03+2] Gerrit 2.15.12 release (031 comment) [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/499269 (owner: 10Thcipriani) [17:58:13] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There are 3 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [17:59:31] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. 
[18:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190326T1800) [18:01:11] !log starting gerrit 2.15.12 upgrade [18:01:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:56] (03CR) 10Thcipriani: [V: 03+2 C: 03+2] Gerrit 2.15.12 release [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/499269 (owner: 10Thcipriani) [18:03:45] !log thcipriani@deploy1001 Started deploy [gerrit/gerrit@d3d2134]: Gerrit to 2.15.12 on gerrit2001 only [18:03:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:49] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS1299/IPv6: Connect, AS1299/IPv4: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:03:56] !log thcipriani@deploy1001 Finished deploy [gerrit/gerrit@d3d2134]: Gerrit to 2.15.12 on gerrit2001 only (duration: 00m 11s) [18:03:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:21] (03CR) 10Gergő Tisza: [C: 03+1] "There's no prefixing in production, since we use per-wiki databases. 
Normal users won't have wikidatawiki edits anyway, I presume; and usi" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499227 (https://phabricator.wikimedia.org/T218136) (owner: 10Mholloway) [18:04:45] !log thcipriani@deploy1001 Started deploy [gerrit/gerrit@d3d2134]: Gerrit to 2.15.12 on cobalt [18:04:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:01] !log thcipriani@deploy1001 Finished deploy [gerrit/gerrit@d3d2134]: Gerrit to 2.15.12 on cobalt (duration: 00m 15s) [18:05:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:17] !log gehel@cumin2001 END (FAIL) - Cookbook sre.elasticsearch.e6-upgrade (exit_code=99) [18:05:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:19] !log restarting gerrit on cobalt for update to 2.15.12 [18:05:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:13] PROBLEM - puppet last run on netmon1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/software/netbox-reports] [18:08:26] ^ because of ganeti afaict [18:08:55] PROBLEM - puppet last run on bromine is CRITICAL: CRITICAL: Puppet has 8 failures. Last run 3 minutes ago with 8 failures. Failed resources (up to 3 shown): Exec[git_pull_wikimedia/annualreport],Exec[git_pull_wikimedia/TransparencyReport],Exec[git_pull_wikimedia/TransparencyReport-private],Exec[git_pull_wikibase/wikiba.se-deploy] [18:09:05] "Powered by Gerrit Code Review (2.15.12-54-g49865603da)". LGTM. [18:09:18] er, gerrit :\ [18:09:24] g names everywhere. [18:09:25] makes sense [18:09:35] PROBLEM - puppet last run on kafka1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [18:09:37] !log gerrit back on version 2.15.12, upgrade complete. 
[18:09:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:52] thcipriani: Thanks! [18:09:55] (03CR) 10Hashar: "Yes that is still relevant but has to be rebased. IIRC that caused some Selenium tests targeting beta to magically fail :/" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482640 (https://phabricator.wikimedia.org/T167432) (owner: 10Hashar) [18:10:15] LDAP groups should now be fixed in PolyGerrit UI! [18:11:07] PROBLEM - puppet last run on thorium is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_wikistats-v2],Exec[git_pull_analytics.wikimedia.org] [18:11:12] (03PS2) 10Urbanecm: Add throttle rule for Czech editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499231 (https://phabricator.wikimedia.org/T219291) [18:12:06] paladox: Yup, https://gerrit.wikimedia.org/r/admin/groups/11,members now shows "ldap/wmf" in PolyGerrit UI. [18:13:27] RECOVERY - puppet last run on netmon1002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [18:14:28] :) [18:15:03] (03PS1) 10Urbanecm: Clean the throttles up [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499287 (https://phabricator.wikimedia.org/T219311) [18:16:38] (03PS1) 10CRusnov: netbox ganeti sync: Fix path to logfiles. [puppet] - 10https://gerrit.wikimedia.org/r/499288 [18:17:40] (03PS1) 10Paladox: Revert "gerrit: Disable jgit gc" [puppet] - 10https://gerrit.wikimedia.org/r/499289 [18:20:28] (03PS2) 10CRusnov: netbox ganeti sync: Fix path to logfiles. 
[puppet] - 10https://gerrit.wikimedia.org/r/499288 [18:21:27] !log starting branch cut for 1.33.0-wmf.23 (T206677) [18:21:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:30] T206677: 1.33.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T206677 [18:22:11] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is CRITICAL: 2.397e+05 gt 1e+05 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [18:22:22] (03PS14) 10Urbanecm: Initial configuration for hiwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498317 (https://phabricator.wikimedia.org/T218155) [18:24:41] PROBLEM - puppet last run on multatuli is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:24:48] !log reloading db1124 mariadb instances to reload and check filters [18:24:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:31] (03CR) 10Thcipriani: [C: 03+1] "Now that gerrit has been upgraded to 2.15.12 and contains fixes for https://bugs.eclipse.org/bugs/show_bug.cgi?id=544199 we should be read" [puppet] - 10https://gerrit.wikimedia.org/r/499289 (owner: 10Paladox) [18:26:59] RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 19, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:31:43] 10Operations, 10WMF-NDA-Requests: Volunteer NDA for Alex Monk - https://phabricator.wikimedia.org/T218448 (10Krenair) 05Open→03Resolved It works, thanks. 
[18:34:02] (03CR) 10Aaron Schulz: [C: 03+1] db-codfw.php: Change parsercache key [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498322 (https://phabricator.wikimedia.org/T210725) (owner: 10Marostegui) [18:35:11] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS1299/IPv4: Connect, AS1299/IPv6: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:35:19] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 73 probes of 401 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [18:39:25] RECOVERY - puppet last run on thorium is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [18:39:47] RECOVERY - puppet last run on bromine is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [18:40:29] RECOVERY - puppet last run on kafka1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [18:40:35] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 14 probes of 401 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [18:43:48] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 10cloud-services-team (Kanban): decommission labvirt101[01].eqiad.wmnet (Dec 2018 lease return) - https://phabricator.wikimedia.org/T210735 (10RobH) Please note these systems still need their SSDs securely erased per https://wikitech.wikimedia.org/wiki... [18:43:52] 10Operations, 10ops-eqiad, 10DBA, 10decommission: Decommission parsercache hosts: pc1004.eqiad.wmnet pc1005.eqiad.wmnet pc1006.eqiad.wmnet - https://phabricator.wikimedia.org/T210969 (10RobH) Please note these systems still need their SSDs securely erased per https://wikitech.wikimedia.org/wiki/Dc-operatio... 
[18:44:32] 10Operations, 10MediaWiki-Cache, 10MW-1.33-notes (1.33.0-wmf.21; 2019-03-12), 10Patch-For-Review, and 3 others: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10elukey) It seems that we found the root cause of... [18:46:44] (03PS1) 10Rush: apache: simplify naming scheme for administrative [puppet] - 10https://gerrit.wikimedia.org/r/499295 [18:47:08] (03CR) 10Mholloway: [C: 04-1] "To memorialize discussion on IRC: we'll create on x1 for testwikidatawiki so that we can test under shared DB conditions, then clear the d" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499227 (https://phabricator.wikimedia.org/T218136) (owner: 10Mholloway) [18:49:04] !log branch 1.33.0-wmf.23 was cut successfully (T206677) [18:49:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:07] T206677: 1.33.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T206677 [18:49:37] (03CR) 10Rush: [V: 03+2 C: 03+2] apache: simplify naming scheme for administrative [puppet] - 10https://gerrit.wikimedia.org/r/499295 (owner: 10Rush) [18:50:47] RECOVERY - puppet last run on multatuli is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:51:27] (03CR) 10Jcrespo: "> clear the description_exists table" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499227 (https://phabricator.wikimedia.org/T218136) (owner: 10Mholloway) [18:51:39] RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 19, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:54:09] (03CR) 10Mholloway: [C: 04-1] "> Let's not ask for too much red tape. We normally ask for a script" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499227 (https://phabricator.wikimedia.org/T218136) (owner: 10Mholloway) [19:00:04] marxarelli: #bothumor Q:How do functions break up? A:They stop calling each other. 
Rise for MediaWiki train - Americas version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190326T1900). [19:03:05] (03CR) 10KartikMistry: ">" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/486454 (https://phabricator.wikimedia.org/T189091) (owner: 10KartikMistry) [19:04:34] !log reloading db1125 mariadb instances to reload and check filters [19:04:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:57] (03PS2) 10Vgutierrez: redirects.dat: Get rid of domains non controlled by WMF [puppet] - 10https://gerrit.wikimedia.org/r/499239 (https://phabricator.wikimedia.org/T213705) [19:06:04] (03PS1) 10Dduvall: group0 to 1.33.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499303 [19:07:50] !log dduvall@deploy1001 clean aborted: Pruned MediaWiki: 1.33.0-wmf.17 (duration: 00m 10s) [19:07:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:05] (03PS1) 10Mforns: Remove all references to Wikimetrics [puppet] - 10https://gerrit.wikimedia.org/r/499304 (https://phabricator.wikimedia.org/T211835) [19:13:09] (03PS4) 10Hashar: scap: add logging to clean > prune-git-branches [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497781 (https://phabricator.wikimedia.org/T218783) [19:13:34] !log reloading db2094 mariadb instances to reload and check filters [19:13:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:47] !log reloading db2095 mariadb instances to reload and check filters [19:17:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:52] !log scap clean failure due to T218783. 
train is rolling without cleanup [19:18:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:55] T218783: `scap clean` failure - https://phabricator.wikimedia.org/T218783 [19:20:00] (03PS3) 10DannyS712: Remove the ability of non-administrators to move category pages on the English Wikipedia. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499042 (https://phabricator.wikimedia.org/T219261) [19:20:19] !log dduvall@deploy1001 Started scap: testwiki to php-1.33.0-wmf.23 and rebuild l10n cache [19:20:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:22] (03PS4) 10DannyS712: Remove the ability of non-administrators to move category pages on the English Wikipedia. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499042 (https://phabricator.wikimedia.org/T219261) [19:21:02] (03CR) 10Jcrespo: [C: 03+2] "Filters have been deployed to all sanitariums by reloading them." [puppet] - 10https://gerrit.wikimedia.org/r/499191 (https://phabricator.wikimedia.org/T218302) (owner: 10Mholloway) [19:23:53] (03Abandoned) 10GTirloni: wmcs: Add .py extension to various scripts [puppet] - 10https://gerrit.wikimedia.org/r/498379 (https://phabricator.wikimedia.org/T144169) (owner: 10GTirloni) [19:25:09] Is there any issue with article import? [19:25:12] (03CR) 10Pppery: [C: 04-1] "Right needs to be granted to bots too" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499042 (https://phabricator.wikimedia.org/T219261) (owner: 10DannyS712) [19:28:07] Seems on logstash, XJp8agpAIC8AACoxLFcAAAAS - will report if persists. [19:42:12] 10Operations, 10MediaWiki-ResourceLoader, 10Performance-Team, 10Traffic, 10Performance-Team-notice: Expires header for load.php should be relative to request time instead of cache time - https://phabricator.wikimedia.org/T105657 (10Krinkle) 05Open→03Resolved #### Startup request rate pattern `name=... 
[19:42:44] (03PS1) 10Smalyshev: Actually load WBCS-Lexeme extension before trying to use it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499309 (https://phabricator.wikimedia.org/T216206) [19:44:13] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/pro [19:45:48] (03CR) 10Jforrester: "Ha." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499309 (https://phabricator.wikimedia.org/T216206) (owner: 10Smalyshev) [19:46:28] (03CR) 10Smalyshev: "> Patch Set 1:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499309 (https://phabricator.wikimedia.org/T216206) (owner: 10Smalyshev) [19:49:00] 10Operations, 10Patch-For-Review: Track remaining trusty servers in production - https://phabricator.wikimedia.org/T212772 (10Andrew) [19:50:05] RECOVERY - Mjolnir bulk update failure check - eqiad on icinga1001 is OK: (C)2 gt (W)1 gt 0 https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates?orgId=1&from=now-7d&to=now&panelId=1&fullscreen [19:53:23] PROBLEM - Mjolnir bulk update failure check - codfw on icinga1001 is CRITICAL: 1.288e+06 gt 2 https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates?orgId=1&from=now-7d&to=now&panelId=1&fullscreen [19:53:42] (03PS10) 10Andrew Bogott: puppet-merge: merge to wmcs puppetmasters as well [puppet] - 10https://gerrit.wikimedia.org/r/497069 [19:54:33] (03PS2) 10Andrew Bogott: cloud puppetmaster: Duplicate some hiera settings from 'main' to 'eqiad1' [puppet] - 10https://gerrit.wikimedia.org/r/499267 [19:55:20] (03CR) 10jerkins-bot: [V: 04-1] puppet-merge: merge to wmcs puppetmasters as well [puppet] - 
10https://gerrit.wikimedia.org/r/497069 (owner: 10Andrew Bogott) [19:55:39] (03CR) 10Andrew Bogott: "after some discussion on IRC we've agreed that we're better off not cluttering up the actual merge script, so I've wound this back a few p" [puppet] - 10https://gerrit.wikimedia.org/r/497069 (owner: 10Andrew Bogott) [19:56:44] (03CR) 10Andrew Bogott: [C: 03+2] cloud puppetmaster: Duplicate some hiera settings from 'main' to 'eqiad1' [puppet] - 10https://gerrit.wikimedia.org/r/499267 (owner: 10Andrew Bogott) [19:58:18] !log dduvall@deploy1001 Finished scap: testwiki to php-1.33.0-wmf.23 and rebuild l10n cache (duration: 37m 59s) [19:58:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:58] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack and cable frav1002.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T213104 (10Jgreen) @cmjohnson would it be possible to get this rolling soon? [20:08:37] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [20:10:11] (03CR) 10Dduvall: [C: 03+2] group0 to 1.33.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499303 (owner: 10Dduvall) [20:11:24] (03Merged) 10jenkins-bot: group0 to 1.33.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499303 (owner: 10Dduvall) [20:12:05] 10Operations, 10Patch-For-Review, 10cloud-services-team (Kanban): Track remaining trusty servers in production - https://phabricator.wikimedia.org/T212772 (10bd808) [20:12:17] (03CR) 10jenkins-bot: group0 to 1.33.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499303 (owner: 10Dduvall) [20:15:33] !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: group0 to 1.33.0-wmf.0 [20:15:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:46] (03PS1) 10Andrew Bogott: Rename labvirt1008 to cloudvirt1008 [puppet] - 
10https://gerrit.wikimedia.org/r/499316 [20:19:47] !log correction: group0 to 1.33.0-wmf.23 [20:19:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:26] (03PS1) 10Andrew Bogott: test patch -- do not merge [puppet] - 10https://gerrit.wikimedia.org/r/499317 [20:23:14] (03CR) 10jerkins-bot: [V: 04-1] test patch -- do not merge [puppet] - 10https://gerrit.wikimedia.org/r/499317 (owner: 10Andrew Bogott) [20:24:14] (03PS2) 10Andrew Bogott: test patch -- do not merge [puppet] - 10https://gerrit.wikimedia.org/r/499317 [20:24:18] (03Abandoned) 10Andrew Bogott: test patch -- do not merge [puppet] - 10https://gerrit.wikimedia.org/r/499208 (owner: 10Andrew Bogott) [20:24:56] (03CR) 10jerkins-bot: [V: 04-1] test patch -- do not merge [puppet] - 10https://gerrit.wikimedia.org/r/499317 (owner: 10Andrew Bogott) [20:29:35] (03PS1) 10Ottomata: eventgate-analytics - default monitoring.enabled to true [deployment-charts] - 10https://gerrit.wikimedia.org/r/499320 [20:29:51] (03PS3) 10Andrew Bogott: test patch -- do not merge [puppet] - 10https://gerrit.wikimedia.org/r/499317 [20:30:51] (03CR) 10jerkins-bot: [V: 04-1] test patch -- do not merge [puppet] - 10https://gerrit.wikimedia.org/r/499317 (owner: 10Andrew Bogott) [20:35:47] (03PS4) 10Andrew Bogott: test patch -- do not merge [puppet] - 10https://gerrit.wikimedia.org/r/499317 [20:36:48] (03CR) 10jerkins-bot: [V: 04-1] test patch -- do not merge [puppet] - 10https://gerrit.wikimedia.org/r/499317 (owner: 10Andrew Bogott) [20:44:21] PROBLEM - toolschecker: check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/toolscron - 185 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/Help:Toolforge/Monitoring [20:46:51] PROBLEM - toolschecker: Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE 
UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 0.380 second response time https://wikitech.wikimedia.org/wiki/Help:Toolforge/Monitoring [21:00:05] mdholloway: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Enable WikimediaEditorTasks on testwikidatawiki deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190326T2100). [21:00:37] i'll wait on this until the train deploy finishes, of course ^ [21:00:39] !log depooled wdqs1006 to see if it'd catch up better [21:00:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:38] er, wait, maybe i misread earlier [21:01:41] marxarelli: train done? [21:02:12] yes! [21:02:34] all done. smooth rollout today [21:02:50] awesome, thanks! [21:02:54] np [21:04:28] 10Operations, 10serviceops, 10Core Platform Team Backlog (Watching / External), 10Services (watching), 10User-ArielGlenn: Switch cronjobs on maintenance hosts to PHP7 - https://phabricator.wikimedia.org/T195392 (10mobrovac) [21:05:46] PROBLEM - ElasticSearch health check for frozen writes - 9643 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - elasticsearch / cirrus frozen writes: 3:08:56.913049 https://wikitech.wikimedia.org/wiki/Search%23Pausing_Indexing [21:06:20] mmmh, there was an ES6 upgrade today, gehel, onimisionipe expected or issue? 
[21:07:05] RECOVERY - ElasticSearch health check for frozen writes - 9643 on search.svc.codfw.wmnet is OK: OK - elasticsearch / cirrus frozen writes: no freeze https://wikitech.wikimedia.org/wiki/Search%23Pausing_Indexing [21:07:42] * apergos peeks in [21:15:15] I don't think the command given in https://wikitech.wikimedia.org/wiki/Search#Monitoring_the_job_queue works [21:15:24] unless mwmaint is not the place I should be running that [21:15:37] s/mwmaint/&1002/ [21:18:17] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [21:18:33] <_joe_> cdanis: no, that shouldn't really work anymore [21:18:57] it is linked to by what the alert links to :) [21:19:16] <_joe_> I know [21:19:31] <_joe_> if it wasn't past 10 pm, I would've maybe fixed that [21:19:33] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [21:19:45] <_joe_> you can't use that command with the "new" kafka queue [21:19:48] I'd file a task to fix it but I am not even sure what it should say [21:20:50] (03CR) 10Bartosz Dziewoński: [C: 03+1] Wikimaniawiki: Enable visual editor in 2019 namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497682 (https://phabricator.wikimedia.org/T218645) (owner: 10Ammarpad) [21:21:39] <_joe_> cdanis: ftr, where you should look is https://grafana.wikimedia.org/d/000000400/jobqueue-eventbus?orgId=1 [21:21:47] <_joe_> and look at cirrus jobs [21:21:59] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/pro [21:22:06] !log 
created new db tables for WikimediaEditorTasks in x1 [21:22:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:16] _joe_: the dangerous thing about indexing being 'frozen' is too large a kafka queue, is that correct? [21:22:30] PROBLEM - ElasticSearch health check for frozen writes - 9643 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - elasticsearch / cirrus frozen writes: 3:25:40.173948 https://wikitech.wikimedia.org/wiki/Search%23Pausing_Indexing [21:23:12] PROBLEM - ElasticSearch health check for frozen writes - 9243 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - elasticsearch / cirrus frozen writes: 3:26:22.106952 https://wikitech.wikimedia.org/wiki/Search%23Pausing_Indexing [21:23:43] I'm here too, looking [21:23:43] <_joe_> cdanis: no it's stale search results first [21:23:49] I'm asking people in -discovery [21:23:50] RECOVERY - ElasticSearch health check for frozen writes - 9643 on search.svc.codfw.wmnet is OK: OK - elasticsearch / cirrus frozen writes: no freeze https://wikitech.wikimedia.org/wiki/Search%23Pausing_Indexing [21:23:55] (03PS2) 10Mholloway: Enable WikimediaEditorTasks on testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499227 (https://phabricator.wikimedia.org/T218136) [21:24:10] volans: thanks [21:24:32] RECOVERY - ElasticSearch health check for frozen writes - 9243 on search.svc.codfw.wmnet is OK: OK - elasticsearch / cirrus frozen writes: no freeze https://wikitech.wikimedia.org/wiki/Search%23Pausing_Indexing [21:25:08] I don't understand why it flaps between frozen for 3:26:22.106952 and no freeze [21:25:33] <_joe_> I don't think that number means what you think [21:25:44] no luck so far in -discovery [21:25:45] <_joe_> also I have no idea what the check actually checks [21:25:53] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [21:25:53] (03CR) 10Smalyshev: [C: 03+1] add Icinga 
notes_url to various NRPE monitor checks, pt 2 [puppet] - 10https://gerrit.wikimedia.org/r/499148 (owner: 10Dzahn) [21:26:03] <_joe_> ok time to call people on the phone I guess [21:26:25] looking [21:26:27] got erik [21:26:44] check_cirrus_frozen_writes.py [21:26:48] (03CR) 10Mholloway: [C: 03+2] Enable WikimediaEditorTasks on testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499227 (https://phabricator.wikimedia.org/T218136) (owner: 10Mholloway) [21:27:10] backlog for elasticawrite looks like it started to climb at around 18:00 https://grafana.wikimedia.org/d/000000400/jobqueue-eventbus?orgId=1&from=1553593294506&to=1553635476797&var-site=eqiad&var-type=All [21:27:59] the code there looks like it is structured as volans says [21:28:27] (03Merged) 10jenkins-bot: Enable WikimediaEditorTasks on testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499227 (https://phabricator.wikimedia.org/T218136) (owner: 10Mholloway) [21:28:28] hmm, very possibly related to 10k pages i started reindexing for commonswiki, stopped that [21:28:51] but not the freeze part, if that was frozen it's unexpected [21:28:57] but i'm not finding it frozen... [21:29:30] https://grafana.wikimedia.org/d/000000400/jobqueue-eventbus?orgId=1&from=now-7d&to=now&panelId=5&fullscreen&var-site=eqiad&var-type=All [21:29:38] yeah it seems it's flapping between frozen (with that number above that seems to indicate 3h and a bit) and no freeze at all [21:29:39] cirrusSearchCheckerJob elevated over the past week [21:29:49] cirrusSearchElasticaWrite elevated today [21:29:57] <_joe_> cdanis: that's normal (SearchChecker) [21:30:39] ok, 9643 is frozen but shouldn't be, [21:30:49] having solr flashbacks looking at these graphs [21:31:20] checker job and elasticawrite go together, checker reindexes 1/8 of all pages every week, and a new two week loop started a couple days ago. 
completely normal [21:31:37] I see the es6 upgrade cookbook failed at 18:05, maybe it operated on that instance and didn't thaw? https://tools.wmflabs.org/sal/log/AWm7LYk8Im9Dp5A3mCLN [21:31:48] !log mholloway-shell@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable WikimediaEditorTasks on testwikidatawiki (duration: 00m 57s) [21:31:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:12] !log manually thaw search.svc.codfw.wmnet:9643 [21:32:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:23] <_joe_> godog: seems to make sense [21:32:35] yea, sounds like the cookbook didn't manage to clean up properly [21:33:16] not sure why that alert would flap though, it should have stayed constantly on [21:33:31] a theory heh, I'm reading the cookbook [21:33:35] ebernhardson: from logs it failed with [21:33:36] spicerack.elasticsearch_cluster.ElasticsearchClusterError: Encountered error while deleting document to unfreeze cluster writes [21:33:43] heh [21:33:50] elasticsearch.exceptions.NotFoundError: TransportError(404, '{"_index":"mw_cirrus_metastore_1527120632","_type":"mw_cirrus_metastore","_id":"freeze-everything","_version":1,"result":"not_found","_shards":{"total":3,"successful":3,"failed":0},"_seq_no":0,"_primary_term":20}') [21:34:11] not found? thats surprising since we found it still frozen ... 
[21:34:21] if they freeze/unfreeze very quickly i suppose that could happen [21:34:38] (es only flushes writes every 30s) [21:34:45] in the same stacktrace there was also [21:34:45] elasticsearch.exceptions.NotFoundError: TransportError(404, '{"_index":"mw_cirrus_metastore_first","_type":"mw_cirrus_metastore","_id":"freeze-everything [21:34:48] ","_version":1,"result":"not_found","_shards":{"total":3,"successful":3,"failed":0},"_seq_no":0,"_primary_term":4}') [21:35:00] note the different _index [21:35:30] volans: that's expected, everything we do is against an alias, but when elasticsearch responds it gives the real name behind the scenes [21:36:00] volans: oh, you mean not the same name ... hmm. The six clusters can all have different names behind the scenes [21:36:12] those logs need to record what cluster they were talking to ... [21:37:13] 9443 and 9643 in codfw have mw_cirrus_metastore_first, 9243 has mw_cirrus_metastore_1527120632 [21:37:36] from a bit above I guess search.svc.codfw.wmnet, 9243, 9443, 9643 [21:37:54] eh [21:38:03] which one of those, good question [21:38:03] i'll document all this and make a ticket, since it's unfrozen now across all the clusters there isn't anything in particular to worry about, it's more just fixing the scripts to work properly [21:39:17] to understand the alert better -- it queries the cluster and asks how long it has been frozen for? are there any alerts around how backlogged kafka thinks the queue is? [21:40:01] as for the docs on monitoring the job queue ... 
that's for the old redis job queue :( the new job queue doesn't seem to have reimplemented that functionality [21:40:11] 10Operations, 10serviceops, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), 10Core Platform Team Kanban (Doing), and 3 others: Enabling api-request eventgate to group1 caused minor service disruptions - https://phabricator.wikimedia.org/T218255 (10akosiaris) [21:40:21] 10Operations, 10Analytics, 10Analytics-Kanban, 10EventBus, and 3 others: eventgate-analytics k8s pods occasionally can't produce to kafka - https://phabricator.wikimedia.org/T218268 (10akosiaris) 05Open→03Resolved Resolving, feel free to reopen [21:40:57] cdanis: for the query, it asks elasticsearch for a specially named document that, if it exists, means nothing should write to the cluster (applications/mediawiki must check and respect this). That document contains a timestamp that says when the freeze started [21:41:27] ebernhardson: a bunch of elastic hosts remained with puppet disabled [21:41:33] but I'm not sure if it's ok to re-enable it [21:41:44] cdanis: so the check looks at that timestamp and alerts if it's been longer than some predefined limit, to catch instances like this where it wasn't thaw'd. 
When frozen the job queue of writes simply backs up [21:42:00] okay, I think I got most of that [21:42:15] the cookbook did re-enable it on some host (from the logs) but probably didn't finish because of the exception [21:42:39] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [21:42:41] volans: a major version upgrade is in progress, sadly i'm not sure where exactly gehel left it so i cant say if puppet can be restarted [21:43:19] yeah I know, from debmonitor all hosts have 6.5.4 for elasticsearch-oss [21:43:29] fwiw, but I dunno the current state [21:43:39] yeah the backlog of cirrusSearchElasticaWrite seems to be growing slower now but still growing [21:43:46] (03CR) 10jenkins-bot: Enable WikimediaEditorTasks on testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499227 (https://phabricator.wikimedia.org/T218136) (owner: 10Mholloway) [21:43:48] can you get a running elasticsearch instance to report its version somehow? [21:44:01] cdanis: query the http root [21:44:32] cdanis: like http://elastic2040.codfw.wmnet:9200/ [21:45:13] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [21:46:10] ebernhardson: elastic[2031,2033-2036,2045,2047,2049-2054].codfw.wmnet have "number" : "5.6.14", [21:46:11] sadly elasticsearch-exporter doesn't seem to be exporting the version as a metric, afaics [21:46:13] https://phabricator.wikimedia.org/P8277 [21:46:16] all the rest 6.5.4 [21:46:32] cdanis: lol [21:47:05] yep confirmed the list of 5.6.14 matches the disabled puppet [21:47:10] so I will not touch them [21:47:14] (03PS5) 10DannyS712: Restrict the ability to move category pages on the English Wikipedia to administrators, page movers ('extendedmover') and bots. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/499042 (https://phabricator.wikimedia.org/T219261) [21:47:21] ok [21:47:59] !log depool wdq2003 to catch it up [21:48:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:42] so it sounds like -- search in a state where gehel needs to clean it up, but not at risk as a service right now? [21:48:55] godog: where do you see the job queue backlog? [21:49:11] ebernhardson: I'm looking at https://grafana.wikimedia.org/d/000000400/jobqueue-eventbus?orgId=1&panelId=15&fullscreen&from=1553615340716&to=1553636940716&var-site=eqiad&var-type=All [21:49:55] PROBLEM - Request latencies on acrux is CRITICAL: instance=10.192.0.93:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [21:50:23] PROBLEM - Request latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [21:50:24] godog: got it. The size of that backlog is a little surprising for me since this was only frozen on 9643 which is a cluster of tiny wikis (~20M docs across all wikis on that cluster). 4 hours shouldn't have had that many updates ... will look into that too [21:50:59] we might still have problems with how the retries there work [21:51:01] ebernhardson: ack, thanks! [21:51:08] damn, I'm late to the party [21:51:13] RECOVERY - Request latencies on acrux is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [21:51:30] gehel: tl/dr, codfw:9643 didn't unfreeze properly and eventually alerted [21:51:39] RECOVERY - Request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [21:51:51] ebernhardson: did you unfreeze it? [21:51:55] gehel: yes [21:52:04] thanks! 
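The frozen-writes check explained at 21:40:57 and 21:41:44 can be sketched roughly as below. This is a minimal illustration only, not the real check_cirrus_frozen_writes.py: the function name, the `timestamp` field name (assumed here to be a Unix epoch value), and the one-hour limit are all assumptions made for the example, since the log states none of them.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical threshold; the real limit is not stated in the log.
MAX_FREEZE = timedelta(hours=1)


def check_frozen_writes(freeze_doc: Optional[dict], now: datetime,
                        max_freeze: timedelta = MAX_FREEZE) -> tuple:
    """Return (status, message) for a freeze marker document.

    freeze_doc is None when the marker document does not exist,
    which means writes are not frozen. Otherwise it is assumed to
    carry a Unix epoch 'timestamp' field recording when the freeze
    started (the field name is an assumption for this sketch).
    """
    if freeze_doc is None:
        return ("OK", "no freeze")
    started = datetime.fromtimestamp(freeze_doc["timestamp"], tz=timezone.utc)
    age = now - started
    if age > max_freeze:
        # Frozen for longer than the limit: somebody forgot to thaw.
        return ("CRITICAL", f"frozen writes: {age}")
    return ("OK", f"frozen for {age}, within limit")


if __name__ == "__main__":
    now = datetime(2019, 3, 26, 21, 22, 30, tzinfo=timezone.utc)
    stale = {"timestamp": (now - timedelta(hours=3, minutes=25, seconds=40)).timestamp()}
    print(check_frozen_writes(stale, now))
    print(check_frozen_writes(None, now))
```

If the marker document is visible on some backends and missing on others, a check of this shape would alternate between the two branches from one run to the next, which would be consistent with the flapping seen in the alerts.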
[21:52:25] i didn't notice the alert for about a half hour though, which worried some people :) [21:52:36] I did check it before leaving, but my curl was probably wrong :( [21:52:45] PROBLEM - ElasticSearch health check for frozen writes - 9243 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - elasticsearch / cirrus frozen writes: 3:55:55.001598 https://wikitech.wikimedia.org/wiki/Search%23Pausing_Indexing [21:52:49] uhh [21:52:55] hello again [21:53:22] oops, looks like 9643 wasnt alone [21:53:31] i checked all the clusters :S [21:53:57] i can't read ... 9243 was frozen too :S [21:54:02] RECOVERY - ElasticSearch health check for frozen writes - 9243 on search.svc.codfw.wmnet is OK: OK - elasticsearch / cirrus frozen writes: no freeze https://wikitech.wikimedia.org/wiki/Search%23Pausing_Indexing [21:54:22] gehel: actually, it should have unfroze. I ran: for port in 9243 9443 9643; do curl -s -XDELETE https://search.svc.codfw.wmnet:$port/mw_cirrus_metastore/mw_cirrus_metastore/freeze-everything | jq .; done [21:54:29] what wikis does 9243 back? [21:54:37] cdanis: 9243 is all the big wikis [21:55:16] cdanis: basically each cluster has 300 wikis, 9443 and 9643 are tiny clusters that hold the smallest 600 wikis to keep all that extra state out of the primary cluster (9243) [21:55:28] sure sure [21:55:48] (03CR) 10Pppery: [C: 03+1] Restrict the ability to move category pages on the English Wikipedia to administrators, page movers ('extendedmover') and bots. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499042 (https://phabricator.wikimedia.org/T219261) (owner: 10DannyS712) [21:56:04] so the flapping ... is something re-freezing? [21:56:19] not that I know, the cookbook is not running atm [21:56:53] incoherent replicas and different answer depending on the node? [21:57:36] so that service IP goes to multiple backends yes? [21:57:43] maybe something weird with the mixed major version cluster ... 
i see 9643 frozen doc back again [21:57:43] cdanis: yes [21:57:46] and it has the old timestamp [21:57:53] I just got frozen from one of the 9643 backends [21:57:55] "reason": "ES6 upgrade - gehel@cumin2001 - T218878", [21:57:56] T218878: Upgrade to elasticsearch 6.5.4 for cirrus / codfw - https://phabricator.wikimedia.org/T218878 [21:58:03] and from one of the 9243 [21:58:11] ebernhardson: that would be bad! [21:58:14] how is that frozen document replicated between servers? [21:58:39] cdanis: elasticsearch is managing the replication [21:58:57] *is* it? :D [21:59:19] s/is/should be/ ;) [22:00:00] elastic 6 does change replication a bit, but primarily it is adding transaction id's so it can do replays instead of full index copies [22:00:27] for funsies, the next version of elastic completely replaces the replication protocol with a formally proven thing. should be even more fun :P [22:00:34] heh heh [22:01:08] nothing obvious in the logs [22:01:48] PROBLEM - ElasticSearch health check for frozen writes - 9243 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - elasticsearch / cirrus frozen writes: 4:04:59.015935 https://wikitech.wikimedia.org/wiki/Search%23Pausing_Indexing [22:01:51] Operations, cloud-services-team: jessie-updates and jessie-backports removed by Debian - https://phabricator.wikimedia.org/T219333 (Volans) [22:02:00] Operations, cloud-services-team: jessie-updates and jessie-backports removed by Debian - https://phabricator.wikimedia.org/T219333 (Volans) p:Triage→High [22:02:05] clearly it is not replicating this document correctly [22:02:06] so, probably we are going to lose some updates [22:02:34] worst case, we have some data corruption on cirrus_metastore, which could mean we also have corruption on other indices [22:02:48] Operations, cloud-services-team: apt-get update broken on jessie: jessie-updates and jessie-backports removed by Debian - https://phabricator.wikimedia.org/T219333 (Volans) [22:02:55] gehel: at some point 
in the backlog there were hosts in codfw still running 5.6 btw, expected? the ones with puppet stopped [22:03:02] (CR) Jbond: Add prometheus interface to spicerack (4 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/496496 (owner: Jbond) [22:03:06] RECOVERY - ElasticSearch health check for frozen writes - 9243 on search.svc.codfw.wmnet is OK: OK - elasticsearch / cirrus frozen writes: no freeze https://wikitech.wikimedia.org/wiki/Search%23Pausing_Indexing [22:03:07] gehel: not a big deal, i can run a reindex over the 12 hours or whatever, and the checker will over time ensure everything is sane [22:03:17] the question is how this darn freeze doc keeps coming back ... [22:03:23] godog: yep, expected, the cluster is currently in a mixed state, still 12 nodes to go [22:03:39] any objections to me sending that DELETE on the special freeze document to all elastic nodes? [22:03:49] at the end of the day, this cluster is not serving user traffic right now. We could probably silence the alert, [22:03:51] cdanis: please do [22:04:03] cdanis: you can, but all nodes will re-route it to the single master for the index [22:04:13] ahh [22:04:28] ebernhardson: how does that work for nodes that have a replica? [22:04:37] gehel: replicates from the master to the replica [22:04:48] gehel: ack, thanks! [22:05:09] gehel: so all update requests get routed over the internode transport to the master of the appropriate index shard, and then the index shard sends it out to the replicas [22:05:25] so, should I restart the upgrade to reduce the time in mixed state, or wait until tomorrow so we investigate more [22:05:51] ebernhardson: no shortcut when the client node is also hosting a replica? [22:05:56] gehel: the problem is going to be that job queue is backing up for 9243 [22:06:08] I think the big concern now is that the job queue is still rising [22:06:10] yeah, that [22:06:10] gehel: if it was only 9643 it wouldn't matter ... 
i could leave a tmux deleting the doc from 9243 regularly [22:06:31] 12 nodes to upgrade are still going to take some time [22:06:33] gehel: or really, you should so you can kill it tomorrow. You'd be up all night to finish the upgrade [22:07:05] gehel: while true; do sleep 30; for port in 9243 9443 9643; do curl -s -XDELETE https://search.svc.codfw.wmnet:$port/mw_cirrus_metastore/mw_cirrus_metastore/freeze-everything; done; done [22:07:22] i dunno if thats great ... but it will let you sleep i hope :) [22:07:26] k:) [22:07:32] sleeping is good :) [22:07:52] we will just reindex over the known bad time period once upgrade is done [22:07:57] * gehel is wondering if re-creating that doc would help [22:08:06] gehel: couldn't hurt [22:08:13] but updates are actually append, so probably won't make any difference [22:08:36] ebernhardson: you happen to have that curl somewhere? [22:09:03] * gehel is extracting it from a cookbook [22:10:20] gehel: yea pull from cookbook, or mwscript extensions/CirrusSearch/maintenance/freezeWritesToCluster.php, have to choose three appropriate wikis one for each cluster [22:10:32] gehel: ping if I can help [22:10:39] it would be neat if the http replies from the service IP had a response header indicating which backing server they came from [22:11:27] lemme see if i can recreate the metastore index maybe ... our scripts might not want to though since nothing about the schema changed [22:12:10] yea it doesn't want to, it says the metastore schema is correct [22:12:48] !log freezing and unfreezing writes to elasticsearch codfw [22:12:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:13:30] sorry, just catching up, the frozen doc cannot be deleted? 
[22:13:38] dcausse: we delete it, and it comes back [22:13:59] Operations, cloud-services-team (Kanban): apt-get update broken on jessie: jessie-updates and jessie-backports removed by Debian - https://phabricator.wikimedia.org/T219333 (bd808) [22:14:02] the cookbook is stopped? [22:14:05] it's an immortal doc [22:14:18] dcausse: yep, cookbook stopped [22:14:33] it's also coming back with the old timestamp, which suggests the cookbook isn't re-running [22:14:56] I've re-created with a new timestamp [22:15:02] so we'll see if this changes anything [22:15:26] so far looks good [22:16:16] I have to go offline now, apologies but not sure I was being much help anyway. good luck [22:16:34] https://www.youtube.com/watch?v=5UT8RkSmN4k [22:16:39] cdanis: thanks! [22:16:42] i'll leave a watch running in a shell on screen to fetch the freeze from all codfw clusters, hopefully notice if it returns [22:18:16] I'm going as well! [22:18:24] we still need to understand how that failed [22:19:32] indeed, deleted things returning in a database could be a pretty major problem if it's not a mixed-major version issue [22:21:58] even if it is a mixed-major version issue, we might review our upgrade process [22:22:52] yea, the option to do a major version upgrade without shutting everything down seemed much better than the alternative, but only if it actually works.... [22:23:13] and testing that is going to not be easy! [22:25:37] perhaps related? 
https://github.com/elastic/elasticsearch/issues/31976 [22:29:38] dcausse: hmm, claims the fix was shipped in 6.3.0, [22:29:56] but maybe 5.x has the bug [22:30:08] looking at the extended logs of the cookbook, it looks like the delete failed with a 404 [22:30:26] yeah that's what I reported earlier [22:30:41] but had too little context to make sense of it [22:30:45] * gehel did not read backlog :( [22:32:05] no worries, you were too busy looking at things [22:32:41] I'm not going to solve this tonight, I need some sleep and things seem to be under control [22:32:48] ping / ring me if needed! [22:32:51] +1 [22:51:41] ebernhardson, hi, you in the middle of something? [22:59:02] Krenair: i have a minute [22:59:24] ebernhardson, I was wondering what you knew about this: [22:59:27] hieradata/role/common/elasticsearch/relforge.yaml:profile::elasticsearch::cirrus::ferm_srange: '(($CUMIN_MASTERS $LABS_NETWORKS @resolve((contint1001.wikimedia.org contint2001.wikimedia.org))))' [22:59:41] interesting selection of hosts allowed [22:59:55] Krenair: relforge is a testing platform that lives between prod and cloud [22:59:59] sure [23:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I, the Bot under the Fountain, allow thee, The Deployer, to do Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190326T2300). [23:00:05] SMalyshev: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:08] I was particularly interested in CUMIN_MASTERS [23:00:18] Krenair: cumin is for SRE, labs networks is for standard usage, contint is because releng was testing sending logs to elasticsearch [23:00:35] sure but why would SRE be connecting from cumin masters to the elasticsearch HTTP/HTTPS ports? 
[23:00:57] Krenair: it's for the cookbooks to allow to query ES IIRC [23:01:00] ah [23:01:26] and perform automated admin actions [23:01:50] * Krenair should've checked blame first [23:01:53] to be sure check the git blame and should have been added by ge.hel not too long ago [23:01:55] contint might be able to go away, i know has.har was testing things but not sure it went anywhere, or maybe its 10% time that hasn't been available [23:02:24] yeah https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/485092/ [23:02:26] ok [23:03:28] here [23:04:15] PROBLEM - Mjolnir bulk update failure check - codfw on icinga1001 is CRITICAL: 2.406e+06 gt 2 https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates?orgId=1&from=now-7d&to=now&panelId=1&fullscreen [23:04:17] thanks ebernhardson volans :) [23:04:28] yw :) [23:07:48] anybody for SWAT? [23:08:40] SMalyshev: i can ship if no one is around i guess [23:09:09] one patch, easy :) [23:09:18] (PS2) EBernhardson: Actually load WBCS-Lexeme extension before trying to use it [mediawiki-config] - https://gerrit.wikimedia.org/r/499309 (https://phabricator.wikimedia.org/T216206) (owner: Smalyshev) [23:09:34] (CR) EBernhardson: [C: +2] Actually load WBCS-Lexeme extension before trying to use it [mediawiki-config] - https://gerrit.wikimedia.org/r/499309 (https://phabricator.wikimedia.org/T216206) (owner: Smalyshev) [23:09:43] and labs only, my preferred SWAT patches :) [23:09:44] ebernhardson: thanks! 
[23:10:42] (Merged) jenkins-bot: Actually load WBCS-Lexeme extension before trying to use it [mediawiki-config] - https://gerrit.wikimedia.org/r/499309 (https://phabricator.wikimedia.org/T216206) (owner: Smalyshev) [23:11:01] (CR) jenkins-bot: Actually load WBCS-Lexeme extension before trying to use it [mediawiki-config] - https://gerrit.wikimedia.org/r/499309 (https://phabricator.wikimedia.org/T216206) (owner: Smalyshev) [23:11:49] SMalyshev: since labs only syncing directly [23:11:56] (i mean not pulling to mwdebug) [23:12:32] ebernhardson: yeah I'll wait until labs sync [23:12:39] !log ebernhardson@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: T216206 : sync noop labs config: Actually load WBCS-Lexeme extension before trying to use it (duration: 00m 57s) [23:12:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:12:42] T216206: Set up WikibaseLexemeCirrusSearch extension for Elastic code in WikibaseLexeme - https://phabricator.wikimedia.org/T216206 [23:13:10] (CR) Tim Starling: [C: +1] "Looks pretty straightforward. Feel free to self-merge and deploy." [mediawiki-config] - https://gerrit.wikimedia.org/r/499011 (https://phabricator.wikimedia.org/T176916) (owner: Krinkle) [23:13:12] huh, canary check failed and it still synced: 23:12:29 Check 'Logstash Error rate for mw1263.eqiad.wmnet' failed: ERROR: 50% OVER_THRESHOLD (Avg. Error rate: Before: 0.05, After: 2.00, Threshold: 1.00) [23:13:36] i mean, in the past those were always false positives (going from 0 to 1 errors in time period) [23:14:05] oh, i should read more: 23:12:29 Canary error check failed for 1 canaries, less than threshold to halt deployment (2/11), see https://logstash.wikimedia.org/goto/db09a36be5ed3e81155041f7d46ad040 for details. Continuing... [23:14:49] yea those error messages look fine, they are "regular" fatals [23:17:07] (CR) Volans: "As requested I've done only a quick general pass, see my comments inline." 
(3 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/496496 (owner: Jbond) [23:18:02] Operations, cloud-services-team (Kanban): apt-get update broken on jessie: jessie-updates and jessie-backports removed by Debian - https://phabricator.wikimedia.org/T219333 (cwdent) [23:23:22] (CR) Volans: "Most comments were made together with Cas in a joint CR session, reporting them here as notes mostly." (20 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/499032 (owner: CRusnov) [23:26:28] (PS1) Alex Monk: [WIP] Move cumin_masters out of network::constants into hieradata [puppet] - https://gerrit.wikimedia.org/r/499355 [23:28:36] (CR) jerkins-bot: [V: -1] [WIP] Move cumin_masters out of network::constants into hieradata [puppet] - https://gerrit.wikimedia.org/r/499355 (owner: Alex Monk) [23:31:36] (PS2) Alex Monk: [WIP] Move cumin_masters out of network::constants into hieradata [puppet] - https://gerrit.wikimedia.org/r/499355 [23:36:20] (PS3) Alex Monk: Move cumin_masters out of network::constants into hieradata [puppet] - https://gerrit.wikimedia.org/r/499355 [23:37:19] !log repooled wdqs2003 [23:37:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:59:12] Operations, ops-codfw, decommission, Patch-For-Review, cloud-services-team (Kanban): decommission: cloudnet2001-dev.codfw.wmnet - https://phabricator.wikimedia.org/T218025 (ops-monitoring-bot) wmf-decommission-host was executed by robh for cloudnet2001-dev.codfw.wmnet and performed the follow...