[00:00:05] twentyafterfour: How many deployers does it take to do Phabricator update deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190321T0000).
[00:01:11] twentyafterfour: I'm just finishing up a MW core merge and deploy but please go ahead.
[00:01:47] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: SWAT Move FundraisingTranslateWorkflow load to after Translate I73452ae8 (duration: 00m 56s)
[00:01:51] jforrester@deploy1001: Failed to log message to wiki. Somebody should check the error logs.
[00:02:01] James_F: it's going to take me just a minute anyway
[00:02:10] thanks!
[00:02:20] * James_F glares at jenkins as if that'll make it go faster.;-)
[00:05:51] (PS1) Herron: gerrit: enable apache modsec ipaddress ban list [puppet] - https://gerrit.wikimedia.org/r/497958
[00:08:07] !log deploying phabricator upgrade
[00:08:07] twentyafterfour: Failed to log message to wiki. Somebody should check the error logs.
[00:08:17] stashbot: :P
[00:08:17] See https://wikitech.wikimedia.org/wiki/Tool:Stashbot for help.
[00:08:40] * James_F grins.
[00:09:27] !log jforrester@deploy1001 Synchronized php-1.33.0-wmf.22/includes/parser/BlockLevelPass.php: SAT T218817 Unbreak parser line counting for long wikitext pages I22eebb70a I55a2c4c0 I41a45266d (duration: 00m 56s)
[00:09:30] jforrester@deploy1001: Failed to log message to wiki. Somebody should check the error logs.
[00:09:54] Oops, typo in 'SWAT'.
[00:09:59] Still, done now.
[00:10:49] James_F lol
[00:12:45] OK, SWAT (very) over, I'm done.
[00:19:15] PROBLEM - Request latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[00:20:11] phab upgrade seems to have been uneventful
[00:20:15] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 278 bytes in 5.366 second response time https://phabricator.wikimedia.org/T174916
[00:20:21] PROBLEM - Request latencies on acrux is CRITICAL: instance=10.192.0.93:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[00:20:21] PROBLEM - Request latencies on acrab is CRITICAL: instance=10.192.16.26:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[00:20:34] not sure about these alerts though
[00:20:37] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[00:21:34] James_F: could that be related to what you deployed?
[00:21:51] RECOVERY - Request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[00:21:53] RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[00:21:53] (CR) Paladox: [C: +1] gerrit: enable apache modsec ipaddress ban list (1 comment) [puppet] - https://gerrit.wikimedia.org/r/497958 (owner: Herron)
[00:22:17] hmm seems to have returned to normal either way
[00:22:55] RECOVERY - Request latencies on acrux is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[00:22:55] RECOVERY - Request latencies on acrab is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[00:23:39] (CR) GTirloni: [C: +1] "Just a heads up that the upcoming Toolforge Trusty deprecation deadline is close.
We'll be shutting down the old cluster on Mar 25 (https:" [puppet] - https://gerrit.wikimedia.org/r/482118 (https://phabricator.wikimedia.org/T194724) (owner: Dzahn)
[00:24:13] PROBLEM - pdfrender on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916
[00:29:08] (CR) GTirloni: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/15238/" [puppet] - https://gerrit.wikimedia.org/r/497762 (https://phabricator.wikimedia.org/T217086) (owner: GTirloni)
[00:29:18] (PS2) GTirloni: labstore: Increase nfs-exportd interval from 60 to 300s [puppet] - https://gerrit.wikimedia.org/r/497762 (https://phabricator.wikimedia.org/T217086)
[00:33:53] PROBLEM - MariaDB Slave Lag: m3 on db2042 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 318.39 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[00:34:33] PROBLEM - MariaDB Slave Lag: m3 on db2078 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 319.25 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[00:35:55] (PS1) Herron: move to apache administrative [puppet] - https://gerrit.wikimedia.org/r/497961
[00:39:33] (PS1) Herron: move to administrative [labs/private] - https://gerrit.wikimedia.org/r/497963
[00:39:47] (CR) Herron: [V: +2 C: +2] move to administrative [labs/private] - https://gerrit.wikimedia.org/r/497963 (owner: Herron)
[00:42:27] (PS2) Herron: move to apache administrative [puppet] - https://gerrit.wikimedia.org/r/497961
[00:42:49] RECOVERY - MariaDB Slave Lag: m3 on db2042 is OK: OK slave_sql_lag Replication lag: 50.64 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[00:43:29] RECOVERY - MariaDB Slave Lag: m3 on db2078 is OK: OK slave_sql_lag Replication lag: 0.51 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[00:45:12] (CR) Herron: [C: +2] move 
to apache administrative [puppet] - https://gerrit.wikimedia.org/r/497961 (owner: Herron)
[00:54:46] (PS3) GTirloni: profile::base::labs - Ability to disable Puppet failure emails [puppet] - https://gerrit.wikimedia.org/r/495670 (https://phabricator.wikimedia.org/T218009)
[00:57:51] (CR) GTirloni: "This has been done manually and seems to be fine. Adding reviewers just in case someone spots something odd. Thanks :)" [puppet] - https://gerrit.wikimedia.org/r/496130 (https://phabricator.wikimedia.org/T218185) (owner: GTirloni)
[00:58:49] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production).
[01:00:09] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 3.789 second response time https://phabricator.wikimedia.org/T174916
[01:01:55] (Abandoned) GTirloni: labsdb: Increase net_{read,write}_timeout [puppet] - https://gerrit.wikimedia.org/r/477137 (https://phabricator.wikimedia.org/T184126) (owner: GTirloni)
[01:02:39] (Abandoned) GTirloni: git-sync-upstream: Send cron mail in case of failures [puppet] - https://gerrit.wikimedia.org/r/468865 (https://phabricator.wikimedia.org/T184261) (owner: GTirloni)
[01:03:38] (PS3) GTirloni: toollabs::k8s::worker - Allow prometheus to access read-only metrics port 10255 [puppet] - https://gerrit.wikimedia.org/r/486142 (https://phabricator.wikimedia.org/T214512)
[01:04:09] PROBLEM - pdfrender on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916
[01:04:33] (CR) jerkins-bot: [V: -1] toollabs::k8s::worker - Allow prometheus to access read-only metrics port 10255 [puppet] - https://gerrit.wikimedia.org/r/486142 (https://phabricator.wikimedia.org/T214512) (owner: GTirloni)
[01:06:37] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge.
[01:12:46] (PS1) Herron: add notify service apache2 on change to administrative file [puppet] - https://gerrit.wikimedia.org/r/497965
[01:13:37] (PS4) GTirloni: toollabs::k8s::worker - Allow prometheus to access read-only metrics port 10255 [puppet] - https://gerrit.wikimedia.org/r/486142 (https://phabricator.wikimedia.org/T214512)
[01:13:40] (CR) jerkins-bot: [V: -1] add notify service apache2 on change to administrative file [puppet] - https://gerrit.wikimedia.org/r/497965 (owner: Herron)
[01:15:13] (PS2) Herron: add notify service apache2 on change to administrative file [puppet] - https://gerrit.wikimedia.org/r/497965
[01:15:41] (CR) GTirloni: "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/486142 (https://phabricator.wikimedia.org/T214512) (owner: GTirloni)
[01:16:29] (CR) Herron: [C: +2] add notify service apache2 on change to administrative file [puppet] - https://gerrit.wikimedia.org/r/497965 (owner: Herron)
[01:23:03] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916
[01:24:09] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time https://phabricator.wikimedia.org/T174916
[01:26:37] (PS2) Herron: gerrit: enable apache modsec ipaddress ban list [puppet] - https://gerrit.wikimedia.org/r/497958
[01:28:09] (CR) Herron: gerrit: enable apache modsec ipaddress ban list (1 comment) [puppet] - https://gerrit.wikimedia.org/r/497958 (owner: Herron)
[01:29:49] (CR) Herron: "PCC looks good https://puppet-compiler.wmflabs.org/compiler1002/15242/cobalt.wikimedia.org/" [puppet] - https://gerrit.wikimedia.org/r/497958 (owner: Herron)
[01:31:12] (PS3) Herron: gerrit: enable apache modsec ipaddress ban list [puppet] - https://gerrit.wikimedia.org/r/497958
[01:32:25] (CR) Herron: [C: +2] gerrit: enable apache modsec ipaddress ban list 
[puppet] - https://gerrit.wikimedia.org/r/497958 (owner: Herron)
[01:33:17] (CR) Paladox: [C: +1] "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/497958 (owner: Herron)
[01:41:09] (PS1) Herron: wikitech: enable modsec waf administrative [puppet] - https://gerrit.wikimedia.org/r/497968
[01:44:33] (CR) Herron: "PCC looks good https://puppet-compiler.wmflabs.org/compiler1002/15244/" [puppet] - https://gerrit.wikimedia.org/r/497968 (owner: Herron)
[01:45:11] (PS2) Herron: wikitech: enable modsec waf administrative [puppet] - https://gerrit.wikimedia.org/r/497968
[01:47:08] (CR) Herron: [C: +2] wikitech: enable modsec waf administrative [puppet] - https://gerrit.wikimedia.org/r/497968 (owner: Herron)
[02:08:27] (PS1) Herron: phabricator: enable modsec waf administrative [puppet] - https://gerrit.wikimedia.org/r/497974
[02:12:28] (CR) Herron: "PCC looks good https://puppet-compiler.wmflabs.org/compiler1002/15245/" [puppet] - https://gerrit.wikimedia.org/r/497974 (owner: Herron)
[02:15:29] (CR) Herron: [C: +2] phabricator: enable modsec waf administrative [puppet] - https://gerrit.wikimedia.org/r/497974 (owner: Herron)
[02:30:27] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 19565576 and 0 seconds
[02:31:43] (PS7) Alex Monk: [WIP] acme-chief: Add script for Designate integration [puppet] - https://gerrit.wikimedia.org/r/497670 (https://phabricator.wikimedia.org/T206922)
[02:31:45] (PS6) Alex Monk: [WIP] Allow acme-chief to provide unified cert [puppet] - https://gerrit.wikimedia.org/r/497929
[02:36:55] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 105232 and 59 seconds
[02:39:13] (PS8) Alex Monk: acme-chief: Add script for Designate integration [puppet] - https://gerrit.wikimedia.org/r/497670 (https://phabricator.wikimedia.org/T206922)
[02:40:14] (PS7) Alex Monk: [WIP] Allow acme-chief to provide unified cert [puppet] - https://gerrit.wikimedia.org/r/497929
[02:41:08] (PS8) Alex Monk: Allow acme-chief to provide unified cert [puppet] - https://gerrit.wikimedia.org/r/497929 (https://phabricator.wikimedia.org/T182927)
[02:47:36] (CR) Alex Monk: Allow acme-chief to provide unified cert (1 comment) [puppet] - https://gerrit.wikimedia.org/r/497929 (https://phabricator.wikimedia.org/T182927) (owner: Alex Monk)
[02:57:51] (PS9) Alex Monk: Allow acme-chief to provide unified cert [puppet] - https://gerrit.wikimedia.org/r/497929 (https://phabricator.wikimedia.org/T182927)
[03:07:59] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 2.358 second response time https://phabricator.wikimedia.org/T174916
[03:11:59] PROBLEM - pdfrender on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916
[03:27:59] PROBLEM - puppet last run on mc1027 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:43:17] PROBLEM - puppet last run on cp1076 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues
[03:54:21] RECOVERY - puppet last run on mc1027 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[04:09:41] RECOVERY - puppet last run on cp1076 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[04:19:25] (PS1) Smalyshev: Deploy WikibaseLexemeCirrusSearch: Part 1 - set up variables [mediawiki-config] - https://gerrit.wikimedia.org/r/497988 (https://phabricator.wikimedia.org/T216206)
[04:19:27] (PS1) Smalyshev: Deploy WikibaseLexemeCirrusSearch: Part 2 - extensionlist [mediawiki-config] - https://gerrit.wikimedia.org/r/497989 (https://phabricator.wikimedia.org/T216206)
[04:19:29] (PS1) Smalyshev: [BETA] Enable WikibaseLexemeCirrusSearch on beta wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/497990 (https://phabricator.wikimedia.org/T216206)
[04:19:31] (PS1) Smalyshev: [BETA] Enable WikibaseLexemeCirrusSearch on beta commons [mediawiki-config] - https://gerrit.wikimedia.org/r/497991 (https://phabricator.wikimedia.org/T216206)
[04:24:19] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 6.833 second response time https://phabricator.wikimedia.org/T174916
[04:26:44] (CR) Smalyshev: [C: -1] "This is waiting for full deployment cycle of the extension." [mediawiki-config] - https://gerrit.wikimedia.org/r/497989 (https://phabricator.wikimedia.org/T216206) (owner: Smalyshev)
[04:28:13] PROBLEM - pdfrender on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916
[04:31:47] PROBLEM - puppet last run on cp1080 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:47:51] PROBLEM - puppet last run on labvirt1005 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues
[04:54:45] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1020 is OK: OK: Less than 50.00% above the threshold [1.0] https://grafana.wikimedia.org/d/000000523/kafka-graphite?refresh=5m&panelId=29&fullscreen&orgId=1
[04:58:11] RECOVERY - puppet last run on cp1080 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[05:03:59] PROBLEM - puppet last run on cloudvirt1028 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:04:43] PROBLEM - puppet last run on mw1346 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:06:53] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 1.959 second response time https://phabricator.wikimedia.org/T174916
[05:10:55] PROBLEM - pdfrender on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916
[05:14:15] RECOVERY - puppet last run on labvirt1005 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[05:28:21] PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:30:23] RECOVERY - puppet last run on cloudvirt1028 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:31:09] RECOVERY - puppet last run on mw1346 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[05:40:41] PROBLEM - puppet last run on netmon1002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues
[05:59:59] RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[06:00:26] (PS2) Marostegui: wikireplica_dns.yaml: Depool dbproxy1011 [puppet] - https://gerrit.wikimedia.org/r/497793
[06:01:14] (CR) Marostegui: [C: +2] wikireplica_dns.yaml: Depool dbproxy1011 [puppet] - https://gerrit.wikimedia.org/r/497793 (owner: Marostegui)
[06:04:17] !log Run wmcs-wikireplica-dns on cloudcontrol1003 to drain dbproxy1011
[06:04:18] marostegui: Failed to log message to wiki. Somebody should check the error logs.
[06:06:50] (PS1) Marostegui: db-eqiad.php: Depool db1101:3317 [mediawiki-config] - https://gerrit.wikimedia.org/r/498003
[06:11:07] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1101:3317 [mediawiki-config] - https://gerrit.wikimedia.org/r/498003 (owner: Marostegui)
[06:12:10] (Merged) jenkins-bot: db-eqiad.php: Depool db1101:3317 [mediawiki-config] - https://gerrit.wikimedia.org/r/498003 (owner: Marostegui)
[06:12:19] RECOVERY - puppet last run on netmon1002 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[06:12:49] !log Upgrade and reboot dbproxy1011
[06:12:49] marostegui: Failed to log message to wiki. Somebody should check the error logs.
[06:13:35] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1101:3317 (duration: 01m 10s)
[06:13:35] marostegui@deploy1001: Failed to log message to wiki. Somebody should check the error logs.
[06:21:02] (PS1) Marostegui: Revert "wikireplica_dns.yaml: Depool dbproxy1011" [puppet] - https://gerrit.wikimedia.org/r/498007
[06:23:35] (CR) Marostegui: [C: +2] Revert "wikireplica_dns.yaml: Depool dbproxy1011" [puppet] - https://gerrit.wikimedia.org/r/498007 (owner: Marostegui)
[06:24:06] log Run wmcs-wikireplica-dns on cloudcontrol1003 to get dbproxy1011 back
[06:24:20] !log Run wmcs-wikireplica-dns on cloudcontrol1003 to get dbproxy1011 back
[06:24:21] marostegui: Failed to log message to wiki. Somebody should check the error logs.
[06:25:37] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 3.647 second response time https://phabricator.wikimedia.org/T174916
[06:28:11] PROBLEM - netbox HTTPS on netmon1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 547 bytes in 0.013 second response time https://wikitech.wikimedia.org/wiki/Netbox
[06:28:23] PROBLEM - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:29:33] PROBLEM - pdfrender on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916
[06:31:35] PROBLEM - puppet last run on mw1319 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/etc/conftool/schema.yaml]
[06:37:07] RECOVERY - netbox HTTPS on netmon1002 is OK: HTTP OK: HTTP/1.1 302 Found - 348 bytes in 0.524 second response time https://wikitech.wikimedia.org/wiki/Netbox
[06:37:21] RECOVERY - Check systemd state on netmon1002 is OK: OK - running: The system is fully operational
[06:49:49] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1101:3317" [mediawiki-config] - https://gerrit.wikimedia.org/r/498009
[06:53:10] (CR) Marostegui: [C: +2] Revert "db-eqiad.php: Depool db1101:3317" [mediawiki-config] - https://gerrit.wikimedia.org/r/498009 (owner: Marostegui)
[06:54:16] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1101:3317" [mediawiki-config] - https://gerrit.wikimedia.org/r/498009 (owner: Marostegui)
[06:55:23] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1101:3317 (duration: 00m 56s)
[06:55:27] marostegui@deploy1001: Failed to log message to wiki. Somebody should check the error logs.
[06:57:53] RECOVERY - puppet last run on mw1319 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:58:31] PROBLEM - MariaDB Slave Lag: m3 on db2078 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 507.65 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[06:59:11] PROBLEM - MariaDB Slave Lag: m3 on db2042 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 532.56 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[06:59:50] ACKNOWLEDGEMENT - MariaDB Slave Lag: m3 on db2042 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 532.56 seconds Marostegui being handled https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[06:59:50] ACKNOWLEDGEMENT - MariaDB Slave Lag: m3 on db2078 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 557.02 seconds Marostegui being handled https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[07:03:58] !log restart pdfrender on scb1002
[07:03:58] elukey: Failed to log message to wiki. Somebody should check the error logs.
[07:05:23] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.005 second response time https://phabricator.wikimedia.org/T174916
[07:30:20] (PS1) Marostegui: db-eqiad.php: Depool db1079 [mediawiki-config] - https://gerrit.wikimedia.org/r/498012
[07:31:30] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1079 [mediawiki-config] - https://gerrit.wikimedia.org/r/498012 (owner: Marostegui)
[07:32:34] (Merged) jenkins-bot: db-eqiad.php: Depool db1079 [mediawiki-config] - https://gerrit.wikimedia.org/r/498012 (owner: Marostegui)
[07:33:47] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1079 (duration: 00m 57s)
[07:33:48] marostegui@deploy1001: Failed to log message to wiki. Somebody should check the error logs.
[07:34:15] !log Deploy schema change on db1079, this will generate lag on labsdb:s8
[07:34:15] marostegui: Failed to log message to wiki. Somebody should check the error logs.
[07:35:14] <_joe_> !log rolling restart of php-fpm to pick up some changes
[07:35:15] _joe_: Failed to log message to wiki. Somebody should check the error logs.
[07:54:27] (CR) Vgutierrez: [C: +2] acme_chief: Remove stretch support [puppet] - https://gerrit.wikimedia.org/r/497431 (owner: Alex Monk)
[07:54:40] (PS4) Vgutierrez: acme_chief: Remove stretch support [puppet] - https://gerrit.wikimedia.org/r/497431 (owner: Alex Monk)
[07:55:39] PROBLEM - Host cp3039.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[07:58:11] vgutierrez: ^ is this something scheduled?
[07:58:25] nope AFAIK
[07:58:31] checking, thanks jijiki
[07:58:37] or related to T203272
[07:58:38] T203272: cp3038, cp3039 - power supply redundancy failure - https://phabricator.wikimedia.org/T203272
[07:58:40] tx
[07:58:41] note that it's the management interface, not the host (luckily)
[07:58:45] yes
[07:59:05] yep, cp3039 is up & running
[07:59:29] so it looks like the mgmt interface got toasted
[07:59:29] labsdb1009.mgmt also is marked as down since 1 day 13h in icinga
[07:59:56] dunno if related.. 
completely different DC
[08:00:18] I doubt it is related, I will make tasks
[08:00:31] (CR) Vgutierrez: [C: +2] acme_chief: Update acme_chief::cert resource to fetch several cert versions [puppet] - https://gerrit.wikimedia.org/r/496148 (https://phabricator.wikimedia.org/T207295) (owner: Vgutierrez)
[08:00:46] (PS8) Vgutierrez: acme_chief: Update acme_chief::cert resource to fetch several cert versions [puppet] - https://gerrit.wikimedia.org/r/496148 (https://phabricator.wikimedia.org/T207295)
[08:00:59] likely unrelated, sure
[08:01:01] RECOVERY - Host cp3039.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.26 ms
[08:01:11] ok just a hiccup
[08:01:35] !log deploying directory based certificates in acme-chief clients - T207295
[08:01:37] vgutierrez: Failed to log message to wiki. Somebody should check the error logs.
[08:01:38] T207295: Expose not-yet-live certs to clients so they can handle OCSP stapling - https://phabricator.wikimedia.org/T207295
[08:02:05] directory based certificates... not my best explanatory message about that...
[08:02:07] jijiki: yeah, it seems like a transient issue, I can login fine
[08:02:11] E_MORNING
[08:02:15] tx ema
[08:04:33] ema: labsdb1009 is probably unrelated yep
[08:06:04] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1079" [mediawiki-config] - https://gerrit.wikimedia.org/r/498013
[08:07:26] (CR) Marostegui: [C: +2] Revert "db-eqiad.php: Depool db1079" [mediawiki-config] - https://gerrit.wikimedia.org/r/498013 (owner: Marostegui)
[08:08:30] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1079" [mediawiki-config] - https://gerrit.wikimedia.org/r/498013 (owner: Marostegui)
[08:09:35] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1079 (duration: 00m 56s)
[08:09:35] marostegui@deploy1001: Failed to log message to wiki. Somebody should check the error logs.
[08:10:43] ACKNOWLEDGEMENT - Host labsdb1009.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Effie Mouzeli Task already filed T218789
[08:10:53] PROBLEM - puppet last run on netmon1002 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/etc/acmecerts/librenms],File[/etc/acmecerts/netbox]
[08:11:01] yup.. that's me... already checking
[08:11:30] ACKNOWLEDGEMENT - SSH labsdb1009.mgmt on labsdb1009.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds Effie Mouzeli Task already filed T218789 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:12:03] PROBLEM - puppet last run on dbmonitor1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/acmecerts/tendril]
[08:13:23] RECOVERY - MariaDB Slave Lag: m3 on db2042 is OK: OK slave_sql_lag Replication lag: 0.51 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[08:13:31] PROBLEM - puppet last run on netmon2001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/etc/acmecerts/librenms],File[/etc/acmecerts/netbox]
[08:13:50] (PS1) Vgutierrez: Revert "acme_chief: Update acme_chief::cert resource to fetch several cert versions" [puppet] - https://gerrit.wikimedia.org/r/498015
[08:14:01] RECOVERY - MariaDB Slave Lag: m3 on db2078 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[08:14:10] (CR) Vgutierrez: [C: +2] Revert "acme_chief: Update acme_chief::cert resource to fetch several cert versions" [puppet] - https://gerrit.wikimedia.org/r/498015 (owner: Vgutierrez)
[08:15:43] PROBLEM - puppet last run on ores1006 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues
[08:16:15] PROBLEM - puppet last run on mx2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/acmecerts/mx]
[08:16:25] PROBLEM - puppet last run on fermium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/acmecerts/lists]
[08:16:47] PROBLEM - puppet last run on labtestservices2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/acmecerts/ldap-labtest]
[08:17:18] that's already fixed with the Revert...
[08:17:33] now let's see why everything works in the testing environments and not in prod :
[08:17:34] :/
[08:17:35] (PS1) Marostegui: db-eqiad.php. Depool db1086 [mediawiki-config] - https://gerrit.wikimedia.org/r/498016
[08:18:33] (CR) Marostegui: [C: +2] db-eqiad.php. Depool db1086 [mediawiki-config] - https://gerrit.wikimedia.org/r/498016 (owner: Marostegui)
[08:19:30] (Merged) jenkins-bot: db-eqiad.php. Depool db1086 [mediawiki-config] - https://gerrit.wikimedia.org/r/498016 (owner: Marostegui)
[08:20:38] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1086 (duration: 00m 56s)
[08:20:38] marostegui@deploy1001: Failed to log message to wiki. Somebody should check the error logs.
[08:22:11] (PS4) WMDE-leszek: Added a setting to define Wikibase entity types that have no RDF output [mediawiki-config] - https://gerrit.wikimedia.org/r/490586 (https://phabricator.wikimedia.org/T213483)
[08:22:45] PROBLEM - puppet last run on icinga1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/acmecerts/icinga]
[08:23:01] PROBLEM - puppet last run on labstore1006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/etc/acmecerts/dumps]
[08:25:03] (CR) WMDE-leszek: "@Jforrester: .wmf.22 is now on group0 and group1, according to https://tools.wmflabs.org/versions/" [mediawiki-config] - https://gerrit.wikimedia.org/r/490586 (https://phabricator.wikimedia.org/T213483) (owner: WMDE-leszek)
[08:33:05] RECOVERY - puppet last run on dbmonitor1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[08:37:13] RECOVERY - puppet last run on netmon1002 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[08:39:49] RECOVERY - puppet last run on netmon2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:39:50] (PS1) Marostegui: Revert "db-eqiad.php. Depool db1086" [mediawiki-config] - https://gerrit.wikimedia.org/r/498018
[08:40:51] PROBLEM - Disk space on prometheus1003 is CRITICAL: DISK CRITICAL - free space: /srv/prometheus/analytics 1205 MB (2% inode=99%)
[08:41:33] (CR) Marostegui: [C: +2] Revert "db-eqiad.php. Depool db1086" [mediawiki-config] - https://gerrit.wikimedia.org/r/498018 (owner: Marostegui)
[08:41:42] oh oh
[08:42:07] RECOVERY - puppet last run on ores1006 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[08:42:09] RECOVERY - Disk space on prometheus1003 is OK: DISK OK
[08:42:35] RECOVERY - puppet last run on mx2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[08:42:38] (Merged) jenkins-bot: Revert "db-eqiad.php. Depool db1086" [mediawiki-config] - https://gerrit.wikimedia.org/r/498018 (owner: Marostegui)
[08:42:41] prometheus1003 is me
[08:42:47] RECOVERY - puppet last run on fermium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:43:46] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1086 (duration: 00m 57s)
[08:43:47] marostegui@deploy1001: Failed to log message to wiki. 
Somebody should check the error logs.
[08:46:11] PROBLEM - puppet last run on dbmonitor1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/home/vgutierrez/tendril]
[08:46:27] lol? seriously?
[08:46:35] my fault obviously
[08:48:01] (PS1) Marostegui: Revert "Revert "db-eqiad.php. Depool db1086"" [mediawiki-config] - https://gerrit.wikimedia.org/r/498020
[08:48:21] RECOVERY - puppet last run on labtestservices2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[08:49:03] RECOVERY - puppet last run on icinga1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[08:49:19] RECOVERY - puppet last run on labstore1006 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[08:51:07] (CR) Marostegui: [C: +2] Revert "Revert "db-eqiad.php. Depool db1086"" [mediawiki-config] - https://gerrit.wikimedia.org/r/498020 (owner: Marostegui)
[08:52:11] (Merged) jenkins-bot: Revert "Revert "db-eqiad.php. Depool db1086"" [mediawiki-config] - https://gerrit.wikimedia.org/r/498020 (owner: Marostegui)
[08:53:20] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1086 for upgrade (duration: 00m 56s)
[08:53:20] marostegui@deploy1001: Failed to log message to wiki. Somebody should check the error logs.
[08:53:24] !log Upgrade db1086
[08:53:24] marostegui: Failed to log message to wiki. Somebody should check the error logs.
[08:56:43] RECOVERY - puppet last run on dbmonitor1001 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[08:57:31] PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:57:57] PROBLEM - Labs LDAP on seaborgium is CRITICAL: Could not bind to the LDAP server https://wikitech.wikimedia.org/wiki/LDAP%23Troubleshooting [08:58:09] (PS1) Vgutierrez: Revert "Revert "acme_chief: Update acme_chief::cert resource to fetch several cert versions"" [puppet] - https://gerrit.wikimedia.org/r/498021 [08:58:14] PROBLEM - toolschecker: Test LDAP for query on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/ldap - 237 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/Help:Toolforge/Monitoring [08:58:29] PROBLEM - Check systemd state on seaborgium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [08:58:41] hey [08:58:43] (CR) jerkins-bot: [V: -1] Revert "Revert "acme_chief: Update acme_chief::cert resource to fetch several cert versions"" [puppet] - https://gerrit.wikimedia.org/r/498021 (owner: Vgutierrez) [08:58:46] LDAP down? [08:59:21] (PS2) Vgutierrez: Revert "Revert "acme_chief: Update acme_chief::cert resource to fetch several cert versions"" [puppet] - https://gerrit.wikimedia.org/r/498021 [08:59:35] arturo: it looks like it :( [08:59:47] great [09:01:42] (CR) Vgutierrez: [C: +2] Revert "Revert "acme_chief: Update acme_chief::cert resource to fetch several cert versions"" [puppet] - https://gerrit.wikimedia.org/r/498021 (owner: Vgutierrez) [09:04:50] Operations, MediaWiki-extensions-WikibaseClient, MediaWiki-extensions-WikibaseRepository, Wikidata, and 5 others: [Story] Implement per property caching for CacheAwarePropertyInfoStore - https://phabricator.wikimedia.org/T218115 (alaa_wmde) [09:06:13] PROBLEM - puppet last run on sodium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. 
Failed resources (up to 3 shown): Service[nginx] [09:09:32] `Mar 21 09:05:43 seaborgium slapd[29378]: main: TLS init def ctx failed: -1` [09:09:42] arturo: noted, issue in acme-cheif [09:09:44] *acme-chief [09:09:49] rolling back, sorry about the noise [09:09:58] cool [09:10:35] misses phab comment updates [09:11:05] PROBLEM - Long running screen/tmux on prometheus2004 is CRITICAL: CRIT: Long running SCREEN process. (user: root PID: 7256, 1728444s 1728000s). [09:14:00] ACKNOWLEDGEMENT - Check systemd state on seaborgium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Arturo Borrero Gonzalez acmechief issue [09:14:00] ACKNOWLEDGEMENT - Labs LDAP on seaborgium is CRITICAL: Could not bind to the LDAP server Arturo Borrero Gonzalez acmechief issue https://wikitech.wikimedia.org/wiki/LDAP%23Troubleshooting [09:15:10] ACKNOWLEDGEMENT - toolschecker: Test LDAP for query on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/ldap - 237 bytes in 0.008 second response time Arturo Borrero Gonzalez acmechief problem https://wikitech.wikimedia.org/wiki/Help:Toolforge/Monitoring [09:15:33] RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational [09:16:04] (CR) Dzahn: [C: +1] "duplicate of https://gerrit.wikimedia.org/r/c/operations/puppet/+/496761 but i don't mind whihc one you used so i can abandon. 
we have to " [puppet] - https://gerrit.wikimedia.org/r/497840 (https://phabricator.wikimedia.org/T217813) (owner: Effie Mouzeli) [09:16:09] (Abandoned) Dzahn: admins: add perf-roots on mediawiki-maintenance servers [puppet] - https://gerrit.wikimedia.org/r/496761 (https://phabricator.wikimedia.org/T217813) (owner: Dzahn) [09:16:13] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1013 is OK: OK: Less than 50.00% above the threshold [1.0] https://grafana.wikimedia.org/d/000000523/kafka-graphite?refresh=5m&panelId=29&fullscreen&orgId=1 [09:17:13] RECOVERY - Labs LDAP on seaborgium is OK: LDAP OK - 0.005 seconds response time https://wikitech.wikimedia.org/wiki/LDAP%23Troubleshooting [09:17:23] PROBLEM - puppet last run on labtestservices2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/acmecerts/ldap-labtest] [09:17:30] RECOVERY - toolschecker: Test LDAP for query on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.362 second response time https://wikitech.wikimedia.org/wiki/Help:Toolforge/Monitoring [09:17:40] !log Upgrade and reboot db1086 [09:17:41] marostegui: Failed to log message to wiki. Somebody should check the error logs. [09:17:45] RECOVERY - Check systemd state on seaborgium is OK: OK - running: The system is fully operational [09:18:23] PROBLEM - puppet last run on labstore1006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/etc/acmecerts/dumps] [09:18:28] (PS1) Vgutierrez: Revert "Revert "Revert "acme_chief: Update acme_chief::cert resource to fetch several cert versions""" [puppet] - https://gerrit.wikimedia.org/r/498023 [09:18:55] (CR) Vgutierrez: [C: +2] Revert "Revert "Revert "acme_chief: Update acme_chief::cert resource to fetch several cert versions""" [puppet] - https://gerrit.wikimedia.org/r/498023 (owner: Vgutierrez) [09:20:21] PROBLEM - puppet last run on seaborgium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/acmecerts/ldap] [09:22:26] (PS1) Jcrespo: mariadb-backups: Make sure retention is handled correctly [puppet] - https://gerrit.wikimedia.org/r/498024 (https://phabricator.wikimedia.org/T210292) [09:23:07] PROBLEM - puppet last run on install2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/acmecerts/apt] [09:23:21] PROBLEM - puppet last run on icinga1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/acmecerts/icinga] [09:23:23] (CR) jerkins-bot: [V: -1] mariadb-backups: Make sure retention is handled correctly [puppet] - https://gerrit.wikimedia.org/r/498024 (https://phabricator.wikimedia.org/T210292) (owner: Jcrespo) [09:23:37] PROBLEM - puppet last run on archiva1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/acmecerts/archiva] [09:23:47] PROBLEM - puppet last run on install1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/etc/acmecerts/apt] [09:25:15] (PS1) Marostegui: db-eqiad.php: Slowly repool db1086 [mediawiki-config] - https://gerrit.wikimedia.org/r/498025 [09:25:29] Operations, Traffic, Patch-For-Review: Partial cache_upload traffic switchover to ATS and switchback to Varnish - https://phabricator.wikimedia.org/T213263 (ema) We're currently using swift-rw (eqiad only) as the origin server for upload cache misses. Thumb traffic can however be served active/activ... [09:25:35] RECOVERY - puppet last run on seaborgium is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [09:26:58] (CR) Marostegui: [C: +2] db-eqiad.php: Slowly repool db1086 [mediawiki-config] - https://gerrit.wikimedia.org/r/498025 (owner: Marostegui) [09:27:19] RECOVERY - puppet last run on sodium is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:27:44] !log rolling reboot of maps servers in codfw for kernel update [09:27:44] moritzm: Failed to log message to wiki. Somebody should check the error logs. [09:27:55] (Merged) jenkins-bot: db-eqiad.php: Slowly repool db1086 [mediawiki-config] - https://gerrit.wikimedia.org/r/498025 (owner: Marostegui) [09:28:24] (Abandoned) Dzahn: k8s::flannel: remove upstart, use systemd::service instead [puppet] - https://gerrit.wikimedia.org/r/482118 (https://phabricator.wikimedia.org/T194724) (owner: Dzahn) [09:28:59] RECOVERY - puppet last run on install1002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:29:14] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly repool db1086 after mysql upgrade (duration: 00m 56s) [09:29:14] marostegui@deploy1001: Failed to log message to wiki. Somebody should check the error logs. 
[09:29:36] (PS2) Dzahn: jenkins: add Icinga notes URLs [puppet] - https://gerrit.wikimedia.org/r/497511 [09:30:42] (PS1) Ema: ATS: use 'swift-ro' as the origin server for thumb traffic [puppet] - https://gerrit.wikimedia.org/r/498028 (https://phabricator.wikimedia.org/T213263) [09:30:51] (PS1) Jcrespo: mariadb-snapshots: Allow the option to only postprocess snapshots [puppet] - https://gerrit.wikimedia.org/r/498029 (https://phabricator.wikimedia.org/T210292) [09:31:29] (CR) Dzahn: [C: +2] jenkins: add Icinga notes URLs [puppet] - https://gerrit.wikimedia.org/r/497511 (owner: Dzahn) [09:31:57] (CR) Gehel: [C: -1] "Minor comment inline. I'm not thrilled by the duplication this introduces, but I'm not sure I have a better way of doing it." (4 comments) [puppet] - https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921) (owner: Mathew.onipe) [09:32:47] PROBLEM - puppet last run on mw1282 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [09:33:14] (PS2) Dzahn: pybal: add Icinga notes URL [puppet] - https://gerrit.wikimedia.org/r/497509 [09:35:08] runs puppet on mw1282 and it's fine [09:35:44] (CR) Dzahn: [C: +2] pybal: add Icinga notes URL [puppet] - https://gerrit.wikimedia.org/r/497509 (owner: Dzahn) [09:37:02] (PS2) Dzahn: redis: add Icinga notes URL [puppet] - https://gerrit.wikimedia.org/r/497508 [09:38:03] RECOVERY - puppet last run on mw1282 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:39:29] (PS2) Ema: ATS: use 'swift-ro' as the origin server for thumb traffic [puppet] - https://gerrit.wikimedia.org/r/498028 (https://phabricator.wikimedia.org/T213263) [09:39:31] (PS1) Ema: ATS: SystemTap probe for origin server connections [puppet] - https://gerrit.wikimedia.org/r/498031 (https://phabricator.wikimedia.org/T213263) [09:39:33] (CR) Marostegui: [C: +1] mariadb-snapshots: Allow the option to only postprocess snapshots [puppet] - https://gerrit.wikimedia.org/r/498029 (https://phabricator.wikimedia.org/T210292) (owner: Jcrespo) [09:40:02] (CR) Dzahn: [C: +2] redis: add Icinga notes URL [puppet] - https://gerrit.wikimedia.org/r/497508 (owner: Dzahn) [09:42:20] !log Depool scb* in codfw from serving cxserver, finishing its migration to k8s - T213195 [09:42:22] jijiki: Failed to log message to wiki. Somebody should check the error logs. [09:42:25] T213195: Migrate cxserver to kubernetes - https://phabricator.wikimedia.org/T213195 [09:42:47] !log jiji@cumin1001 conftool action : set/pooled=no; selector: dc=codfw,service=cxserver,cluster=scb,name=scb.* [09:42:47] jiji@cumin1001: Failed to log message to wiki. Somebody should check the error logs. 
[09:45:59] (PS2) Dzahn: dumps: add Icinga notes URL [puppet] - https://gerrit.wikimedia.org/r/497510 [09:47:30] (PS1) Marostegui: db-eqiad.php: Fully repool db1086 [mediawiki-config] - https://gerrit.wikimedia.org/r/498033 [09:48:39] (CR) Arturo Borrero Gonzalez: [C: +1] labmon: Delete archived Graphite directories [puppet] - https://gerrit.wikimedia.org/r/496130 (https://phabricator.wikimedia.org/T218185) (owner: GTirloni) [09:48:43] RECOVERY - puppet last run on labtestservices2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:49:09] RECOVERY - puppet last run on install2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:49:23] RECOVERY - puppet last run on icinga1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:49:41] RECOVERY - puppet last run on archiva1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:49:43] RECOVERY - puppet last run on labstore1006 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:49:46] (CR) ArielGlenn: [C: +1] "That's the best url for now, thanks." 
[puppet] - https://gerrit.wikimedia.org/r/497510 (owner: Dzahn) [09:50:04] Operations, Performance-Team, Traffic, media-storage, Patch-For-Review: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (Gilles) Open→Stalled [09:50:10] Operations, Performance-Team, Traffic, Patch-For-Review: Normalize thumbnail request URLs in Varnish to avoid cachebusting - https://phabricator.wikimedia.org/T216339 (Gilles) Open→Stalled [09:50:14] Operations, Performance-Team, Traffic, media-storage, Patch-For-Review: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (Gilles) [09:50:52] (CR) Dzahn: [C: +2] dumps: add Icinga notes URL [puppet] - https://gerrit.wikimedia.org/r/497510 (owner: Dzahn) [09:51:06] (PS3) Dzahn: dumps: add Icinga notes URL [puppet] - https://gerrit.wikimedia.org/r/497510 [09:57:06] (CR) Marostegui: [C: +2] db-eqiad.php: Fully repool db1086 [mediawiki-config] - https://gerrit.wikimedia.org/r/498033 (owner: Marostegui) [09:58:09] (Merged) jenkins-bot: db-eqiad.php: Fully repool db1086 [mediawiki-config] - https://gerrit.wikimedia.org/r/498033 (owner: Marostegui) [09:59:06] (PS3) Hashar: scap: add logging to clean > prune-git-branches [mediawiki-config] - https://gerrit.wikimedia.org/r/497781 [09:59:28] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Fully repool db1086 (duration: 00m 58s) [09:59:28] marostegui@deploy1001: Failed to log message to wiki. Somebody should check the error logs. 
[10:04:11] (CR) Elukey: [C: +2] yarn: allow the configuration of maximum app ids retained in HDFS/Zookeeper [puppet/cdh] - https://gerrit.wikimedia.org/r/497702 (https://phabricator.wikimedia.org/T218758) (owner: Elukey) [10:06:07] (PS1) Elukey: Update cdh module to its latest version [puppet] - https://gerrit.wikimedia.org/r/498038 [10:09:24] (CR) Elukey: [C: +2] Update cdh module to its latest version [puppet] - https://gerrit.wikimedia.org/r/498038 (owner: Elukey) [10:15:31] (PS1) Marostegui: db-eqiad.php: Depool db1090:3317 [mediawiki-config] - https://gerrit.wikimedia.org/r/498043 [10:17:16] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1090:3317 [mediawiki-config] - https://gerrit.wikimedia.org/r/498043 (owner: Marostegui) [10:17:46] (Merged) jenkins-bot: db-eqiad.php: Depool db1090:3317 [mediawiki-config] - https://gerrit.wikimedia.org/r/498043 (owner: Marostegui) [10:19:01] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1090:3317 (duration: 00m 56s) [10:19:01] marostegui@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [10:22:13] (PS1) Elukey: profile::hadoop::common: add new Yarn RM properties [puppet] - https://gerrit.wikimedia.org/r/498044 (https://phabricator.wikimedia.org/T218758) [10:23:23] !log rebooting labtestcontrol2001 for kernel update [10:23:24] moritzm: Failed to log message to wiki. Somebody should check the error logs. 
[10:24:20] (CR) Filippo Giunchedi: [C: +1] "\o/" [puppet] - https://gerrit.wikimedia.org/r/498028 (https://phabricator.wikimedia.org/T213263) (owner: Ema) [10:26:24] (CR) Elukey: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/15247/" [puppet] - https://gerrit.wikimedia.org/r/498044 (https://phabricator.wikimedia.org/T218758) (owner: Elukey) [10:28:19] (PS3) Ema: ATS: use 'swift-ro' as the origin server for thumb traffic [puppet] - https://gerrit.wikimedia.org/r/498028 (https://phabricator.wikimedia.org/T213263) [10:30:05] (CR) Ema: [C: +2] ATS: use 'swift-ro' as the origin server for thumb traffic [puppet] - https://gerrit.wikimedia.org/r/498028 (https://phabricator.wikimedia.org/T213263) (owner: Ema) [10:37:37] (PS1) Muehlenhoff: Add curl to standard packages [puppet] - https://gerrit.wikimedia.org/r/498045 (https://phabricator.wikimedia.org/T213527) [10:39:11] (PS1) Vgutierrez: acme-chief-api: Fix file_metadata for /puppet/v3/file_metadata/acmedata/{certname}/{part} [software/acme-chief] - https://gerrit.wikimedia.org/r/498046 (https://phabricator.wikimedia.org/T218862) [10:40:38] (CR) jerkins-bot: [V: -1] acme-chief-api: Fix file_metadata for /puppet/v3/file_metadata/acmedata/{certname}/{part} [software/acme-chief] - https://gerrit.wikimedia.org/r/498046 (https://phabricator.wikimedia.org/T218862) (owner: Vgutierrez) [10:40:41] (PS2) Muehlenhoff: Add curl to standard packages [puppet] - https://gerrit.wikimedia.org/r/498045 (https://phabricator.wikimedia.org/T213527) [10:46:57] !log restart hadoop yarn resource managers on an-master100[1,2] to pick up new settings [10:46:58] elukey: Failed to log message to wiki. Somebody should check the error logs. 
[10:47:12] this should not generate all the yarn alarms happened the other day [10:47:27] (more context in https://phabricator.wikimedia.org/T218758) [10:47:41] if it does, my fault but nothing is really on fire, only a ton of noise [10:49:53] (downtimed the hadoop workers as precaution) [10:52:29] !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=prometheus2003.codfw.wmnet [10:52:29] filippo@puppetmaster1001: Failed to log message to wiki. Somebody should check the error logs. [10:57:12] !log filippo@puppetmaster1001 conftool action : set/pooled=no; selector: name=prometheus2004.codfw.wmnet [10:57:12] filippo@puppetmaster1001: Failed to log message to wiki. Somebody should check the error logs. [10:59:12] (CR) Muehlenhoff: Add --disable-user option to offboard script (1 comment) [puppet] - https://gerrit.wikimedia.org/r/497758 (owner: Muehlenhoff) [10:59:14] (PS3) Muehlenhoff: Add --disable-user option to offboard script [puppet] - https://gerrit.wikimedia.org/r/497758 [11:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Time to snap out of that daydream and deploy European Mid-day SWAT(Max 6 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190321T1100). [11:00:04] Ammarpad: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[11:00:44] I can swat today [11:01:12] (CR) Zfilipin: [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/495494 (https://phabricator.wikimedia.org/T217929) (owner: Ammarpad) [11:02:14] (Merged) jenkins-bot: Add new throttle rule for LMU Edit-a-thon [mediawiki-config] - https://gerrit.wikimedia.org/r/495494 (https://phabricator.wikimedia.org/T217929) (owner: Ammarpad) [11:04:21] !log zfilipin@deploy1001 Synchronized wmf-config/throttle.php: SWAT: [[gerrit:495494|Add new throttle rule for LMU Edit-a-thon (T217929)]] (duration: 00m 57s) [11:04:22] zfilipin@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [11:04:23] T217929: Wikimedia-Site-requests: LMU Edit-a-thon - https://phabricator.wikimedia.org/T217929 [11:04:44] Ammarpad: the patch is deployed [11:04:58] !log EU SWAT finished [11:04:58] zeljkof: Failed to log message to wiki. Somebody should check the error logs. [11:13:29] (CR) Filippo Giunchedi: [C: +1] Add curl to standard packages [puppet] - https://gerrit.wikimedia.org/r/498045 (https://phabricator.wikimedia.org/T213527) (owner: Muehlenhoff) [11:17:43] (PS2) Vgutierrez: acme-chief-api: Fix file_metadata for /puppet/v3/file_metadata/acmedata/{certname}/{part} [software/acme-chief] - https://gerrit.wikimedia.org/r/498046 (https://phabricator.wikimedia.org/T218862) [11:18:45] (CR) Jbond: [C: -1] "curl is also in" [puppet] - https://gerrit.wikimedia.org/r/498045 (https://phabricator.wikimedia.org/T213527) (owner: Muehlenhoff) [11:19:46] (CR) Vgutierrez: "PS1 fixes the testing issue described in T218862, PS2 fixes the issue itself and updates test_get_metadata cause it got deprecated and unn" [software/acme-chief] - https://gerrit.wikimedia.org/r/498046 (https://phabricator.wikimedia.org/T218862) (owner: Vgutierrez) [11:21:41] (CR) Muehlenhoff: "The contint manifests are only running in Cloud VPS and not in production, I'm not sure if they actually use our 
standard packages, adding" [puppet] - https://gerrit.wikimedia.org/r/498045 (https://phabricator.wikimedia.org/T213527) (owner: Muehlenhoff) [11:21:42] (CR) Jbond: [C: -1] "> Patch Set 2: Code-Review-1" [puppet] - https://gerrit.wikimedia.org/r/498045 (https://phabricator.wikimedia.org/T213527) (owner: Muehlenhoff) [11:22:58] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/497318 (owner: Jbond) [11:23:33] (CR) Arturo Borrero Gonzalez: [C: +1] "I can't think of any drawbacks of this patch from Cloud VPS point of view, so +1" [puppet] - https://gerrit.wikimedia.org/r/498045 (https://phabricator.wikimedia.org/T213527) (owner: Muehlenhoff) [11:23:57] (PS4) Jbond: Ensure auto_restart is a dependency of auto_restart_service [puppet] - https://gerrit.wikimedia.org/r/497318 [11:25:28] (CR) Jbond: [C: +2] Ensure auto_restart is a dependency of auto_restart_service [puppet] - https://gerrit.wikimedia.org/r/497318 (owner: Jbond) [11:25:30] (CR) Arturo Borrero Gonzalez: "I'm not familiar with that variable $LABS_NETWORKS. We should know prometheus server address, right? why not using that directly?" 
[puppet] - https://gerrit.wikimedia.org/r/486142 (https://phabricator.wikimedia.org/T214512) (owner: GTirloni) [11:26:19] (CR) Arturo Borrero Gonzalez: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/495670 (https://phabricator.wikimedia.org/T218009) (owner: GTirloni) [11:27:17] (CR) Arturo Borrero Gonzalez: "Anyway, let me know if you want this merged" [puppet] - https://gerrit.wikimedia.org/r/496991 (https://phabricator.wikimedia.org/T217280) (owner: BryanDavis) [11:27:22] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/497758 (owner: Muehlenhoff) [11:28:09] (CR) Alex Monk: [C: +2] acme-chief-api: Fix file_metadata for /puppet/v3/file_metadata/acmedata/{certname}/{part} [software/acme-chief] - https://gerrit.wikimedia.org/r/498046 (https://phabricator.wikimedia.org/T218862) (owner: Vgutierrez) [11:29:41] (Merged) jenkins-bot: acme-chief-api: Fix file_metadata for /puppet/v3/file_metadata/acmedata/{certname}/{part} [software/acme-chief] - https://gerrit.wikimedia.org/r/498046 (https://phabricator.wikimedia.org/T218862) (owner: Vgutierrez) [11:29:54] (CR) Muehlenhoff: [C: +1] "Looks good" [debs/debdeploy] - https://gerrit.wikimedia.org/r/497496 (owner: Jbond) [11:30:17] (PS2) Arturo Borrero Gonzalez: openstack proxyleaks: Rm check for old proxy-eqiad [puppet] - https://gerrit.wikimedia.org/r/497590 (owner: Alex Monk) [11:30:43] Operations, SRE-Access-Requests, Patch-For-Review: Grant root on MediaWiki maintenance hosts to perf-roots - https://phabricator.wikimedia.org/T217813 (jijiki) [11:31:08] (CR) Arturo Borrero Gonzalez: [C: +2] openstack proxyleaks: Rm check for old proxy-eqiad [puppet] - https://gerrit.wikimedia.org/r/497590 (owner: Alex Monk) [11:31:51] (PS4) Muehlenhoff: Add --disable-user option to offboard script [puppet] - https://gerrit.wikimedia.org/r/497758 [11:34:13] (CR) Muehlenhoff: [C: +2] Add --disable-user option to 
offboard script [puppet] - https://gerrit.wikimedia.org/r/497758 (owner: Muehlenhoff) [11:36:20] !log gerrit2001 (not the master prod server)- scheduled downtime and rebooting for upgrade [11:36:21] mutante: Failed to log message to wiki. Somebody should check the error logs. [11:36:23] (PS1) MarcoAurelio: offboard-user: fix typo on instructions [puppet] - https://gerrit.wikimedia.org/r/498052 [11:36:58] (PS1) Vgutierrez: Release 0.15 [software/acme-chief] - https://gerrit.wikimedia.org/r/498053 (https://phabricator.wikimedia.org/T218862) [11:37:22] (PS2) MarcoAurelio: offboard-user: fix typo on instructions [puppet] - https://gerrit.wikimedia.org/r/498052 [11:37:48] (CR) Alex Monk: [C: +2] Release 0.15 [software/acme-chief] - https://gerrit.wikimedia.org/r/498053 (https://phabricator.wikimedia.org/T218862) (owner: Vgutierrez) [11:38:00] (PS1) Alex Monk: acme-chief-api: Fix file_metadata for /puppet/v3/file_metadata/acmedata/{certname}/{part} [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/498054 (https://phabricator.wikimedia.org/T218862) [11:38:09] (CR) Alex Monk: [C: +2] acme-chief-api: Fix file_metadata for /puppet/v3/file_metadata/acmedata/{certname}/{part} [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/498054 (https://phabricator.wikimedia.org/T218862) (owner: Alex Monk) [11:38:18] (PS1) Alex Monk: Release 0.15 [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/498055 (https://phabricator.wikimedia.org/T218862) [11:38:48] (CR) Arturo Borrero Gonzalez: "The main problem I see with this patch now is that there are a lot of changes not related to the systemd timers conversion. 
I'm referring " [puppet] - https://gerrit.wikimedia.org/r/490197 (https://phabricator.wikimedia.org/T210818) (owner: GTirloni) [11:39:41] (Merged) jenkins-bot: Release 0.15 [software/acme-chief] - https://gerrit.wikimedia.org/r/498053 (https://phabricator.wikimedia.org/T218862) (owner: Vgutierrez) [11:39:43] (Merged) jenkins-bot: acme-chief-api: Fix file_metadata for /puppet/v3/file_metadata/acmedata/{certname}/{part} [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/498054 (https://phabricator.wikimedia.org/T218862) (owner: Alex Monk) [11:40:33] (CR) Arturo Borrero Gonzalez: "Do you think we can merge this now @bstorm?" [puppet] - https://gerrit.wikimedia.org/r/494516 (https://phabricator.wikimedia.org/T208843) (owner: BryanDavis) [11:41:54] (CR) Jbond: [C: +1] "now that https://gerrit.wikimedia.org/r/c/operations/puppet/+/497318 is merged i removed the pre_conditions from this. i also noticed tha" [puppet] - https://gerrit.wikimedia.org/r/496719 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff) [11:41:59] (CR) Muehlenhoff: "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/498052 (owner: MarcoAurelio) [11:42:13] (CR) Arturo Borrero Gonzalez: "What are we going to do with this patch @andrew?" 
[puppet] - https://gerrit.wikimedia.org/r/489230 (https://phabricator.wikimedia.org/T215211) (owner: Andrew Bogott) [11:42:16] (CR) Muehlenhoff: [C: +2] offboard-user: fix typo on instructions [puppet] - https://gerrit.wikimedia.org/r/498052 (owner: MarcoAurelio) [11:43:16] ACKNOWLEDGEMENT - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: daniel_zahn https://phabricator.wikimedia.org/T218307 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:43:16] ACKNOWLEDGEMENT - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 53, down: 1, dormant: 0, excluded: 0, unused: 0: daniel_zahn https://phabricator.wikimedia.org/T218307 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:43:21] (CR) Vgutierrez: [C: +2] Release 0.15 [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/498055 (https://phabricator.wikimedia.org/T218862) (owner: Alex Monk) [11:43:25] (PS8) Jbond: Enable base::service_auto_restart for rsyslog [puppet] - https://gerrit.wikimedia.org/r/496719 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff) [11:44:16] thanks for merging https://gerrit.wikimedia.org/r/498052 moritzm :) [11:44:19] (PS2) Vgutierrez: Release 0.15 [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/498055 (https://phabricator.wikimedia.org/T218862) (owner: Alex Monk) [11:44:43] I was looking for the hosts in which openldap is run but it looks it's not only one [11:44:51] (for puppet-compiler) [11:45:37] hauskatze: look for the ones having openldap::management as opposed to the ones actually running openldap [11:46:20] hauskatze: that will get you to mwmaint : modules/role/manifests/mediawiki/maintenance.pp: include ::profile::openldap::management [11:47:12] so it's mwmaint1002 and mwmaint2001 [11:47:12] mutante: looking at 
https://codesearch.wmflabs.org/operations/?q=openldap&i=nope&files=&repos= I got some labstest, seaborgium and several others [11:47:31] with ::management it's indeed the maintenance servers [11:47:37] but no traces on site.pp I could find [11:47:40] hauskatze: yes, but those are not the management ones.. those are the _actual_ones [11:47:43] oh puppet :) [11:47:45] that the others manage [11:48:28] (PS1) Vgutierrez: debian: Add release 0.15 to changelog [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/498056 (https://phabricator.wikimedia.org/T218862) [11:48:41] hauskatze: yea, it's not directly in site.pp because of the include above [11:49:07] I see mediawiki_maintenance.yaml so indeed it needs to be the mwmaint servers [11:49:40] it's merged already so nothing that I could do now :) [11:49:43] PROBLEM - YARN NodeManager Node-State on an-worker1089 is CRITICAL: CRITICAL: YARN NodeManager an-worker1089.eqiad.wmnet:8041 Node-State: Could not find the node report for node id : an-worker1089.eqiad.wmnet:8041 [11:49:43] PROBLEM - YARN NodeManager Node-State on an-worker1082 is CRITICAL: CRITICAL: YARN NodeManager an-worker1082.eqiad.wmnet:8041 Node-State: Could not find the node report for node id : an-worker1082.eqiad.wmnet:8041 [11:49:43] PROBLEM - Hadoop NodeManager on an-worker1092 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [11:49:57] PROBLEM - Hadoop NodeManager on an-worker1089 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [11:50:06] yep yep, just wanted to answer for next time because i just did the same [11:50:15] checking yarn [11:50:21] tx [11:50:28] mutante: thanks :) you're very kind [11:50:34] zeljkof: are we having a train today ? 
[11:50:44] (CR) jenkins-bot: Release 0.15 [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/498055 (https://phabricator.wikimedia.org/T218862) (owner: Alex Monk) [11:50:49] PROBLEM - YARN NodeManager Node-State on an-worker1092 is CRITICAL: CRITICAL: YARN NodeManager an-worker1092.eqiad.wmnet:8041 Node-State: Could not find the node report for node id : an-worker1092.eqiad.wmnet:8041 [11:50:50] there is a server I need to pool back [11:50:52] (PS32) Mathew.onipe: elasticsearch: add profile for icinga checks [puppet] - https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921) [11:50:53] PROBLEM - Hadoop NodeManager on an-worker1082 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [11:51:04] (CR) Mathew.onipe: elasticsearch: add profile for icinga checks (2 comments) [puppet] - https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921) (owner: Mathew.onipe) [11:51:05] jijiki: yes, in about an hour, if nothing explodes :) [11:51:21] ok, I will do it know [11:51:23] now* [11:51:25] tx [11:51:28] is an hour enough for you to do it ? 
[11:51:32] looks like it is :) [11:51:35] hehe [11:51:42] let me know if there are any problems, I can hold the train [11:51:57] but if there are no problems, I would like to get wmf.22 to all wikis today [11:52:47] sure [11:53:23] RECOVERY - YARN NodeManager Node-State on an-worker1092 is OK: OK: YARN NodeManager an-worker1092.eqiad.wmnet:8041 Node-State: RUNNING [11:53:27] RECOVERY - Hadoop NodeManager on an-worker1082 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [11:53:33] RECOVERY - Hadoop NodeManager on an-worker1092 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [11:53:35] RECOVERY - YARN NodeManager Node-State on an-worker1082 is OK: OK: YARN NodeManager an-worker1082.eqiad.wmnet:8041 Node-State: RUNNING [11:53:35] RECOVERY - YARN NodeManager Node-State on an-worker1089 is OK: OK: YARN NodeManager an-worker1089.eqiad.wmnet:8041 Node-State: RUNNING [11:53:47] (03Restored) 10Alexandros Kosiaris: k8s::flannel: remove upstart, use systemd::service instead [puppet] - 10https://gerrit.wikimedia.org/r/482118 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [11:53:49] RECOVERY - Hadoop NodeManager on an-worker1089 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [11:53:52] 10Operations, 10Parsoid-PHP: Install PHP7 on scandium - https://phabricator.wikimedia.org/T213493 (10Dzahn) a:03Dzahn [11:54:12] !log restart yarn node managers on an-worker10[82,89,92] - shutdown after a long yarn failover and only now downtime is expired [11:54:12] elukey: Failed to log message to wiki. Somebody should check the error logs. 
[11:54:26] (03CR) 10Vgutierrez: [C: 03+2] debian: Add release 0.15 to changelog [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/498056 (https://phabricator.wikimedia.org/T218862) (owner: 10Vgutierrez) [11:55:07] (03PS1) 10MarcoAurelio: WIP: contint: change `/r/p` to `/r/` for gerrit links [puppet] - 10https://gerrit.wikimedia.org/r/498057 [11:56:32] (03Merged) 10jenkins-bot: debian: Add release 0.15 to changelog [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/498056 (https://phabricator.wikimedia.org/T218862) (owner: 10Vgutierrez) [11:57:50] (03PS2) 10MarcoAurelio: contint: change `/r/p/` to `/r/` for gerrit links [puppet] - 10https://gerrit.wikimedia.org/r/498057 (https://phabricator.wikimedia.org/T218844) [11:58:12] (03CR) 10MarcoAurelio: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/498057 (https://phabricator.wikimedia.org/T218844) (owner: 10MarcoAurelio) [11:58:53] (03PS1) 10Arturo Borrero Gonzalez: aptrepo: include python-cliff into jessie-wikimedia/openstack-mitaka-jessie [puppet] - 10https://gerrit.wikimedia.org/r/498058 (https://phabricator.wikimedia.org/T216497) [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190321T1200) [12:00:15] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] aptrepo: include python-cliff into jessie-wikimedia/openstack-mitaka-jessie [puppet] - 10https://gerrit.wikimedia.org/r/498058 (https://phabricator.wikimedia.org/T216497) (owner: 10Arturo Borrero Gonzalez) [12:01:29] (03CR) 10Muehlenhoff: "Thanks for fixing up the tests! I'm adding a few more of the logging folks for input on the actual autp restart." [puppet] - 10https://gerrit.wikimedia.org/r/496719 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [12:02:29] !log uploaded acme-chief 0.15 to apt.wikimedia.org (buster) - T218862 [12:02:31] vgutierrez: Failed to log message to wiki. Somebody should check the error logs. 
[12:02:32] T218862: acme-chief >0.13 generates wrong metadata for the endpoint used in file based deployment layout - https://phabricator.wikimedia.org/T218862 [12:02:48] (03CR) 10Dzahn: [C: 03+1] "confirmed Gerrit changed the URLs like that in the past and new links work to git clone from" [puppet] - 10https://gerrit.wikimedia.org/r/498057 (https://phabricator.wikimedia.org/T218844) (owner: 10MarcoAurelio) [12:03:21] 10Operations, 10ops-codfw, 10serviceops, 10User-jijiki: mw2206.codfw.wmnet memory issues - https://phabricator.wikimedia.org/T215415 (10jijiki) @Papaul then the issue is somewhere else: ` [Thu Mar 21 11:02:31 2019] mce: [Hardware Error]: Machine check events logged [Thu Mar 21 11:02:31 2019] EDAC sbridge... [12:05:21] 10Operations, 10ops-codfw, 10serviceops, 10User-jijiki: mw2206.codfw.wmnet memory issues - https://phabricator.wikimedia.org/T215415 (10MoritzMuehlenhoff) It could be simply a broken CPU? If we have such the CPU type in a decom host, we could loot it from there. [12:05:23] (03PS1) 10Vgutierrez: "Revert "Revert "acme_chief: Update acme_chief::cert resource to fetch several cert versions"""" [puppet] - 10https://gerrit.wikimedia.org/r/498059 [12:06:11] (03CR) 10jerkins-bot: [V: 04-1] "Revert "Revert "acme_chief: Update acme_chief::cert resource to fetch several cert versions"""" [puppet] - 10https://gerrit.wikimedia.org/r/498059 (owner: 10Vgutierrez) [12:06:13] (03CR) 10Vgutierrez: [C: 04-2] "Merge after upgrading to acme-chief 0.15 and disabling puppet in the acme-chief clients as a precaution measure." [puppet] - 10https://gerrit.wikimedia.org/r/498059 (owner: 10Vgutierrez) [12:08:31] !log T216497 add python-cliff to jessie-wikimedia/openstack-mitaka-jessie [12:08:33] arturo: Failed to log message to wiki. Somebody should check the error logs. 
[12:08:36] T216497: CloudVPS: workaround archival of jessie-backports repo - https://phabricator.wikimedia.org/T216497 [12:09:05] (03PS2) 10Vgutierrez: acme_chief: Update acme_chief::cert resource to fetch several cert versions [puppet] - 10https://gerrit.wikimedia.org/r/498059 [12:09:47] maybe once we have !log fixed we can .. go through our (channel) logs and re-log all the !log lines :p [12:10:42] mutante: puppet compiler says okay on contint1001/2001 but has also some .err files [12:11:01] I was going to copy paste them to SAL directly so we get the right times [12:11:06] -\_o_/- [12:11:25] hauskatze: unfortunately can't always trust it.. it has said OK to me before but when i click the actual links it was clearly failed :/ [12:11:59] well I trust the patch will be carefully reviewed before being merged :) [12:11:59] apergos: oh.. even better [12:12:26] but i want to wait until it's fixed (or maybe do a batch at the end of each week, if we're not back by tomorrow evening) [12:12:31] PROBLEM - puppet last run on restbase1014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:12:35] hauskatze: careful = don't trust jenkins vote on experimental .. yea [12:12:47] apergos: *nod* [12:12:58] (03PS8) 10GTirloni: openstack - Convert cron jobs to systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/490197 (https://phabricator.wikimedia.org/T210818) [12:14:41] hauskatze: you can add the person from https://phabricator.wikimedia.org/T218844#5042719 [12:15:23] hauskatze: no errors in this case https://puppet-compiler.wmflabs.org/compiler1001/129/ [12:15:40] which error lines ? 
[12:16:17] https://puppet-compiler.wmflabs.org/compiler1001/129/contint1001.wikimedia.org/change.contint1001.wikimedia.org.err [12:16:26] (03PS2) 10Ema: ATS: SystemTap probe for origin server connections [puppet] - 10https://gerrit.wikimedia.org/r/498031 (https://phabricator.wikimedia.org/T213263) [12:16:43] last two links at https://puppet-compiler.wmflabs.org/compiler1001/129/contint1001.wikimedia.org/ [12:16:54] and the same for the contint2001 one [12:17:08] like I said, I don't understand PC much [12:17:50] (03CR) 10GTirloni: "Removed Python code formatting changes." [puppet] - 10https://gerrit.wikimedia.org/r/490197 (https://phabricator.wikimedia.org/T210818) (owner: 10GTirloni) [12:17:51] (03CR) 10Ema: [C: 03+2] ATS: SystemTap probe for origin server connections [puppet] - 10https://gerrit.wikimedia.org/r/498031 (https://phabricator.wikimedia.org/T213263) (owner: 10Ema) [12:18:52] (03PS6) 10Arturo Borrero Gonzalez: Re-apply "openstack::clientpackages::common: include python3 packages" [puppet] - 10https://gerrit.wikimedia.org/r/497009 (https://phabricator.wikimedia.org/T218423) (owner: 10Alex Monk) [12:19:46] (03PS3) 10GTirloni: labmon: Delete archived Graphite directories [puppet] - 10https://gerrit.wikimedia.org/r/496130 (https://phabricator.wikimedia.org/T218185) [12:22:28] (03CR) 10GTirloni: [C: 03+2] labmon: Delete archived Graphite directories [puppet] - 10https://gerrit.wikimedia.org/r/496130 (https://phabricator.wikimedia.org/T218185) (owner: 10GTirloni) [12:24:16] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC makes sense: https://puppet-compiler.wmflabs.org/compiler1002/15249/" [puppet] - 10https://gerrit.wikimedia.org/r/497009 (https://phabricator.wikimedia.org/T218423) (owner: 10Alex Monk) [12:24:28] (03PS7) 10Arturo Borrero Gonzalez: Re-apply "openstack::clientpackages::common: include python3 packages" [puppet] - 10https://gerrit.wikimedia.org/r/497009 (https://phabricator.wikimedia.org/T218423) (owner: 10Alex Monk) [12:30:55] PROBLEM - 
puppet last run on cloudvirt1027 is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 3 minutes ago with 6 failures. Failed resources (up to 3 shown): Package[python3-novaclient],Package[python3-glanceclient],Package[python3-keystoneclient],Package[python3-openstackclient] [12:31:23] PROBLEM - puppet last run on cloudnet1003 is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 3 minutes ago with 6 failures. Failed resources (up to 3 shown): Package[python3-novaclient],Package[python3-glanceclient],Package[python3-keystoneclient],Package[python3-openstackclient] [12:31:37] PROBLEM - puppet last run on cloudvirt1026 is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 4 minutes ago with 6 failures. Failed resources (up to 3 shown): Package[python3-novaclient],Package[python3-glanceclient],Package[python3-keystoneclient],Package[python3-openstackclient] [12:32:17] PROBLEM - puppet last run on cloudvirt1028 is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 3 minutes ago with 6 failures. Failed resources (up to 3 shown): Package[python3-novaclient],Package[python3-glanceclient],Package[python3-keystoneclient],Package[python3-openstackclient] [12:32:33] known in other channel ^ [12:33:08] !log Pooling mw1290 back [12:33:09] jijiki: Failed to log message to wiki. Somebody should check the error logs. 
[12:33:42] (03PS1) 10Arturo Borrero Gonzalez: Revert "Re-apply "openstack::clientpackages::common: include python3 packages"" [puppet] - 10https://gerrit.wikimedia.org/r/498063 [12:34:42] (03PS1) 10Arturo Borrero Gonzalez: Revert "aptrepo: include python-cliff into jessie-wikimedia/openstack-mitaka-jessie" [puppet] - 10https://gerrit.wikimedia.org/r/498064 [12:34:47] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] Revert "Re-apply "openstack::clientpackages::common: include python3 packages"" [puppet] - 10https://gerrit.wikimedia.org/r/498063 (owner: 10Arturo Borrero Gonzalez) [12:34:49] PROBLEM - puppet last run on cloudvirt1019 is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 2 minutes ago with 6 failures. Failed resources (up to 3 shown): Package[python3-novaclient],Package[python3-glanceclient],Package[python3-keystoneclient],Package[python3-openstackclient] [12:35:24] !log Pooling mw1339 back [12:35:25] jijiki: Failed to log message to wiki. Somebody should check the error logs. [12:35:26] (03PS5) 10GTirloni: toollabs::k8s::worker - Allow prometheus to access read-only metrics port 10255 [puppet] - 10https://gerrit.wikimedia.org/r/486142 (https://phabricator.wikimedia.org/T214512) [12:35:53] PROBLEM - puppet last run on cloudvirt1012 is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 6 minutes ago with 6 failures. 
Failed resources (up to 3 shown): Package[python3-novaclient],Package[python3-glanceclient],Package[python3-keystoneclient],Package[python3-openstackclient] [12:35:57] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] Revert "aptrepo: include python-cliff into jessie-wikimedia/openstack-mitaka-jessie" [puppet] - 10https://gerrit.wikimedia.org/r/498064 (owner: 10Arturo Borrero Gonzalez) [12:36:05] (03PS2) 10Arturo Borrero Gonzalez: Revert "aptrepo: include python-cliff into jessie-wikimedia/openstack-mitaka-jessie" [puppet] - 10https://gerrit.wikimedia.org/r/498064 [12:36:33] (03CR) 10jerkins-bot: [V: 04-1] toollabs::k8s::worker - Allow prometheus to access read-only metrics port 10255 [puppet] - 10https://gerrit.wikimedia.org/r/486142 (https://phabricator.wikimedia.org/T214512) (owner: 10GTirloni) [12:36:59] PROBLEM - puppet last run on cloudnet2002-dev is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 4 minutes ago with 6 failures. Failed resources (up to 3 shown): Package[python3-novaclient],Package[python3-glanceclient],Package[python3-keystoneclient],Package[python3-openstackclient] [12:40:16] !log T216497 remove python-cliff from jessie-wikimedia/openstack-mitaka-jessie [12:40:17] arturo: Failed to log message to wiki. Somebody should check the error logs. 
[12:40:18] T216497: CloudVPS: workaround archival of jessie-backports repo - https://phabricator.wikimedia.org/T216497 [12:40:42] (03CR) 10Paladox: [C: 03+1] contint: change `/r/p/` to `/r/` for gerrit links [puppet] - 10https://gerrit.wikimedia.org/r/498057 (https://phabricator.wikimedia.org/T218844) (owner: 10MarcoAurelio) [12:41:50] (03PS6) 10GTirloni: toollabs::k8s::worker - Allow prometheus to access read-only metrics port 10255 [puppet] - 10https://gerrit.wikimedia.org/r/486142 (https://phabricator.wikimedia.org/T214512) [12:41:53] RECOVERY - puppet last run on cloudnet1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:42:58] (03CR) 10jerkins-bot: [V: 04-1] toollabs::k8s::worker - Allow prometheus to access read-only metrics port 10255 [puppet] - 10https://gerrit.wikimedia.org/r/486142 (https://phabricator.wikimedia.org/T214512) (owner: 10GTirloni) [12:43:50] (03CR) 10Hashar: "WMCS instances seem to have base::standard_packages included somehow. 
For CI we get curl installed via:" [puppet] - 10https://gerrit.wikimedia.org/r/498045 (https://phabricator.wikimedia.org/T213527) (owner: 10Muehlenhoff) [12:44:07] RECOVERY - puppet last run on restbase1014 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:44:53] (03Abandoned) 10GTirloni: toollabs::k8s::worker - Allow prometheus to access read-only metrics port 10255 [puppet] - 10https://gerrit.wikimedia.org/r/486142 (https://phabricator.wikimedia.org/T214512) (owner: 10GTirloni) [12:45:53] (03PS4) 10GTirloni: profile::base::labs - Ability to disable Puppet failure emails [puppet] - 10https://gerrit.wikimedia.org/r/495670 (https://phabricator.wikimedia.org/T218009) [12:47:21] 10Operations, 10media-storage: ms-be1043 - /dev/sdk disappeared - https://phabricator.wikimedia.org/T218875 (10Dzahn) [12:48:42] 10Operations, 10media-storage: ms-be1043 - /dev/sdk disappeared - https://phabricator.wikimedia.org/T218875 (10Dzahn) [12:50:16] ACKNOWLEDGEMENT - puppet last run on ms-be1043 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 18 minutes ago with 2 failures. 
Failed resources (up to 3 shown): Exec[parted-/dev/sdk] daniel_zahn https://phabricator.wikimedia.org/T218875 [12:51:43] (03CR) 10Gehel: [C: 04-1] elasticsearch: add profile for icinga checks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [12:53:34] 10Operations, 10media-storage: ms-be1043 - /dev/sdk disappeared - https://phabricator.wikimedia.org/T218875 (10MoritzMuehlenhoff) There's already https://phabricator.wikimedia.org/T218544 [12:54:32] (03PS1) 10ArielGlenn: generate a minimal config file for 'misc' dumps [puppet] - 10https://gerrit.wikimedia.org/r/498066 (https://phabricator.wikimedia.org/T205825) [12:54:51] (03CR) 10GTirloni: [C: 03+2] profile::base::labs - Ability to disable Puppet failure emails [puppet] - 10https://gerrit.wikimedia.org/r/495670 (https://phabricator.wikimedia.org/T218009) (owner: 10GTirloni) [12:56:15] (03CR) 10Alexandros Kosiaris: [C: 04-1] "This is just introducing the hiera key/values, we also need code that actually looks up these keys." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/497316 (owner: 10Ladsgroup) [12:57:15] RECOVERY - puppet last run on cloudvirt1027 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [12:57:40] (03CR) 10Alexandros Kosiaris: [C: 03+1] "Waiting for Filippo to weigh in about collecting all these extra metrics" [puppet] - 10https://gerrit.wikimedia.org/r/497848 (https://phabricator.wikimedia.org/T209108) (owner: 10Eevans) [12:57:59] RECOVERY - puppet last run on cloudvirt1026 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:00:04] zeljkof: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - European version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190321T1300). 
[13:01:44] (03Abandoned) 10GTirloni: openldap: Use newer slapd from stretch-backports [puppet] - 10https://gerrit.wikimedia.org/r/495861 (https://phabricator.wikimedia.org/T217280) (owner: 10GTirloni) [13:02:11] RECOVERY - puppet last run on cloudvirt1012 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [13:03:19] RECOVERY - puppet last run on cloudnet2002-dev is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:03:45] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/497848 (https://phabricator.wikimedia.org/T209108) (owner: 10Eevans) [13:03:53] RECOVERY - puppet last run on cloudvirt1028 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [13:03:57] (03PS1) 10Alexandros Kosiaris: Revert "Revert "Switchover oresrdb.svc.codfw.wmnet for kernel upgrades"" [dns] - 10https://gerrit.wikimedia.org/r/498067 [13:04:09] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Revert "Revert "Switchover oresrdb.svc.codfw.wmnet for kernel upgrades"" [dns] - 10https://gerrit.wikimedia.org/r/498067 (owner: 10Alexandros Kosiaris) [13:06:06] (03PS1) 10Muehlenhoff: Add Cumin alias for sessionstore hosts [puppet] - 10https://gerrit.wikimedia.org/r/498069 [13:06:07] !log installing Java security updates on stat hosts [13:06:08] moritzm: Failed to log message to wiki. Somebody should check the error logs. 
[13:06:25] RECOVERY - puppet last run on cloudvirt1019 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [13:06:45] (03PS9) 10GTirloni: openstack - Convert cron jobs to systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/490197 (https://phabricator.wikimedia.org/T210818) [13:08:18] (03CR) 10GTirloni: [C: 03+2] openstack - Convert cron jobs to systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/490197 (https://phabricator.wikimedia.org/T210818) (owner: 10GTirloni) [13:12:35] (03PS2) 10Muehlenhoff: Add Cumin alias for sessionstore hosts [puppet] - 10https://gerrit.wikimedia.org/r/498069 [13:14:20] (03PS4) 10Arturo Borrero Gonzalez: openstack: Follow-up I71678b27: Remove stray MariaDB reference [puppet] - 10https://gerrit.wikimedia.org/r/497445 (https://phabricator.wikimedia.org/T218009) (owner: 10Alex Monk) [13:14:25] (03CR) 10Muehlenhoff: [C: 03+2] Add Cumin alias for sessionstore hosts [puppet] - 10https://gerrit.wikimedia.org/r/498069 (owner: 10Muehlenhoff) [13:14:45] (03PS2) 10ArielGlenn: generate a minimal config file for 'misc' dumps [puppet] - 10https://gerrit.wikimedia.org/r/498066 (https://phabricator.wikimedia.org/T205825) [13:15:58] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: Follow-up I71678b27: Remove stray MariaDB reference [puppet] - 10https://gerrit.wikimedia.org/r/497445 (https://phabricator.wikimedia.org/T218009) (owner: 10Alex Monk) [13:17:00] (03PS5) 10Arturo Borrero Gonzalez: openstack: Follow-up I71678b27: Remove stray MariaDB reference [puppet] - 10https://gerrit.wikimedia.org/r/497445 (https://phabricator.wikimedia.org/T218009) (owner: 10Alex Monk) [13:18:00] !log downtimed cloudcontrol*, cloudservices*, labcontrol*, labweb* (T210818) [13:18:02] gtirloni: Failed to log message to wiki. Somebody should check the error logs. 
[13:18:02] T210818: Move admin cron jobs to systemd timers - https://phabricator.wikimedia.org/T210818 [13:19:23] (03PS1) 10Zfilipin: all wikis to 1.33.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498078 [13:19:25] (03CR) 10Zfilipin: [C: 03+2] all wikis to 1.33.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498078 (owner: 10Zfilipin) [13:19:35] (03PS1) 10Gehel: elasticsearch: upgrade to elastic 6.5.4 for cirrus / codfw [puppet] - 10https://gerrit.wikimedia.org/r/498079 (https://phabricator.wikimedia.org/T218878) [13:19:37] (03PS1) 10Gehel: elasticsearch: deploy elasticsearch config for ES6 cirrus / codfw [puppet] - 10https://gerrit.wikimedia.org/r/498080 (https://phabricator.wikimedia.org/T218878) [13:20:34] (03PS6) 10Arturo Borrero Gonzalez: wmcs: decommision several codfw servers [puppet] - 10https://gerrit.wikimedia.org/r/497293 (https://phabricator.wikimedia.org/T218023) [13:20:52] (03Merged) 10jenkins-bot: all wikis to 1.33.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498078 (owner: 10Zfilipin) [13:21:15] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "Since this patch is removing a lot of stuff, I'm waiting for Andrew to review it." [puppet] - 10https://gerrit.wikimedia.org/r/497293 (https://phabricator.wikimedia.org/T218023) (owner: 10Arturo Borrero Gonzalez) [13:22:16] 10Operations, 10Cassandra, 10Core Platform Team Kanban (Blocked Externally), 10Services (blocked), 10User-Eevans: puppetize turning off reserved space for cassandra /srv - https://phabricator.wikimedia.org/T132632 (10fgiunchedi) >>! In T132632#5023911, @Eevans wrote: >>>! In T132632#5023075, @mobrovac wr... [13:22:32] !log zfilipin@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.33.0-wmf.22 [13:23:09] (03CR) 10Alex Monk: "So wait why do we want to permit opt-out of these emails?" 
[puppet] - 10https://gerrit.wikimedia.org/r/495670 (https://phabricator.wikimedia.org/T218009) (owner: 10GTirloni) [13:24:29] (03PS1) 10Gehel: elasticsearch: upgrade to elastic 6.5.4 for cirrus / eqiad [puppet] - 10https://gerrit.wikimedia.org/r/498082 (https://phabricator.wikimedia.org/T218879) [13:24:31] (03PS1) 10Gehel: elasticsearch: deploy elasticsearch config for ES6 cirrus / eqiad [puppet] - 10https://gerrit.wikimedia.org/r/498083 (https://phabricator.wikimedia.org/T218879) [13:24:32] zfilipin@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [13:24:42] onimisionipe, dcausse: ^^^ [13:25:54] (03PS1) 10Esanders: VE section editing: Enable mobile AB test on remaining target wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498084 (https://phabricator.wikimedia.org/T218851) [13:26:12] (03CR) 10Mathew.onipe: [C: 03+1] elasticsearch: upgrade to elastic 6.5.4 for cirrus / eqiad [puppet] - 10https://gerrit.wikimedia.org/r/498082 (https://phabricator.wikimedia.org/T218879) (owner: 10Gehel) [13:26:14] (03CR) 10DCausse: [C: 03+1] elasticsearch: upgrade to elastic 6.5.4 for cirrus / eqiad [puppet] - 10https://gerrit.wikimedia.org/r/498082 (https://phabricator.wikimedia.org/T218879) (owner: 10Gehel) [13:26:22] (03CR) 10Mathew.onipe: [C: 03+1] elasticsearch: deploy elasticsearch config for ES6 cirrus / eqiad [puppet] - 10https://gerrit.wikimedia.org/r/498083 (https://phabricator.wikimedia.org/T218879) (owner: 10Gehel) [13:26:31] (03CR) 10DCausse: [C: 03+1] elasticsearch: deploy elasticsearch config for ES6 cirrus / eqiad [puppet] - 10https://gerrit.wikimedia.org/r/498083 (https://phabricator.wikimedia.org/T218879) (owner: 10Gehel) [13:27:33] PROBLEM - Apache HTTP on mw1233 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [13:27:45] PROBLEM - Nginx local proxy to apache on mw1233 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers [13:28:07] 10Operations, 10media-storage: ms-be1043 - /dev/sdk disappeared - https://phabricator.wikimedia.org/T218875 (10Dzahn) [13:28:10] 10Operations, 10ops-eqiad, 10Operations-Software-Development, 10monitoring, 10Patch-For-Review: ms-be1043 sdk failed - https://phabricator.wikimedia.org/T218544 (10Dzahn) [13:28:39] RECOVERY - Apache HTTP on mw1233 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.036 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:28:53] RECOVERY - Nginx local proxy to apache on mw1233 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.036 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:28:59] !log installing Java security updates on notebook hosts [13:28:59] moritzm: Failed to log message to wiki. Somebody should check the error logs. [13:30:56] (03CR) 10CDanis: [C: 03+1] Enable base::service_auto_restart for rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/496719 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [13:31:35] (03PS33) 10Mathew.onipe: elasticsearch: add profile for icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921) [13:31:37] (03CR) 10Gehel: [C: 03+2] elasticsearch: upgrade to elastic 6.5.4 for cirrus / eqiad [puppet] - 10https://gerrit.wikimedia.org/r/498082 (https://phabricator.wikimedia.org/T218879) (owner: 10Gehel) [13:32:36] (03PS1) 10GTirloni: openstack - Fix errors in timers definitions [puppet] - 10https://gerrit.wikimedia.org/r/498085 (https://phabricator.wikimedia.org/T210818) [13:34:15] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1090:3317" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498086 [13:35:25] (03CR) 10GTirloni: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/495670 (https://phabricator.wikimedia.org/T218009) (owner: 10GTirloni) [13:36:12] (03CR) 
10Mathew.onipe: "PCC output look Ok but someone else must check: https://puppet-compiler.wmflabs.org/compiler1002/15254/" [puppet] - 10https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [13:37:18] (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad.php: Depool db1090:3317" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498086 (owner: 10Marostegui) [13:37:20] (03PS2) 10GTirloni: openstack - Fix errors in timers definitions [puppet] - 10https://gerrit.wikimedia.org/r/498085 (https://phabricator.wikimedia.org/T210818) [13:37:47] !log upgrade openjdk-8 on an-worker1080 and restarted hadoop daemons [13:37:48] elukey: Failed to log message to wiki. Somebody should check the error logs. [13:38:23] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1090:3317" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498086 (owner: 10Marostegui) [13:38:28] (03CR) 10GTirloni: [C: 03+2] openstack - Fix errors in timers definitions [puppet] - 10https://gerrit.wikimedia.org/r/498085 (https://phabricator.wikimedia.org/T210818) (owner: 10GTirloni) [13:38:39] (03PS3) 10ArielGlenn: generate a minimal config file for 'misc' dumps [puppet] - 10https://gerrit.wikimedia.org/r/498066 (https://phabricator.wikimedia.org/T205825) [13:38:58] (03CR) 10Gehel: elasticsearch: add profile for icinga checks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/496782 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [13:39:25] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1090:3317 (duration: 00m 51s) [13:39:25] marostegui@deploy1001: Failed to log message to wiki. Somebody should check the error logs. 
[13:44:56] (03PS1) 10Dzahn: parsoid::testing: install PHP7 [puppet] - 10https://gerrit.wikimedia.org/r/498089 (https://phabricator.wikimedia.org/T213493) [13:47:32] (03CR) 10Andrew Bogott: "*bump*" [puppet] - 10https://gerrit.wikimedia.org/r/497069 (owner: 10Andrew Bogott) [13:48:11] 10Operations, 10wikitech.wikimedia.org, 10cloud-services-team (Kanban): labweb - wikitech_run_jobs.service in failed state - https://phabricator.wikimedia.org/T218881 (10GTirloni) p:05Triage→03Normal [13:48:39] !log reboot oresrdb2001 [13:48:39] akosiaris: Failed to log message to wiki. Somebody should check the error logs. [13:49:48] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/15258/" [puppet] - 10https://gerrit.wikimedia.org/r/498089 (https://phabricator.wikimedia.org/T213493) (owner: 10Dzahn) [13:49:53] PROBLEM - Check systemd state on labtestweb2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:50:21] ^^ checking [13:50:26] (03CR) 10CDanis: [C: 03+1] "LGTM with one question" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/497930 (https://phabricator.wikimedia.org/T218784) (owner: 10Cwhite) [13:50:55] RECOVERY - Check systemd state on labtestweb2001 is OK: OK - running: The system is fully operational [13:51:36] well, I didn't even manage to SSH in yet, glad it's back [13:53:04] (03CR) 10Dzahn: "Notice: /Stage[main]/Packages::Php/Package[php]/ensure: created" [puppet] - 10https://gerrit.wikimedia.org/r/498089 (https://phabricator.wikimedia.org/T213493) (owner: 10Dzahn) [13:54:18] !log disabling puppet in acme-chief clients - T218862 [13:54:20] vgutierrez: Failed to log message to wiki. Somebody should check the error logs. 
[13:54:21] T218862: acme-chief >0.13 generates wrong metadata for the endpoint used in file based deployment layout - https://phabricator.wikimedia.org/T218862 [13:55:48] 10Operations, 10Parsoid-PHP, 10Patch-For-Review: Install PHP7 on scandium - https://phabricator.wikimedia.org/T213493 (10Dzahn) >>! In T213493#5009396, @ssastry wrote: > Now that scandium is operational, can you install PHP7 on scandium? Sorry for the delay. PHP 7 has been installed with the change above.... [13:56:09] 10Operations, 10Parsoid-PHP, 10Patch-For-Review: Install PHP7 on scandium - https://phabricator.wikimedia.org/T213493 (10Dzahn) 05Open→03Resolved [13:56:23] (03PS1) 10Alexandros Kosiaris: Revert "Revert "Revert "Switchover oresrdb.svc.codfw.wmnet for kernel upgrades""" [dns] - 10https://gerrit.wikimedia.org/r/498091 [13:56:29] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Revert "Revert "Revert "Switchover oresrdb.svc.codfw.wmnet for kernel upgrades""" [dns] - 10https://gerrit.wikimedia.org/r/498091 (owner: 10Alexandros Kosiaris) [13:58:12] !log update acme-chief to version 0.15 in acmechief1001 - T218862 [13:58:14] vgutierrez: Failed to log message to wiki. Somebody should check the error logs. [13:59:27] PROBLEM - Check systemd state on labtestweb2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:00:38] (03CR) 10Vgutierrez: [C: 03+2] acme_chief: Update acme_chief::cert resource to fetch several cert versions [puppet] - 10https://gerrit.wikimedia.org/r/498059 (owner: 10Vgutierrez) [14:00:49] (03PS3) 10Vgutierrez: acme_chief: Update acme_chief::cert resource to fetch several cert versions [puppet] - 10https://gerrit.wikimedia.org/r/498059 [14:01:43] RECOVERY - Check systemd state on labtestweb2001 is OK: OK - running: The system is fully operational [14:02:27] ?? [14:02:32] why does this even page? 
[14:03:30] (03CR) 10Gehel: [C: 03+2] elasticsearch: deploy elasticsearch config for ES6 cirrus / eqiad [puppet] - 10https://gerrit.wikimedia.org/r/498083 (https://phabricator.wikimedia.org/T218879) (owner: 10Gehel) [14:03:39] (03PS2) 10Gehel: elasticsearch: deploy elasticsearch config for ES6 cirrus / eqiad [puppet] - 10https://gerrit.wikimedia.org/r/498083 (https://phabricator.wikimedia.org/T218879) [14:03:42] 10Operations, 10ops-codfw, 10serviceops, 10User-jijiki: mw2206.codfw.wmnet memory issues - https://phabricator.wikimedia.org/T215415 (10CDanis) The memory address is the same in all of these error reports. That suggests to me that one of the DIMMs has a 'stuck' bit and that it is unlikely to be a CPU issu... [14:03:55] arturo: because somewhere in Hiera there is "profile::base::notifications: critical" for openstack roles . most likely [14:04:13] it wouldnt for other prod hosts by default [14:04:28] 10Operations, 10wikitech.wikimedia.org, 10cloud-services-team (Kanban): labweb - wikitech_run_jobs.service in failed state - https://phabricator.wikimedia.org/T218881 (10GTirloni) Same happening on labtestweb2001 [14:04:46] somewhere under role/eqiad/wmcs/openstack/main/ [14:05:50] * arturo sigh [14:06:39] it has been actively turned on at some point afaict [14:06:53] for certain roles [14:07:10] (03PS11) 10KartikMistry: Cron to run script to purge old CX drafts [puppet] - 10https://gerrit.wikimedia.org/r/486454 (https://phabricator.wikimedia.org/T189091) [14:07:11] anyway this host is about to be decom [14:07:36] i would say schedule a downtime of 3 months or so [14:07:43] and it shouldnt happen anymore [14:07:49] !log gehel@cumin1001 START - Cookbook sre.elasticsearch.e6-upgrade [14:07:49] gehel@cumin1001: Failed to log message to wiki. Somebody should check the error logs. [14:08:54] 10Operations, 10Parsoid-PHP, 10Patch-For-Review: Install PHP7 on scandium - https://phabricator.wikimedia.org/T213493 (10ssastry) Oh 7.0? Isn't production on PHP 7.1 / 7.2? 
[14:08:58] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review, 10cloud-services-team (Kanban): decommmision: labtestweb2001.wikimedia.org - https://phabricator.wikimedia.org/T218024 (10aborrero) [14:09:04] FYI mutante T218024 [14:09:04] T218024: decommmision: labtestweb2001.wikimedia.org - https://phabricator.wikimedia.org/T218024 [14:09:25] !log T218024 disabled icinga checks for labtestweb2001 [14:09:27] arturo: Failed to log message to wiki. Somebody should check the error logs. [14:09:43] arturo: ACK. once the checkbox with role(spare::system) is done that should also stop the paging [14:10:16] ack [14:10:23] (03PS4) 10ArielGlenn: generate a minimal config file for 'misc' dumps [puppet] - 10https://gerrit.wikimedia.org/r/498066 (https://phabricator.wikimedia.org/T205825) [14:11:29] !log re-enabling puppet in acme-chief clients - T218862 [14:11:30] vgutierrez: Failed to log message to wiki. Somebody should check the error logs. [14:11:31] T218862: acme-chief >0.13 generates wrong metadata for the endpoint used in file based deployment layout - https://phabricator.wikimedia.org/T218862 [14:11:50] (03CR) 10ArielGlenn: [C: 03+2] generate a minimal config file for 'misc' dumps [puppet] - 10https://gerrit.wikimedia.org/r/498066 (https://phabricator.wikimedia.org/T205825) (owner: 10ArielGlenn) [14:14:19] 10Operations, 10Parsoid-PHP, 10Patch-For-Review: Install PHP7 on scandium - https://phabricator.wikimedia.org/T213493 (10ssastry) >>! In T213493#5043882, @ssastry wrote: > Oh 7.0? Isn't production on PHP 7.1 / 7.2? T216102#4954452 says production is on 7.2. What is involved in getting the same version on sc... 
[14:14:42] (03PS3) 10Jbond: Add prometheus interface to spicerack [software/spicerack] - 10https://gerrit.wikimedia.org/r/496496 [14:17:14] 10Operations, 10wikitech.wikimedia.org, 10cloud-services-team (Kanban): labweb - wikitech_run_jobs.service in failed state - https://phabricator.wikimedia.org/T218881 (10GTirloni) Additionally, labtestweb2001 is paging for this error while labweb1001 is not. Pages are not expected because the timer is not de... [14:17:57] 10Operations, 10Parsoid-PHP, 10Patch-For-Review: Install PHP7 on scandium - https://phabricator.wikimedia.org/T213493 (10Dzahn) @ssastry app servers like mwdebug have 7.0 and 7.2 but looking at a prod parsoid server like wtp1025 there is also 7.0 there. It should match wtp servers, right? [14:18:04] !log downtimed labtestweb2001 (T218881) [14:18:06] gtirloni: Failed to log message to wiki. Somebody should check the error logs. [14:18:07] T218881: labweb - wikitech_run_jobs.service in failed state - https://phabricator.wikimedia.org/T218881 [14:18:48] (03CR) 10jerkins-bot: [V: 04-1] Add prometheus interface to spicerack [software/spicerack] - 10https://gerrit.wikimedia.org/r/496496 (owner: 10Jbond) [14:20:06] 10Operations, 10Parsoid-PHP, 10Patch-For-Review: Install PHP7 on scandium - https://phabricator.wikimedia.org/T213493 (10Dzahn) ` (23) wtp[1025-1042,1044-1048].eqiad.wmnet ----- OUT... [14:23:45] (03CR) 10Muehlenhoff: "Looks good, two comments." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/497481 (owner: 10Jbond) [14:24:39] 10Operations, 10Parsoid-PHP, 10Patch-For-Review: Install PHP7 on scandium - https://phabricator.wikimedia.org/T213493 (10Dzahn) @Joe Shouldn't i match the version on the wtp servers? But that would be conflicting with T216102#4954452 . @ssastry I will prepare a patch to add 7.2 and add reviewers to clear it up. [14:26:43] (03CR) 10Andrew Bogott: [C: 03+1] "I'd double-check that there aren't any local databases on labtestweb2001; otherwise this seems fine." 
[puppet] - 10https://gerrit.wikimedia.org/r/497293 (https://phabricator.wikimedia.org/T218023) (owner: 10Arturo Borrero Gonzalez) [14:26:45] 10Operations, 10Parsoid-PHP, 10Patch-For-Review: Install PHP7 on scandium - https://phabricator.wikimedia.org/T213493 (10Dzahn) 05Resolved→03Open [14:27:25] (03CR) 10BBlack: [C: 04-1] Add lvs to the read-only ldap replicas (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/496858 (https://phabricator.wikimedia.org/T218133) (owner: 10Andrew Bogott) [14:31:47] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is CRITICAL: 1.634e+05 gt 1e+05 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [14:31:54] (03PS1) 10GTirloni: wikitech: Do not hide output from mwscript runjobs [puppet] - 10https://gerrit.wikimedia.org/r/498099 (https://phabricator.wikimedia.org/T218881) [14:31:56] hm! 
[14:32:59] Lag seems for cirrus search write only [14:33:01] ACKing that, ya [14:33:08] (03PS2) 10GTirloni: wikitech: Do not hide output from mwscript runjobs [puppet] - 10https://gerrit.wikimedia.org/r/498099 (https://phabricator.wikimedia.org/T218881) [14:33:57] ACKNOWLEDGEMENT - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is CRITICAL: 1.634e+05 gt 1e+05 ottomata Only affecting CirrusSearchElasticaWrite, and is due to a jump in volume for that topic https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [14:34:08] (03CR) 10GTirloni: [C: 03+2] wikitech: Do not hide output from mwscript runjobs [puppet] - 10https://gerrit.wikimedia.org/r/498099 (https://phabricator.wikimedia.org/T218881) (owner: 10GTirloni) [14:34:26] (03CR) 10Andrew Bogott: Add lvs to the read-only ldap replicas (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/496858 (https://phabricator.wikimedia.org/T218133) (owner: 10Andrew Bogott) [14:35:14] (03PS6) 10Andrew Bogott: Add lvs to the read-only ldap replicas [puppet] - 10https://gerrit.wikimedia.org/r/496858 (https://phabricator.wikimedia.org/T218133) [14:35:19] !log restarging jenkins on releases* after Java update [14:35:19] moritzm: Failed to log message to wiki. Somebody should check the error logs. [14:41:49] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:42:31] (03PS1) 10Dzahn: parsoid::testing: install PHP 7.2 [puppet] - 10https://gerrit.wikimedia.org/r/498104 (https://phabricator.wikimedia.org/T213493) [14:43:05] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:43:25] (CR) jerkins-bot: [V: -1] parsoid::testing: install PHP 7.2 [puppet] - https://gerrit.wikimedia.org/r/498104 (https://phabricator.wikimedia.org/T213493) (owner: Dzahn) [14:43:47] (PS1) GTirloni: wikitech: Fix error in timer specification (every 1s -> 1min) [puppet] - https://gerrit.wikimedia.org/r/498105 (https://phabricator.wikimedia.org/T218881) [14:44:08] zeljkof: I guess the partial revert can unblock the train while proper investigation is done to check all scenarios? [14:44:12] Is that good enough? [14:44:25] (CR) GTirloni: [C: +2] wikitech: Fix error in timer specification (every 1s -> 1min) [puppet] - https://gerrit.wikimedia.org/r/498105 (https://phabricator.wikimedia.org/T218881) (owner: GTirloni) [14:45:09] xSavitar: yes. please fix the error, get it merged into master, backported into wmf.22 and deployed in one of the swat windows today or on Monday [14:45:33] (PS2) Dzahn: parsoid::testing: install PHP 7.2 [puppet] - https://gerrit.wikimedia.org/r/498104 (https://phabricator.wikimedia.org/T213493) [14:46:00] Okay, I've uploaded a patch, waiting for Jenkins to signal green then maybe you can merge it.
Don't want to self merge [14:46:20] zeljkof: Once it's merged, I can create a backport patch to .22 [14:46:20] (CR) jerkins-bot: [V: -1] parsoid::testing: install PHP 7.2 [puppet] - https://gerrit.wikimedia.org/r/498104 (https://phabricator.wikimedia.org/T213493) (owner: Dzahn) [14:46:31] Operations, ops-codfw, DBA: Degraded RAID on db2052 - https://phabricator.wikimedia.org/T218776 (Papaul) a:Papaul→Marostegui Disk replacement complete [14:46:32] and schedule a SWAT for today [14:47:00] Operations, ops-codfw, DBA: Degraded RAID on db2052 - https://phabricator.wikimedia.org/T218776 (Marostegui) Thanks ` physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 600 GB, Rebuilding) ` [14:47:06] xSavitar: I'm not familiar with that code, you'll have to find somebody else to review and merge it [14:47:33] (PS3) Dzahn: parsoid::testing: install PHP 7.2 [puppet] - https://gerrit.wikimedia.org/r/498104 (https://phabricator.wikimedia.org/T213493) [14:47:36] Not sure if Fomafix is around, no idea of timezones related to this user :( [14:48:50] xSavitar: it's more important to get it fixed correctly than to rush and introduce more trouble : [14:48:52] :) [14:49:11] (PS1) GTirloni: wikitech: Add missing zero in time specification [puppet] - https://gerrit.wikimedia.org/r/498107 (https://phabricator.wikimedia.org/T218881) [14:49:27] You're right! Actually, the fix I uploaded was previous code (revert) [14:49:52] (CR) GTirloni: [C: +2] wikitech: Add missing zero in time specification [puppet] - https://gerrit.wikimedia.org/r/498107 (https://phabricator.wikimedia.org/T218881) (owner: GTirloni) [14:50:00] I'll have to sit on it later and find out all possible scenarios. So this fix is just to unblock the train, that's all [14:50:16] Other things can happen later, that is why I asked if the partial revert is good enough in this case zeljkof [14:51:12] I don't really know that code.
Whatever gets rid of the error message and does not introduce new problems is good with me. :) [14:52:03] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:52:23] (CR) Thiemo Kreuz (WMDE): [C: -1] "Uh, I'm sorry? The links are all broken ("Not Found")." [puppet] - https://gerrit.wikimedia.org/r/498057 (https://phabricator.wikimedia.org/T218844) (owner: MarcoAurelio) [14:53:17] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:53:19] zeljkof: Nice then! Then the patch I submitted should be fine. [14:53:29] Now, I just need to find someone to give a helping hand in review [14:53:35] xSavitar: please do [14:53:40] I can't really help you there [14:53:56] (PS1) Vgutierrez: librenms: Switch to the directory based deployment used by acme-chief [puppet] - https://gerrit.wikimedia.org/r/498109 (https://phabricator.wikimedia.org/T207295) [14:54:06] (PS7) Andrew Bogott: Add lvs to the read-only ldap replicas [puppet] - https://gerrit.wikimedia.org/r/496858 (https://phabricator.wikimedia.org/T218133) [14:55:42] (PS1) Muehlenhoff: Enable base::service_auto_restart for rsync/releases [puppet] - https://gerrit.wikimedia.org/r/498110 (https://phabricator.wikimedia.org/T135991) [14:56:03] xSavitar: which patch ?
[14:56:31] thedj: Thanks, this one: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/498098 [14:56:33] (CR) Dzahn: [C: +1] "try them with git clone instead of in browser though" [puppet] - https://gerrit.wikimedia.org/r/498057 (https://phabricator.wikimedia.org/T218844) (owner: MarcoAurelio) [14:56:54] (CR) jerkins-bot: [V: -1] Enable base::service_auto_restart for rsync/releases [puppet] - https://gerrit.wikimedia.org/r/498110 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff) [14:57:04] xSavitar: checking [14:57:09] Thanks a lot! [14:58:13] (CR) Dzahn: [C: +1] "cc: @Paladox ^" [puppet] - https://gerrit.wikimedia.org/r/498057 (https://phabricator.wikimedia.org/T218844) (owner: MarcoAurelio) [14:58:43] (PS2) Muehlenhoff: Enable base::service_auto_restart for rsync/releases [puppet] - https://gerrit.wikimedia.org/r/498110 (https://phabricator.wikimedia.org/T135991) [14:58:45] Operations, wikitech.wikimedia.org, Patch-For-Review, cloud-services-team (Kanban): labweb - wikitech_run_jobs.service in failed state - https://phabricator.wikimedia.org/T218881 (GTirloni) Due to an error in the timer specification, mwscript was running every second instead of every minute. That... [14:59:01] (CR) Paladox: [C: +1] "> cc: @Paladox ^" [puppet] - https://gerrit.wikimedia.org/r/498057 (https://phabricator.wikimedia.org/T218844) (owner: MarcoAurelio) [14:59:03] Operations, wikitech.wikimedia.org, Patch-For-Review, cloud-services-team (Kanban): labweb - wikitech_run_jobs.service in failed state - https://phabricator.wikimedia.org/T218881 (GTirloni) Open→Resolved [14:59:04] xSavitar: got it, understand it, seems safe and sane. [14:59:21] thedj: Okay, want to land it?
[14:59:46] (03CR) 10jerkins-bot: [V: 04-1] Enable base::service_auto_restart for rsync/releases [puppet] - 10https://gerrit.wikimedia.org/r/498110 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [15:00:01] Awesome, thank you thedj :) [15:00:06] \0/ [15:00:21] RECOVERY - Device not healthy -SMART- on db2052 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2052&var-datasource=codfw+prometheus/ops [15:03:09] (03PS8) 10Andrew Bogott: Add lvs to the read-only ldap replicas [puppet] - 10https://gerrit.wikimedia.org/r/496858 (https://phabricator.wikimedia.org/T218133) [15:03:42] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/498109 (https://phabricator.wikimedia.org/T207295) (owner: 10Vgutierrez) [15:04:20] (03CR) 10Thiemo Kreuz (WMDE): [C: 04-1] "The old links work both with "git clone" as well as in the browser. Why replace them with partly broken ones?" [puppet] - 10https://gerrit.wikimedia.org/r/498057 (https://phabricator.wikimedia.org/T218844) (owner: 10MarcoAurelio) [15:04:43] (03PS3) 10Muehlenhoff: Enable base::service_auto_restart for rsync/releases [puppet] - 10https://gerrit.wikimedia.org/r/498110 (https://phabricator.wikimedia.org/T135991) [15:06:38] jouncebot: next [15:06:38] In 0 hour(s) and 53 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190321T1600) [15:06:46] jouncebot: now [15:06:46] No deployments scheduled for the next 0 hour(s) and 53 minute(s) [15:06:55] (03PS9) 10Andrew Bogott: Add lvs to the read-only ldap replicas [puppet] - 10https://gerrit.wikimedia.org/r/496858 (https://phabricator.wikimedia.org/T218133) [15:07:12] Amir1: did you say you are not here for swat for https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/WikimediaMessages/+/498097/ ? 
[15:07:36] very likely not [15:07:37] Amir1: i believe I also will not be :/ [15:07:53] not sure if we could convince whoever is running swat to just do it without us? :D [15:08:47] PROBLEM - PHD should be running on phab1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args php ./phd-daemon, UID = 498 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [15:09:06] zeljkof, thedj, how can one verify if such error still shows up after deploying via SWAT? [15:09:15] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [15:09:20] I don't have access to logstash so I'm not sure if I can verify [15:10:08] (03PS4) 10Muehlenhoff: Enable base::service_auto_restart for rsync/releases [puppet] - 10https://gerrit.wikimedia.org/r/498110 (https://phabricator.wikimedia.org/T135991) [15:10:18] xSavitar: /me neither [15:10:22] xSavitar: You mean https://phabricator.wikimedia.org/T218883? [15:10:29] Daimona: Yes [15:10:35] * Daimona is checking [15:10:36] which reminds me that i should get my account approved for logstash [15:10:44] thedj: :) [15:10:48] Daimona: Thanks for checking [15:10:55] since i have the NDA for it already [15:11:07] When was it deployed? [15:11:07] thedj: What is the procedure? 
I'll like to apply [15:11:25] Daimona: Once the patch submitted gets merged to master, I'll make a backport to .22 [15:11:31] And then schedule that for deploy [15:11:42] thedj: yes you should :) [15:12:10] Daimona: Today most likely [15:12:23] That is why I'll need someone to be there to verify that the error disappears [15:12:25] It will continue to show up as long as the bad code is inproduction [15:12:44] But yeah, I can check later if no-one's around [15:12:47] xSavitar: anybody with kibana access can check the logs https://wikitech.wikimedia.org/wiki/Logstash [15:12:49] Daimona: Yeah, thedj has CR+2'd this: https://gerrit.wikimedia.org/r/#/c/498098/ [15:13:09] Daimona: Okay, I'll poke you then [15:13:23] Sure, feel free to do so :) [15:14:15] Daimona: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/498116 (just created) [15:14:19] Maybe you can watch that? [15:14:34] Yup, done [15:14:53] zeljkof: Not sure if I have that [15:16:33] jouncebot: next [15:16:34] In 0 hour(s) and 43 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190321T1600) [15:16:56] Daimona: Not sure if that is the right one, right? 
Let me check WT Deployments calendar [15:17:34] Should be the one at 19:00 [15:17:44] Well actually 18:00 UTC [15:18:10] I guess, unless this bug is breaking lots of things and it needs to be backported now [15:19:10] (03CR) 10Bstorm: [C: 03+1] "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/494516 (https://phabricator.wikimedia.org/T208843) (owner: 10BryanDavis) [15:19:20] (03PS1) 10Ottomata: eventgate-analytics - allow for extra app config in values [deployment-charts] - 10https://gerrit.wikimedia.org/r/498118 [15:19:24] (03PS1) 10Papaul: DNS: Add mgmt and production DNS for dbprov200[1-2] [dns] - 10https://gerrit.wikimedia.org/r/498119 [15:19:42] Daimona: No, it's not breaking too many things [15:19:44] (03CR) 10jerkins-bot: [V: 04-1] DNS: Add mgmt and production DNS for dbprov200[1-2] [dns] - 10https://gerrit.wikimedia.org/r/498119 (owner: 10Papaul) [15:19:52] So I'll schedule for 18:00 UTC [15:20:07] i'm actually wondering if it didn't exist before as well.. [15:20:32] !log rebooting flerovium/furud for kernel updates [15:20:41] moritzm: Failed to log message to wiki. Somebody should check the error logs. [15:20:42] i suspected it was just getting caught in the api before and now no longer, but i havent' found the spot where that would be. [15:25:53] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 22 processes with UID = 498 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [15:26:12] 10Operations, 10hardware-requests: eqiad: requesting dual cpu misc host for icinga1001 replacement - https://phabricator.wikimedia.org/T215837 (10RobH) 05Open→03Declined We've returned the original hardware (post multiple repairs) back to service, so this is no longer needed. Thank you for the approval th... [15:26:20] (03CR) 10Ottomata: [V: 03+2 C: 03+2] eventgate-analytics - allow for extra app config in values [deployment-charts] - 10https://gerrit.wikimedia.org/r/498118 (owner: 10Ottomata) [15:26:24] Daimona: Looks good? 
https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190321T1800 [15:26:45] RECOVERY - PHD should be running on phab1001 is OK: PROCS OK: 1 process with regex args php ./phd-daemon, UID = 498 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [15:27:05] I think so, although I don't guarantee I'll be around at the time :/ [15:27:36] 10Operations, 10WMF-NDA-Requests: Volunteer NDA for Alex Monk - https://phabricator.wikimedia.org/T218448 (10bd808) Removing #sre-access-requests tag. Consensus from off task inquiries is that L2 is all that is needed at this time. [15:28:49] Hmmm..... [15:29:01] (03PS3) 10Eevans: prometheus: collect session storage Cassandra metrics [puppet] - 10https://gerrit.wikimedia.org/r/497848 (https://phabricator.wikimedia.org/T209108) [15:36:47] 10Operations, 10WMF-NDA-Requests: Volunteer NDA for Alex Monk - https://phabricator.wikimedia.org/T218448 (10Krenair) Ok, I've signed L2. [15:44:17] 10Operations, 10Wikidata, 10Wikidata-Termbox-Hike, 10serviceops, and 4 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (10Addshore) >>! In T212189#5020187, @akosiaris wrote: > > Thanks for the understanding. We are drafting next quarter goals this week, I 'll... [15:44:29] (03PS1) 10Muehlenhoff: Remove support for trusty/Ubuntu in kernel/sysctl settings [puppet] - 10https://gerrit.wikimedia.org/r/498123 [15:47:09] 10Operations, 10WMF-NDA-Requests: Volunteer NDA for Alex Monk - https://phabricator.wikimedia.org/T218448 (10jcrespo) May I ask to clarify what "Cloud-wide root" means? Maybe it is clear for everyone, but not to me. For example, would someone with those privileges have root access to wikireplicas (which is on... 
[15:47:12] (03PS1) 10GTirloni: role::labs::instance - Do not instantiate diamond if diamond::remove is true [puppet] - 10https://gerrit.wikimedia.org/r/498124 (https://phabricator.wikimedia.org/T218365) [15:47:55] 10Operations, 10Wikidata, 10Wikidata-Termbox-Hike, 10serviceops, and 4 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (10akosiaris) >>! In T212189#5044451, @Addshore wrote: >>>! In T212189#5020187, @akosiaris wrote: >> >> Thanks for the understanding. We are... [15:48:31] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is OK: (C)1e+05 gt (W)1e+04 gt 0 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [15:50:24] (03PS1) 10Muehlenhoff: Remove support for Ubuntu/trusty in base packages [puppet] - 10https://gerrit.wikimedia.org/r/498126 [15:51:23] (03CR) 10CRusnov: "> Patch Set 9:" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/496836 (https://phabricator.wikimedia.org/T212526) (owner: 10CRusnov) [15:52:28] (03CR) 10Vgutierrez: [C: 03+2] librenms: Switch to the directory based deployment used by acme-chief [puppet] - 10https://gerrit.wikimedia.org/r/498109 (https://phabricator.wikimedia.org/T207295) (owner: 10Vgutierrez) [15:52:37] (03PS2) 10Vgutierrez: librenms: Switch to the directory based deployment used by acme-chief [puppet] - 10https://gerrit.wikimedia.org/r/498109 (https://phabricator.wikimedia.org/T207295) [15:53:01] jynus: Too many tildles! 
(you need exactly two ~~ to strike something out in phab) [15:53:21] ~~:-/ [15:53:58] bawolff_: thanks [15:54:32] My biggest qualm with phabricator is it doesn't use wikisyntax and I'm constantly writing the wrong formatting stuff [15:55:00] I can hancle mw and restructured [15:55:03] *handle [15:55:13] if there was a single version of the second [15:56:01] I spent way too much time 10 years ago getting "advanced" knowledge of wiki syntax. Once was enough, don't need to do that again ;) [15:56:59] (03PS1) 10Muehlenhoff: Remove support for Ubuntu/trusty in monitoring/metrics base classes [puppet] - 10https://gerrit.wikimedia.org/r/498130 [15:58:02] bawolff_: at some point one just uses html, which technically is valid mw [15:58:24] Definitely any time a table is involved [15:58:43] +1 [15:58:53] wiki table syntax is the worst. Html table syntax is logical and straightforward [15:58:59] nah, wiki is shorter if table is simple [15:59:09] If simple [15:59:20] the problem is if you want many fancy stuff [15:59:24] But if you need colspan or even format a single cell then wiki sucks [15:59:30] (03PS1) 10Marostegui: db-codfw.php: Depool db2096 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498133 (https://phabricator.wikimedia.org/T218336) [15:59:32] yeah [16:00:04] godog and _joe_: I, the Bot under the Fountain, allow thee, The Deployer, to do Puppet SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190321T1600). [16:00:04] No GERRIT patches in the queue for this window AFAICS. 
[16:00:34] (03PS1) 10Muehlenhoff: Remove support for Ubuntu in apt/debmonitor base classes [puppet] - 10https://gerrit.wikimedia.org/r/498134 [16:00:35] It took me like 3 yrars or more not to just switch to html whenever I needed to write a table [16:00:36] 10Operations, 10WMF-NDA-Requests: Volunteer NDA for Alex Monk - https://phabricator.wikimedia.org/T218448 (10Krenair) I'm not familiar with wikireplicas on VMs (I was under the impression that the DB replicas were on physical hardware without any virtualisation, but maybe that changed or you're referring to so... [16:00:44] (03CR) 10Marostegui: [C: 03+2] db-codfw.php: Depool db2096 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498133 (https://phabricator.wikimedia.org/T218336) (owner: 10Marostegui) [16:00:51] And that is while wiki markup is supposed to be easier [16:01:02] !log Poweroff db2096 for onsite maintenance T218336 [16:01:04] marostegui: Failed to log message to wiki. Somebody should check the error logs. [16:01:05] T218336: rack/setup/deploy dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T218336 [16:01:11] (03PS1) 10Elukey: mapred-site.xml: test [puppet/cdh] - 10https://gerrit.wikimedia.org/r/498135 [16:01:45] (03CR) 10Volans: "Nice, virtual +1" [puppet] - 10https://gerrit.wikimedia.org/r/498134 (owner: 10Muehlenhoff) [16:01:58] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2096 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498133 (https://phabricator.wikimedia.org/T218336) (owner: 10Marostegui) [16:03:07] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Depool db2096 for onsite maintenance (duration: 00m 50s) [16:03:07] marostegui@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [16:04:44] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2052 - https://phabricator.wikimedia.org/T218776 (10Marostegui) 05Open→03Resolved This is now good! 
` logicaldrive 1 (3.3 TB, RAID 1+0, OK) physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, OK) physicaldrive 1I:1:2 (port 1I:box... [16:05:20] 10Operations, 10DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (10Marostegui) [16:06:18] (03PS2) 10Elukey: mapred-site.xml: test [puppet/cdh] - 10https://gerrit.wikimedia.org/r/498135 [16:11:24] (03CR) 10Bstorm: [C: 03+1] role::labs::instance - Do not instantiate diamond if diamond::remove is true (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/498124 (https://phabricator.wikimedia.org/T218365) (owner: 10GTirloni) [16:11:32] (03CR) 10Ottomata: [C: 03+1] "Sure!" [puppet/cdh] - 10https://gerrit.wikimedia.org/r/498135 (owner: 10Elukey) [16:11:53] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/498110 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [16:12:12] (03PS1) 10Ladsgroup: Add wikimaniawiki to another special group in Wikibase client [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498140 (https://phabricator.wikimedia.org/T217730) [16:12:34] (03CR) 10Herron: "since the bots aren't able to update the associated security task I'll mention here that an approach related to this is in place as descri" [puppet] - 10https://gerrit.wikimedia.org/r/497930 (https://phabricator.wikimedia.org/T218784) (owner: 10Cwhite) [16:13:25] (03PS1) 10GTirloni: profile::base::labs - Convert cronjobs to systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/498141 (https://phabricator.wikimedia.org/T210818) [16:14:21] When I attempt to add a new email address to my gerrit account, I get a "Server error: invalid token" message [16:14:40] (03CR) 10Jbond: [V: 03+2] Add option to filter out services which don't actually need a restart [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/497496 (owner: 10Jbond) [16:14:44] Is it a http 500 error? 
[16:14:46] (03CR) 10GTirloni: role::labs::instance - Do not instantiate diamond if diamond::remove is true (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/498124 (https://phabricator.wikimedia.org/T218365) (owner: 10GTirloni) [16:14:47] The verification email gets sent to the new email but when I try verifying, it says the token is invalid and doesn't add the email [16:14:48] (03CR) 10jerkins-bot: [V: 04-1] profile::base::labs - Convert cronjobs to systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/498141 (https://phabricator.wikimedia.org/T210818) (owner: 10GTirloni) [16:14:52] (03CR) 10Jbond: [V: 03+2 C: 03+2] Add option to filter out services which don't actually need a restart [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/497496 (owner: 10Jbond) [16:14:56] oh [16:15:08] other users have reported that [16:15:09] paladox: No, it's just a little flash at the bottom left [16:15:21] Ah, so emails can't be added? [16:15:29] well they can [16:15:31] use GWTUI [16:15:40] Okay, let me try that [16:15:45] https://bugs.chromium.org/p/gerrit/issues/detail?id=10489 [16:15:54] and https://bugs.chromium.org/p/gerrit/issues/detail?id=10117 [16:16:03] and https://bugs.chromium.org/p/gerrit/issues/detail?id=10062 [16:16:47] there's 8 bugs filled about this [16:17:55] (03PS2) 10GTirloni: profile::base::labs - Convert cronjobs to systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/498141 (https://phabricator.wikimedia.org/T210818) [16:18:43] (03CR) 10jerkins-bot: [V: 04-1] profile::base::labs - Convert cronjobs to systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/498141 (https://phabricator.wikimedia.org/T210818) (owner: 10GTirloni) [16:19:00] 10Operations, 10monitoring, 10Patch-For-Review: Serve >= 50% of production Prometheus systems with Prometheus v2 - https://phabricator.wikimedia.org/T187987 (10fgiunchedi) Status update: all instances but `ops` on `prometheus1003` have finished migrating. 
This time I changed the strategy to actually backfill... [16:19:37] paladox: Okay [16:19:42] oh #%# i need to go through another NDA signing to fix my ldap ? [16:19:43] GWTUI worked, thanks a lot [16:20:13] i'll look at fixing that later [16:20:24] i'll happily stay ignorant in that case I think. [16:20:29] paladox: So if one has 2 emails tied to the same gerrit account, emails get sent only to the email marked as preferred email right? [16:20:31] paladox: Thanks [16:20:46] (03PS3) 10GTirloni: profile::base::labs - Convert cronjobs to systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/498141 (https://phabricator.wikimedia.org/T210818) [16:20:54] Im not exactly sure, but yes i think so [16:21:24] (03CR) 10jerkins-bot: [V: 04-1] profile::base::labs - Convert cronjobs to systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/498141 (https://phabricator.wikimedia.org/T210818) (owner: 10GTirloni) [16:21:33] (03CR) 10Dzahn: [C: 03+2] Enable base::service_auto_restart for rsync/releases [puppet] - 10https://gerrit.wikimedia.org/r/498110 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [16:21:45] (03PS5) 10Dzahn: Enable base::service_auto_restart for rsync/releases [puppet] - 10https://gerrit.wikimedia.org/r/498110 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [16:22:21] (03PS1) 10Jforrester: SDC: Enable Depicts functionality on TestCommons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498144 (https://phabricator.wikimedia.org/T218913) [16:22:27] (03PS5) 10Arturo Borrero Gonzalez: toolforge: remove obsolete mailrelay manifests [puppet] - 10https://gerrit.wikimedia.org/r/494516 (https://phabricator.wikimedia.org/T208843) (owner: 10BryanDavis) [16:22:52] (03PS1) 10Jforrester: SDC: Enable Depicts functionality on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498145 (https://phabricator.wikimedia.org/T218913) [16:22:54] (03PS4) 10GTirloni: profile::base::labs - Convert cronjobs to systemd timers 
[puppet] - 10https://gerrit.wikimedia.org/r/498141 (https://phabricator.wikimedia.org/T210818) [16:23:57] (03CR) 10Jforrester: [C: 04-2] "Not yet. :-)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498145 (https://phabricator.wikimedia.org/T218913) (owner: 10Jforrester) [16:24:40] OK to do a (test)prod config change? PuppetSWAT looks quiet/over? [16:24:40] (03CR) 10GTirloni: [C: 03+1] toolforge: remove obsolete mailrelay manifests [puppet] - 10https://gerrit.wikimedia.org/r/494516 (https://phabricator.wikimedia.org/T208843) (owner: 10BryanDavis) [16:25:59] (03PS6) 10Arturo Borrero Gonzalez: toolforge: remove obsolete mailrelay manifests [puppet] - 10https://gerrit.wikimedia.org/r/494516 (https://phabricator.wikimedia.org/T208843) (owner: 10BryanDavis) [16:26:51] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: remove obsolete mailrelay manifests [puppet] - 10https://gerrit.wikimedia.org/r/494516 (https://phabricator.wikimedia.org/T208843) (owner: 10BryanDavis) [16:29:03] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.force-shard-allocation [16:29:03] gehel@cumin2001: Failed to log message to wiki. Somebody should check the error logs. [16:29:08] !log gehel@cumin2001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0) [16:29:08] gehel@cumin2001: Failed to log message to wiki. Somebody should check the error logs. [16:31:58] (03PS9) 10Jbond: debdeploy: add config to filter out services [puppet] - 10https://gerrit.wikimedia.org/r/497481 [16:32:37] (03CR) 10Jbond: debdeploy: add config to filter out services (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/497481 (owner: 10Jbond) [16:33:36] (03CR) 10Dzahn: "no issues on releses1001/2001" [puppet] - 10https://gerrit.wikimedia.org/r/498110 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [16:34:45] 10Operations, 10WMF-NDA-Requests: Volunteer NDA for Alex Monk - https://phabricator.wikimedia.org/T218448 (10bd808) >>! 
In T218448#5044466, @jcrespo wrote: > May I ask to clarify what "Cloud-wide root" means? Maybe it is clear for everyone, but not to me. For example, would someone with those privileges have r... [16:35:36] (03CR) 10Jforrester: [C: 03+2] SDC: Enable Depicts functionality on TestCommons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498144 (https://phabricator.wikimedia.org/T218913) (owner: 10Jforrester) [16:36:53] (03Merged) 10jenkins-bot: SDC: Enable Depicts functionality on TestCommons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498144 (https://phabricator.wikimedia.org/T218913) (owner: 10Jforrester) [16:36:57] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review, 10cloud-services-team (Kanban): decommmision: labtestweb2001.wikimedia.org - https://phabricator.wikimedia.org/T218024 (10aborrero) [16:38:02] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.force-shard-allocation [16:38:03] gehel@cumin2001: Failed to log message to wiki. Somebody should check the error logs. [16:38:08] !log gehel@cumin2001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0) [16:38:09] gehel@cumin2001: Failed to log message to wiki. Somebody should check the error logs. [16:38:14] !log gehel@cumin1001 END (ERROR) - Cookbook sre.elasticsearch.e6-upgrade (exit_code=97) [16:38:15] gehel@cumin1001: Failed to log message to wiki. Somebody should check the error logs. [16:38:42] !log gehel@cumin1001 START - Cookbook sre.elasticsearch.e6-upgrade [16:38:42] gehel@cumin1001: Failed to log message to wiki. Somebody should check the error logs. [16:38:44] !log gehel@cumin1001 END (ERROR) - Cookbook sre.elasticsearch.e6-upgrade (exit_code=97) [16:38:44] gehel@cumin1001: Failed to log message to wiki. Somebody should check the error logs. 
[16:39:50] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review, 10cloud-services-team (Kanban): decommmision: labtestweb2001.wikimedia.org - https://phabricator.wikimedia.org/T218024 (10aborrero) I just scanned for databases in this host: ` aborrero@labtestweb2001:~ 16s $ sudo mysql -u root [...] MariaDB... [16:39:53] !log gehel@cumin1001 START - Cookbook sre.elasticsearch.e6-upgrade [16:39:53] gehel@cumin1001: Failed to log message to wiki. Somebody should check the error logs. [16:45:35] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/deploy dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T218336 (10jcrespo) [16:45:55] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/deploy codfw dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T218336 (10jcrespo) [16:46:32] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/deploy codfw dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T218336 (10jcrespo) Allow me to edit the title to not confuse it with the same task that will be filed for eqiad :-D [16:47:20] Amir1: Something happened to simplewiki [16:47:32] Someone made the interface non-simple... [16:47:33] What's up? [16:47:34] It just changed [16:47:36] Idk what happened [16:47:38] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review, 10cloud-services-team (Kanban): decommmision: labtestweb2001.wikimedia.org - https://phabricator.wikimedia.org/T218024 (10aborrero) 05Open→03Stalled [16:47:39] Like...today. [16:47:50] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review, 10cloud-services-team (Kanban): decommmision: labtestweb2001.wikimedia.org - https://phabricator.wikimedia.org/T218024 (10aborrero) p:05Triage→03Normal [16:47:57] Not sure who to contact [16:48:21] Do you have phabricator account? 
[16:48:28] o/ [16:48:30] (03PS2) 10Jforrester: SDC: Enable Depicts functionality on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498145 (https://phabricator.wikimedia.org/T218913) [16:48:33] (03PS1) 10Jforrester: SDC: Enable Depicts property on Test Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498150 [16:48:46] Amir1: Yep. [16:48:48] (03CR) 10Jforrester: [C: 03+2] SDC: Enable Depicts property on Test Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498150 (owner: 10Jforrester) [16:48:53] It seems to have defaulted to the normal en one. [16:48:58] I have to go afk in a minute [16:49:02] BRPever: can you file a ticket? [16:49:21] Well not all of it changed... [16:49:40] (03CR) 10jerkins-bot: [V: 04-1] SDC: Enable Depicts property on Test Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498150 (owner: 10Jforrester) [16:49:46] (03CR) 10jerkins-bot: [V: 04-1] SDC: Enable Depicts property on Test Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498150 (owner: 10Jforrester) [16:49:48] (03CR) 10jerkins-bot: [V: 04-1] SDC: Enable Depicts functionality on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498145 (https://phabricator.wikimedia.org/T218913) (owner: 10Jforrester) [16:50:17] James_F: RIP [16:50:38] Eurgh, phpcs failure. [16:50:52] Vermont: BRPever If you explain the problem in more depth, I probably can help [16:50:52] Vermont: I am still not sure how many things changed... I saw couple [16:50:55] marktraceur: If only running composer locally didn't put the config directory in an undeployable state. :-( [16:50:59] with examples [16:51:04] Tch, details [16:51:11] (03PS7) 10Arturo Borrero Gonzalez: wmcs: decommision several codfw servers [puppet] - 10https://gerrit.wikimedia.org/r/497293 (https://phabricator.wikimedia.org/T218023) [16:51:25] Eh...changes became edits [16:51:29] In “My edits” [16:51:53] and in RC too... 
[16:51:53] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/deploy codfw dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T218336 (10Papaul) [16:51:57] Yep. [16:52:10] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmcs: decommision several codfw servers [puppet] - 10https://gerrit.wikimedia.org/r/497293 (https://phabricator.wikimedia.org/T218023) (owner: 10Arturo Borrero Gonzalez) [16:52:23] (03PS2) 10Jforrester: SDC: Enable Depicts property on Test Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498150 [16:52:25] (03PS3) 10Jforrester: SDC: Enable Depicts functionality on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498145 (https://phabricator.wikimedia.org/T218913) [16:52:42] (03CR) 10Jforrester: [C: 03+2] SDC: Enable Depicts property on Test Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498150 (owner: 10Jforrester) [16:52:46] Vermont: I think Random article was "Show any article" [16:53:00] Yes, it was that too [16:53:10] Here are two changes https://usercontent.irccloud-cdn.com/file/gyWC2T1F/IMG_1485.PNG [16:53:20] Both said “changes”, not “edits” [16:53:49] (03Merged) 10jenkins-bot: SDC: Enable Depicts property on Test Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498150 (owner: 10Jforrester) [16:55:26] no idea what happened https://usercontent.irccloud-cdn.com/file/rdO8aOwR/IMG_1486.PNG [16:57:57] https://simple.wikipedia.org/w/index.php?title=MediaWiki:Mycontris&diff=2032169&oldid=1994648 [16:59:27] (03PS4) 10Jforrester: SDC: Enable Depicts functionality on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498145 (https://phabricator.wikimedia.org/T218913) [16:59:29] (03PS1) 10Jforrester: SDC: Set remote entity search URI default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498152 [16:59:38] but that happened 9 years ago. 
nvm [16:59:46] 10Operations, 10Horizon, 10Traffic, 10Upstream, 10cloud-services-team (Kanban): Horizon Designate dashboard not allowing creation of NS records - https://phabricator.wikimedia.org/T204013 (10GTirloni) [17:00:04] cscott, arlolra, subbu, halfak, and Amir1: My dear minions, it's time we take the moon! Just kidding. Time for Services – Graphoid / Parsoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190321T1700). [17:00:18] (03PS1) 10ArielGlenn: move various 'misc' ('other') dumps to using separate config file [puppet] - 10https://gerrit.wikimedia.org/r/498153 (https://phabricator.wikimedia.org/T205825) [17:00:49] (03CR) 10Jforrester: [C: 03+2] SDC: Set remote entity search URI default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498152 (owner: 10Jforrester) [17:01:01] Amir1: any idea what happened? [17:01:13] A bunch of the interface isn’t simple anymore :( [17:01:22] * Vermont declares the end of the world [17:01:33] Vermont: This will be fixed soon. I make a phabricator ticket [17:01:34] (joking obviously) [17:01:37] k thanks :) [17:02:01] (03Merged) 10jenkins-bot: SDC: Set remote entity search URI default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498152 (owner: 10Jforrester) [17:02:05] RECOVERY - puppet last run on cloudnet2001-dev is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:02:41] BRPever: Just got an email. Someone seems to be finally dealing with the phab ticket saying we’re not in the new interwiki links. [17:02:49] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.force-shard-allocation [17:02:49] gehel@cumin2001: Failed to log message to wiki. Somebody should check the error logs. [17:02:55] !log gehel@cumin2001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0) [17:02:55] gehel@cumin2001: Failed to log message to wiki. Somebody should check the error logs. 
[17:03:00] !log gehel@cumin1001 END (ERROR) - Cookbook sre.elasticsearch.e6-upgrade (exit_code=97) [17:03:00] gehel@cumin1001: Failed to log message to wiki. Somebody should check the error logs. [17:03:01] !log gehel@cumin1001 START - Cookbook sre.elasticsearch.e6-upgrade [17:03:02] gehel@cumin1001: Failed to log message to wiki. Somebody should check the error logs. [17:03:41] 10Operations, 10Puppet, 10Cloud-Services, 10Patch-For-Review, 10cloud-services-team (Kanban): Puppet tab in Horizon unusably slow - https://phabricator.wikimedia.org/T149589 (10GTirloni) [17:05:55] PROBLEM - nova-compute proc minimum on labtestvirt2002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:06:28] Vermont: one idea is to edit https://simple.wikipedia.org/wiki/MediaWiki:Mycontris to see if it causes a refresh in the l10n cache [17:06:41] (03CR) 10Cwhite: [C: 03+1] "I think this is a good idea." [puppet] - 10https://gerrit.wikimedia.org/r/496719 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [17:07:40] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SDC: Enable Depicts on TestCommons, with related config (duration: 00m 50s) [17:07:41] jforrester@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [17:08:13] Amir1: Is there a reason it’s Mycontris and not Mycontribs? [17:09:01] Okay, that fixed it [17:09:19] No longer says “My edits” for me; says changes. [17:09:24] Although I didn’t actually save an edit. [17:09:50] Historical reasons I guess. [17:10:10] you can find the values by using uselang=qqx (like https://simple.wikipedia.org/wiki/Special:RecentChanges?hidebots=1&hidecategorization=1&hideWikibase=1&namespace=8&limit=50&days=30&urlversion=2&uselang=qqx) [17:10:22] Erm... [17:10:34] Logged out, it says contributions. 
https://usercontent.irccloud-cdn.com/file/vMUrVjFC/IMG_1487.PNG [17:10:49] (it wasn’t that before) [17:11:25] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review, 10cloud-services-team (Kanban): decommission: labtestvirt200[12].codfw.wmnet - https://phabricator.wikimedia.org/T218023 (10aborrero) a:05aborrero→03RobH [17:11:33] (03PS2) 10ArielGlenn: move various 'misc' ('other') dumps to using separate config file [puppet] - 10https://gerrit.wikimedia.org/r/498153 (https://phabricator.wikimedia.org/T205825) [17:11:47] PROBLEM - nova-compute proc maximum on labtestvirt2002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:11:48] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review, 10cloud-services-team (Kanban): decommission: labtestvirt200[12].codfw.wmnet - https://phabricator.wikimedia.org/T218023 (10aborrero) [17:11:53] !log gehel@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.e6-upgrade (exit_code=99) [17:11:53] gehel@cumin1001: Failed to log message to wiki. Somebody should check the error logs. 
[17:11:54] jouncebot: now [17:11:54] For the next 0 hour(s) and 48 minute(s): Services – Graphoid / Parsoid / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190321T1700) [17:11:56] jouncebot: next [17:11:56] In 0 hour(s) and 48 minute(s): Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190321T1800) [17:12:14] (03PS5) 10Jforrester: SDC: Enable Depicts functionality on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498145 (https://phabricator.wikimedia.org/T218913) [17:12:15] (03PS1) 10Jforrester: SDC: Add test-commons.wikimedia.org to wgCrossSiteAJAXdomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498155 [17:12:19] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review, 10cloud-services-team (Kanban): decommission: cloudnet2001-dev.codfw.wmnet - https://phabricator.wikimedia.org/T218025 (10aborrero) a:05aborrero→03RobH [17:12:30] (03CR) 10ArielGlenn: [C: 03+2] move various 'misc' ('other') dumps to using separate config file [puppet] - 10https://gerrit.wikimedia.org/r/498153 (https://phabricator.wikimedia.org/T205825) (owner: 10ArielGlenn) [17:12:32] Reedy: You need anything? [17:12:35] Amir1: Is there a possibility my creation of that page changed it? [17:13:00] James_F|Away: Why are you showing as |away for me on tab complete? :P [17:13:05] But weren't when you replied [17:13:10] everything is back to normal I guess [17:13:15] It might caused a broken cache flush, it should not happen anyway [17:13:20] James_F|Away: Want to deploy tgr's patch to hopefully unbreak OAuth [17:13:35] 10Operations, 10WMF-NDA-Requests: Volunteer NDA for Alex Monk - https://phabricator.wikimedia.org/T218448 (10Aklapper) I can confirm that @Krenair has signed L2. [17:13:48] I'm not |away, I'm here. [17:14:01] My bouncer is maybe lagging [17:14:07] Reedy: i think its your client mine shows just James_F [17:14:16] Reedy: Fixed now? 
[17:14:30] Nope [17:14:35] James_F|Away: lol [17:14:42] I just disconnected from my bouncer and back again [17:14:47] I've not been |Away since early March. [17:14:56] Now it is [17:15:05] Reedy: Anyway, please deploy away. [17:15:13] (03PS4) 10Jbond: Add prometheus interface to spicerack [software/spicerack] - 10https://gerrit.wikimedia.org/r/496496 [17:15:15] 10Operations, 10Cloud-VPS, 10Toolforge, 10LDAP, and 2 others: LDAP server running out of memory frequently and disrupting Cloud VPS clients - https://phabricator.wikimedia.org/T217280 (10GTirloni) [17:15:29] Reedy: Also, you OK to sign-off https://gerrit.wikimedia.org/r/498155 ? :-) [17:15:37] Amir1: seems to be fixed, thanks :D [17:15:41] 10Operations, 10cloud-services-team (Kanban): Sporadic puppet failures - https://phabricator.wikimedia.org/T201247 (10GTirloni) Have we seen this recently? [17:15:58] (03CR) 10Reedy: [C: 03+1] "It's a wiki!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498155 (owner: 10Jforrester) [17:16:19] Reedy: Ta. :-) [17:16:35] Reedy: I'll just sling that out now then. [17:16:39] Fine by me [17:16:46] It'll take jerkins 10-20 mins anyway [17:16:46] (03CR) 10Jforrester: [C: 03+2] SDC: Add test-commons.wikimedia.org to wgCrossSiteAJAXdomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498155 (owner: 10Jforrester) [17:17:00] Indeed. :-) [17:17:11] So I can deploy from a train [17:17:12] Much excite [17:18:01] Have you not learned your lesson, Reedy ? [17:18:09] Trains are fine [17:18:10] Reedy: Is it a proper ICE or something terrible from British National Network Naming System Rail? [17:18:10] Boats are not [17:18:14] James_F: SJ [17:18:47] What type of locomotive is mw train? 
[17:19:01] (03Merged) 10jenkins-bot: SDC: Add test-commons.wikimedia.org to wgCrossSiteAJAXdomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498155 (owner: 10Jforrester) [17:19:03] a scap train [17:19:24] (03PS1) 10ArielGlenn: switch over wikidata entity dumps to use the misc dump config file [puppet] - 10https://gerrit.wikimedia.org/r/498159 (https://phabricator.wikimedia.org/T205825) [17:19:35] Zppix: Something like https://en.wikipedia.org/wiki/Brio_(company) [17:19:47] marktraceur: I saw someone at the airport looking for you [17:19:58] They were holding up a Holmquist sign [17:20:00] (03CR) 10jerkins-bot: [V: 04-1] Add prometheus interface to spicerack [software/spicerack] - 10https://gerrit.wikimedia.org/r/496496 (owner: 10Jbond) [17:20:09] Reedy: We're Everywhere. [17:20:44] (03CR) 10ArielGlenn: [C: 04-1] "Do not merge until after this week's wd entity dumps complete (likely Friday)." [puppet] - 10https://gerrit.wikimedia.org/r/498159 (https://phabricator.wikimedia.org/T205825) (owner: 10ArielGlenn) [17:21:29] (03Abandoned) 10ArielGlenn: dumps: set up a minimal config file for 'other' dumps [puppet] - 10https://gerrit.wikimedia.org/r/463711 (https://phabricator.wikimedia.org/T205825) (owner: 10ArielGlenn) [17:23:37] (03PS10) 10Andrew Bogott: Add lvs to the read-only ldap replicas [puppet] - 10https://gerrit.wikimedia.org/r/496858 (https://phabricator.wikimedia.org/T218133) [17:27:20] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: SDC: Add test-commons.wikimedia.org to wgCrossSiteAJAXdomains (duration: 00m 49s) [17:27:22] jforrester@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [17:27:46] (03PS5) 10Jbond: Add prometheus interface to spicerack [software/spicerack] - 10https://gerrit.wikimedia.org/r/496496 [17:28:17] !log gehel@cumin1001 START - Cookbook sre.elasticsearch.e6-upgrade [17:28:17] gehel@cumin1001: Failed to log message to wiki. Somebody should check the error logs. 
[17:30:09] Back in a few, will deploy my patch before SWAT [17:30:19] (03PS1) 10ArielGlenn: start wikidata entity dumps on the 1st and 20th of each month [puppet] - 10https://gerrit.wikimedia.org/r/498164 (https://phabricator.wikimedia.org/T216160) [17:31:08] (03CR) 10jerkins-bot: [V: 04-1] start wikidata entity dumps on the 1st and 20th of each month [puppet] - 10https://gerrit.wikimedia.org/r/498164 (https://phabricator.wikimedia.org/T216160) (owner: 10ArielGlenn) [17:32:36] (03PS6) 10Jforrester: SDC: Enable Depicts functionality on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498145 (https://phabricator.wikimedia.org/T218913) [17:32:37] (03PS1) 10Jforrester: SDC: Enable EntitySourceBasedFederation on TestCommons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498165 [17:33:26] (03CR) 10ArielGlenn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/498164 (https://phabricator.wikimedia.org/T216160) (owner: 10ArielGlenn) [17:34:02] (03PS1) 10Sbisson: Disable Welcome survey on viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498166 [17:34:04] (03CR) 10jerkins-bot: [V: 04-1] start wikidata entity dumps on the 1st and 20th of each month [puppet] - 10https://gerrit.wikimedia.org/r/498164 (https://phabricator.wikimedia.org/T216160) (owner: 10ArielGlenn) [17:34:51] (03PS1) 10DannyS712: Change `/r/p/` to `/r/` for gerrit links [puppet] - 10https://gerrit.wikimedia.org/r/498167 [17:35:25] (03PS2) 10ArielGlenn: start wikidata entity dumps on the 1st and 20th of each month [puppet] - 10https://gerrit.wikimedia.org/r/498164 (https://phabricator.wikimedia.org/T216160) [17:35:38] (03CR) 10Jbond: "Ready for a review i think" [software/spicerack] - 10https://gerrit.wikimedia.org/r/496496 (owner: 10Jbond) [17:35:44] !log disabled puppet on lvs1002 + lvs1005 for new service rollout [17:35:44] bblack: Failed to log message to wiki. Somebody should check the error logs. 
[17:36:04] (03CR) 10Jforrester: [C: 03+2] SDC: Enable EntitySourceBasedFederation on TestCommons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498165 (owner: 10Jforrester) [17:36:10] (03CR) 10Andrew Bogott: [C: 03+2] Add lvs to the read-only ldap replicas (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/496858 (https://phabricator.wikimedia.org/T218133) (owner: 10Andrew Bogott) [17:36:18] (03PS1) 10Jforrester: [DNM] SDC: Point TestCommons at TestWikidata, not real Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498168 [17:36:22] (03PS11) 10Andrew Bogott: Add lvs to the read-only ldap replicas [puppet] - 10https://gerrit.wikimedia.org/r/496858 (https://phabricator.wikimedia.org/T218133) [17:37:16] (03Merged) 10jenkins-bot: SDC: Enable EntitySourceBasedFederation on TestCommons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498165 (owner: 10Jforrester) [17:37:19] (03PS2) 10DannyS712: Change '/r/p/' to '/r/' for gerrit links [puppet] - 10https://gerrit.wikimedia.org/r/498167 [17:37:26] !log gehel@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.e6-upgrade (exit_code=99) [17:37:26] gehel@cumin1001: Failed to log message to wiki. Somebody should check the error logs. [17:37:36] (03CR) 10ArielGlenn: [C: 04-1] "Do not merge until the end of the month" [puppet] - 10https://gerrit.wikimedia.org/r/498164 (https://phabricator.wikimedia.org/T216160) (owner: 10ArielGlenn) [17:37:57] PROBLEM - puppet last run on ms-be1047 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:38:16] (03PS2) 10Sbisson: Disable Welcome survey on viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498166 (https://phabricator.wikimedia.org/T218920) [17:39:07] (03PS2) 10Jcrespo: mariadb-snapshots: Allow the option to only postprocess snapshots [puppet] - 10https://gerrit.wikimedia.org/r/498029 (https://phabricator.wikimedia.org/T210292) [17:42:14] (03PS3) 10Catrope: Disable Welcome survey on viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498166 (https://phabricator.wikimedia.org/T218920) (owner: 10Sbisson) [17:42:21] (03CR) 10Catrope: [C: 03+1] Disable Welcome survey on viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498166 (https://phabricator.wikimedia.org/T218920) (owner: 10Sbisson) [17:42:55] (03PS2) 10Jcrespo: mariadb-backups: Make sure retention is handled correctly [puppet] - 10https://gerrit.wikimedia.org/r/498024 (https://phabricator.wikimedia.org/T210292) [17:43:06] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SDC: Enable EntitySourceBasedFederation on TestCommons (duration: 00m 50s) [17:43:07] jforrester@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [17:43:08] (03PS3) 10Jcrespo: mariadb-backups: Make sure retention is handled correctly [puppet] - 10https://gerrit.wikimedia.org/r/498024 (https://phabricator.wikimedia.org/T210292) [17:43:20] (03PS3) 10Jcrespo: mariadb-snapshots: Allow the option to only postprocess snapshots [puppet] - 10https://gerrit.wikimedia.org/r/498029 (https://phabricator.wikimedia.org/T210292) [17:43:40] !log restarting pybal on lvs1005 [17:43:40] bblack: Failed to log message to wiki. Somebody should check the error logs. 
[17:46:15] PROBLEM - PyBal connections to etcd on lvs1005 is CRITICAL: CRITICAL: 10 connections established with conf1004.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [17:48:02] Operations, cloud-services-team, Upstream: New anti-stackclash (4.9.25-1~bpo8+3 ) kernel super bad for NFS - https://phabricator.wikimedia.org/T169290 (Bstorm) [17:49:31] PROBLEM - PyBal IPVS diff check on lvs1005 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([208.80.154.252:636, 208.80.154.252:389]) https://wikitech.wikimedia.org/wiki/PyBal [17:49:47] !log reedy@deploy1001 Synchronized php-1.33.0-wmf.22/includes/user/User.php: Iab24923c613d6aeed4b574f587fc4cee8f33077c (duration: 00m 51s) [17:49:58] * Reedy pokes stashbot [17:50:03] * Reedy throws things at stashbot [17:50:10] * Reedy kicks stashbot [17:50:20] bd808: Want to try rebooting it? :) [17:50:44] :) [17:50:51] Not supposed to break it reedy [17:50:57] It's what I'm here for [17:51:05] reedy@deploy1001: Failed to log message to wiki. Somebody should check the error logs.
[17:51:08] Reedy: fix it and you get wikilove [17:51:31] RECOVERY - PyBal connections to etcd on lvs1005 is OK: OK: 12 connections established with conf1004.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [17:51:55] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:53:58] (PS1) CDanis: phd: restart on failures [puppet] - https://gerrit.wikimedia.org/r/498170 [17:54:47] RECOVERY - PyBal IPVS diff check on lvs1005 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [17:54:48] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [17:54:50] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [17:54:50] !log otto@deploy1001 scap-helm eventgate-analytics finished [17:54:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:19] !log restarting pybal on lvs1002 [17:55:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:39] !log everything back to normal for lvs1002/lvs1005 (high-traffic2 @ eqiad) [17:56:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:48] It logs?! [17:56:56] Aye [17:57:40] bd808: sweeeeet [17:58:03] magic!
[17:59:20] (03PS3) 10Bmansurov: Disable reader trust survey v2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494552 (https://phabricator.wikimedia.org/T217576) [17:59:31] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: That opportune time is upon us again. Time for a Morning SWAT (Max 6 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190321T1800). [18:00:05] bmansurov, bmansurov, Amir1, xSavitar, Daimona, and stephanebisson: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:07] here [18:00:14] hi [18:00:25] o/ [18:00:49] (03PS4) 10Bmansurov: Enable logging for CitationUsage and CitationUsagePageLoad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496857 (https://phabricator.wikimedia.org/T213969) [18:01:14] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [18:01:15] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [18:01:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:15] !log otto@deploy1001 scap-helm eventgate-analytics finished [18:01:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:19] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api 
[18:01:53] (PS1) Herron: modsec directive @ipMatchFromFile broken for apache 2.4 [puppet] - https://gerrit.wikimedia.org/r/498171 [18:03:02] If no one else wants to do it, I can SWAT [18:03:49] (CR) Sbisson: [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/494552 (https://phabricator.wikimedia.org/T217576) (owner: Bmansurov) [18:04:09] RECOVERY - puppet last run on ms-be1047 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:04:52] (Merged) jenkins-bot: Disable reader trust survey v2 [mediawiki-config] - https://gerrit.wikimedia.org/r/494552 (https://phabricator.wikimedia.org/T217576) (owner: Bmansurov) [18:05:06] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [18:05:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:09] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [18:05:09] !log otto@deploy1001 scap-helm eventgate-analytics finished [18:05:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:03] bmansurov: Your first change is on mwdebug1002. Can you test?
[18:06:10] stephanebisson: ok, testing [18:07:06] (03PS2) 10Herron: modsec directive @ipMatchFromFile broken for apache 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/498171 [18:07:08] (03PS5) 10Sbisson: Enable logging for CitationUsage and CitationUsagePageLoad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496857 (https://phabricator.wikimedia.org/T213969) (owner: 10Bmansurov) [18:08:14] mwdebu1002 is unusually slow [18:08:29] !log gehel@cumin1001 START - Cookbook sre.elasticsearch.e6-upgrade [18:08:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:39] PROBLEM - puppet last run on lvs3001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:09:55] There's some timeouts in the log. [18:10:06] stephanebisson: looks good, please deploy everywhere [18:10:26] (03CR) 10Sbisson: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496857 (https://phabricator.wikimedia.org/T213969) (owner: 10Bmansurov) [18:11:02] !log sbisson@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:494552|Disable reader trust survey v2]] (duration: 00m 50s) [18:11:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:49] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[18:12:55] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [18:12:56] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [18:12:56] !log otto@deploy1001 scap-helm eventgate-analytics finished [18:12:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:29] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.force-shard-allocation [18:13:30] (03Merged) 10jenkins-bot: Enable logging for CitationUsage and CitationUsagePageLoad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496857 (https://phabricator.wikimedia.org/T213969) (owner: 10Bmansurov) [18:13:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:58] (03CR) 10Cwhite: "Great start! Comments inline." (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/496496 (owner: 10Jbond) [18:14:26] bmansurov: Your *second* change is on mwdebug1002. Can you test? [18:14:33] Amir1: are you around? 
[18:14:36] stephanebisson: ok testing [18:15:28] stephanebisson: looks good, please continue [18:16:43] !log sbisson@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:496857|Enable logging for CitationUsage and CitationUsagePageLoad]] (duration: 00m 49s) [18:16:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:50] stephanebisson: no, I removed mine a while back, jouncebot didn't update it [18:16:57] Sorry [18:17:19] !log gehel@cumin2001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0) [18:17:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:53] Amir1: I still see it on the page after a refresh, maybe you updated the wrong SWAT window ? ;) [18:18:00] stephanebisson: thanks for deploying my changes! [18:19:05] xSavitar: are you around? [18:19:17] Yes I am :) [18:19:36] xSavitar: just checking.. your change is next [18:19:45] Okay, I'll keep an eye [18:21:19] (03PS4) 10Sbisson: Disable Welcome survey on viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498166 (https://phabricator.wikimedia.org/T218920) [18:21:29] (03CR) 10Sbisson: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498166 (https://phabricator.wikimedia.org/T218920) (owner: 10Sbisson) [18:23:07] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [18:23:08] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [18:23:08] !log otto@deploy1001 scap-helm eventgate-analytics finished [18:23:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:47] (03Merged) 10jenkins-bot: Disable Welcome survey on 
viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498166 (https://phabricator.wikimedia.org/T218920) (owner: 10Sbisson) [18:25:15] !log sbisson@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:498166|Disable Welcome survey on viwiki]] (duration: 00m 49s) [18:25:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:55] stephanebisson: The issue with this change is it doesn't have a visible way to test it [18:26:08] Someone needs to log in to logstash and see if the errors have disappeared [18:26:27] Maybe you could help me with ideas on how this one works, Daimona was supposed to be around to help me check that out [18:26:41] (03PS1) 10GTirloni: openstack::glance::image_sync - Fix systemd timer user [puppet] - 10https://gerrit.wikimedia.org/r/498193 (https://phabricator.wikimedia.org/T210818) [18:26:59] I'm around [18:27:05] Sorry for being late to the party [18:28:17] (03CR) 10GTirloni: [C: 03+2] openstack::glance::image_sync - Fix systemd timer user [puppet] - 10https://gerrit.wikimedia.org/r/498193 (https://phabricator.wikimedia.org/T210818) (owner: 10GTirloni) [18:28:30] xSavitar, Daimona: is it OK if I sync the change and let you two monitor logstash? [18:28:31] * Daimona is checking logstash, it'll take some time to see if it goes away [18:28:37] Yup [18:28:44] Yes [18:29:21] This SWAT window has another 30 minutes to it so if anything is not as expected, ping me and I'll revert the change or deploy another patch. [18:29:41] ^ [18:30:24] !log gehel@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.e6-upgrade (exit_code=99) [18:30:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:31] hey xSavitar I see you fixed up the log on wikitech with all the back entries (https://wikitech.wikimedia.org/wiki/Server_Admin_Log), thanks so much! it was on my todo list and now it's not [18:33:51] apergos: Great! :) [18:34:03] Invoking Reedy here! 
[18:34:15] RECOVERY - puppet last run on lvs3001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:34:51] well you did the work so thanks again! [18:35:47] +1, thanks xSavitar. [18:36:02] Awesome James, you're welcome sir! [18:36:25] apergos: you're welcome <3 [18:36:38] !log sbisson@deploy1001 Synchronized php-1.33.0-wmf.22/languages/Language.php: SWAT: [[gerrit:498116|languages: Partial revert of I8287118cf8ec01326ead9]] (duration: 00m 50s) [18:36:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:45] xSavitar, Daimona: The change is deployed. [18:37:24] Daimona has access to LS, I don't yet ATM so he can monitor [18:37:30] In fact, I think he's monitoring already [18:38:18] (03PS1) 10GTirloni: openstack::keystone::cleanup - Do not hide `keystone-manage token_flush` output [puppet] - 10https://gerrit.wikimedia.org/r/498199 (https://phabricator.wikimedia.org/T210818) [18:39:54] (03CR) 10GTirloni: [C: 03+2] openstack::keystone::cleanup - Do not hide `keystone-manage token_flush` output [puppet] - 10https://gerrit.wikimedia.org/r/498199 (https://phabricator.wikimedia.org/T210818) (owner: 10GTirloni) [18:41:25] Yes [18:41:32] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [18:41:33] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [18:41:33] !log otto@deploy1001 scap-helm eventgate-analytics finished [18:41:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:03] (03PS4) 10Volans: PuppetDB backend: allow to override URL scheme in config [software/cumin] - 10https://gerrit.wikimedia.org/r/497309 
(https://phabricator.wikimedia.org/T218441) (owner: 10TheAnarcat) [18:44:03] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational [18:44:07] I'm not seeing it anymore [18:44:17] Last one was 8 minutes ago [18:48:28] Daimona: Okay! Nice! [18:48:38] 10Operations, 10Cloud-Services, 10cloud-services-team (Kanban): labnet/ labtestnet2001 - disk space - nova-api.log needs rotation - https://phabricator.wikimedia.org/T153279 (10GTirloni) [18:48:48] 10Operations, 10Cloud-Services, 10cloud-services-team (Kanban): labnet/ labtestnet2001 - disk space - nova-api.log needs rotation - https://phabricator.wikimedia.org/T153279 (10GTirloni) p:05Triage→03Normal [18:49:05] !log resetting archived settings on elasticsearch cirrus eqiad - T218879 [18:49:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:08] T218879: Upgrade to elasticsearch 6.5.4 for cirrus / eqiad - https://phabricator.wikimedia.org/T218879 [18:51:53] 10Operations, 10Cloud-VPS, 10Patch-For-Review, 10cloud-services-team (Kanban): Move labs-recursors in WMCS - https://phabricator.wikimedia.org/T207533 (10GTirloni) [18:52:20] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me!" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/497481 (owner: 10Jbond) [18:54:38] (03PS6) 10Jbond: Add prometheus interface to spicerack [software/spicerack] - 10https://gerrit.wikimedia.org/r/496496 [18:55:22] 10Puppet, 10Cloud-Services, 10cloud-services-team (Kanban): Migrate references from $instance.eqiad.wmflabs to $instance.$project.eqiad.wmflabs - https://phabricator.wikimedia.org/T153608 (10GTirloni) [18:55:59] 10Operations, 10cloud-services-team: Create a jessie netboot image with the 4.9 Linux kernel - https://phabricator.wikimedia.org/T206761 (10GTirloni) New cloudvirts are being installed with Stretch. 
[18:56:13] 10Operations, 10cloud-services-team (Kanban): Create a jessie netboot image with the 4.9 Linux kernel - https://phabricator.wikimedia.org/T206761 (10GTirloni) 05Open→03Resolved [18:57:00] (03CR) 10Volans: [C: 03+2] "Thanks a lot for the contribution, looks good to me." [software/cumin] - 10https://gerrit.wikimedia.org/r/497309 (https://phabricator.wikimedia.org/T218441) (owner: 10TheAnarcat) [18:57:15] 10Operations, 10cloud-services-team (Kanban): Netbox: Usage guidelines for WMCS - https://phabricator.wikimedia.org/T208576 (10GTirloni) 05Open→03Declined [18:57:37] xSavitar confirming that the error is gone [18:57:48] \o/ [18:57:55] Thanks a lot Daimona for assisting me on this one! [18:57:59] Much appreciated [18:58:22] np ;) [19:00:04] Deploy window MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190321T1900) [19:00:58] (03PS1) 10Muehlenhoff: Fix pinning for smartmontools [puppet] - 10https://gerrit.wikimedia.org/r/498205 (https://phabricator.wikimedia.org/T216711) [19:02:09] (03CR) 10Volans: "Couple of nitpicks/whishlist inline" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/497563 (https://phabricator.wikimedia.org/T212526) (owner: 10CRusnov) [19:10:45] (03Merged) 10jenkins-bot: PuppetDB backend: allow to override URL scheme in config [software/cumin] - 10https://gerrit.wikimedia.org/r/497309 (https://phabricator.wikimedia.org/T218441) (owner: 10TheAnarcat) [19:14:47] (03PS7) 10Jbond: Add prometheus interface to spicerack [software/spicerack] - 10https://gerrit.wikimedia.org/r/496496 [19:16:59] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 58.18, 37.57, 24.98 [19:18:24] (03CR) 10Muehlenhoff: [C: 03+1] role::labs::instance - Do not instantiate diamond if diamond::remove is true [puppet] - 10https://gerrit.wikimedia.org/r/498124 (https://phabricator.wikimedia.org/T218365) (owner: 10GTirloni) [19:20:14] (03CR) 10Jbond: Add prometheus interface to 
spicerack (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/496496 (owner: 10Jbond) [19:23:21] RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 12.93, 23.02, 22.98 [19:28:17] (03PS10) 10Jbond: debdeploy: add config to filter out services [puppet] - 10https://gerrit.wikimedia.org/r/497481 [19:28:42] I've got some more MW config to push. Stopping fatals in production, go me. [19:28:52] (03PS5) 10Jforrester: Added a setting to define Wikibase entity types that have no RDF output [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490586 (https://phabricator.wikimedia.org/T213483) (owner: 10WMDE-leszek) [19:28:57] (03CR) 10Jforrester: [C: 03+2] Added a setting to define Wikibase entity types that have no RDF output [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490586 (https://phabricator.wikimedia.org/T213483) (owner: 10WMDE-leszek) [19:29:50] (03CR) 10Jbond: [C: 03+2] debdeploy: add config to filter out services [puppet] - 10https://gerrit.wikimedia.org/r/497481 (owner: 10Jbond) [19:30:01] (03CR) 10MarcoAurelio: "Perhaps needs an update after this?" [puppet] - 10https://gerrit.wikimedia.org/r/497758 (owner: 10Muehlenhoff) [19:30:07] (03Merged) 10jenkins-bot: Added a setting to define Wikibase entity types that have no RDF output [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490586 (https://phabricator.wikimedia.org/T213483) (owner: 10WMDE-leszek) [19:35:13] PROBLEM - Apache HTTP on mwdebug1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:36:11] PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[19:36:19] RECOVERY - Apache HTTP on mwdebug1002 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 621 bytes in 0.038 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:37:53] PROBLEM - puppet last run on ores2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:39:22] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T213483 Set default wmgWikibaseEntityTypesWithoutRdfOutput value (duration: 00m 51s) [19:39:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:25] T213483: Fatal error: null EntityId argument passed to addEntityRedirect (Special:EntityData broken for some items) - https://phabricator.wikimedia.org/T213483 [19:40:45] !log jforrester@deploy1001 Synchronized wmf-config/Wikibase.php: T213483 Read wmgWikibaseEntityTypesWithoutRdfOutput value (duration: 00m 50s) [19:40:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:07] (03PS4) 10Jforrester: Disable RDF output of mediainfo Wikibase entities [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490587 (https://phabricator.wikimedia.org/T213483) (owner: 10WMDE-leszek) [19:41:15] RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational [19:42:10] (03CR) 10Jforrester: [C: 03+2] Disable RDF output of mediainfo Wikibase entities [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490587 (https://phabricator.wikimedia.org/T213483) (owner: 10WMDE-leszek) [19:42:14] (03PS2) 10GTirloni: role::labs::instance - Do not instantiate diamond if diamond::remove is true [puppet] - 10https://gerrit.wikimedia.org/r/498124 (https://phabricator.wikimedia.org/T218365) [19:43:11] (03Merged) 10jenkins-bot: Disable RDF output of mediainfo Wikibase entities [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490587 (https://phabricator.wikimedia.org/T213483) (owner: 10WMDE-leszek) [19:43:26] (03CR) 
10MarcoAurelio: [C: 03+1] "Replacements already merged on some other projects as well:" [puppet] - 10https://gerrit.wikimedia.org/r/498057 (https://phabricator.wikimedia.org/T218844) (owner: 10MarcoAurelio) [19:44:35] (03CR) 10GTirloni: [C: 03+2] role::labs::instance - Do not instantiate diamond if diamond::remove is true [puppet] - 10https://gerrit.wikimedia.org/r/498124 (https://phabricator.wikimedia.org/T218365) (owner: 10GTirloni) [19:45:55] !log jforrester@deploy1001 Synchronized wmf-config/Wikibase.php: T213483 Disable RDF output of mediainfo Wikibase entities (duration: 00m 49s) [19:45:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:59] T213483: Fatal error: null EntityId argument passed to addEntityRedirect (Special:EntityData broken for some items) - https://phabricator.wikimedia.org/T213483 [19:49:24] (03PS1) 10Papaul: DHCP: Add MAC address entries for dpprov200[1-2] [puppet] - 10https://gerrit.wikimedia.org/r/498212 [19:50:14] (03CR) 10jerkins-bot: [V: 04-1] DHCP: Add MAC address entries for dpprov200[1-2] [puppet] - 10https://gerrit.wikimedia.org/r/498212 (owner: 10Papaul) [19:53:45] stephanebisson: I don't know. 
my brain fried [19:53:47] :D [20:00:30] (03PS1) 10BryanDavis: striker: let uwsgi container and app logs flow to stdout/stderr [puppet] - 10https://gerrit.wikimedia.org/r/498214 (https://phabricator.wikimedia.org/T217932) [20:03:01] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [20:03:02] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [20:03:02] !log otto@deploy1001 scap-helm eventgate-analytics finished [20:03:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:11] RECOVERY - puppet last run on ores2002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [20:10:44] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/deploy codfw dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T218336 (10Papaul) [20:14:38] (03CR) 10Gergő Tisza: [C: 03+1] Add WikimediaEditorTasks to extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496210 (https://phabricator.wikimedia.org/T218137) (owner: 10Mholloway) [20:16:03] PROBLEM - Check systemd state on labweb1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[20:16:55] 10Puppet, 10Cloud-Services, 10Phabricator, 10cloud-services-team (Kanban): puppet function ipresolve unable to look up instance on labs-puppetmaster - https://phabricator.wikimedia.org/T139011 (10GTirloni) [20:17:23] RECOVERY - Check systemd state on labweb1002 is OK: OK - running: The system is fully operational [20:17:27] 10Operations, 10cloud-services-team (Kanban): netfilter software at WMF: iptables vs nftables - https://phabricator.wikimedia.org/T187994 (10GTirloni) [20:18:02] (03CR) 10CRusnov: "Thanks! Will add to list of things." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/497563 (https://phabricator.wikimedia.org/T212526) (owner: 10CRusnov) [20:18:04] (03PS2) 10Mholloway: Add WikimediaEditorTasks to extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496210 (https://phabricator.wikimedia.org/T218137) [20:21:10] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml --set main_app.kafka_broker_list=kafka-jumbo1002.eqiad.wmnet:9092 stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [20:21:12] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [20:21:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:12] !log otto@deploy1001 scap-helm eventgate-analytics finished [20:21:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:55] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml --set main_app.kafka_broker_list=kafka-jumbo1001.eqiad.wmnet:9092 stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [20:22:56] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [20:22:56] !log otto@deploy1001 scap-helm eventgate-analytics finished [20:22:56] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:41] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml --set main_app.kafka_broker_list=kafka-jumbo1001.eqiad.wmnet:9092 stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [20:23:42] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [20:23:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:42] !log otto@deploy1001 scap-helm eventgate-analytics finished [20:23:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:57] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml --set main_app.kafka_broker_list=kafka-jumbo1003.eqiad.wmnet:9092 stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [20:23:58] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [20:23:58] !log otto@deploy1001 scap-helm eventgate-analytics finished [20:23:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:38] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml --set main_app.kafka_broker_list=kafka-jumbo1004.eqiad.wmnet:9092 stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [20:24:39] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [20:24:39] !log otto@deploy1001 scap-helm 
eventgate-analytics finished [20:24:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:33] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml --set main_app.kafka_broker_list=kafka-jumbo1005.eqiad.wmnet:9092 stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [20:26:34] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [20:26:34] !log otto@deploy1001 scap-helm eventgate-analytics finished [20:26:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:51] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml --set main_app.kafka_broker_list=kafka-jumbo1006.eqiad.wmnet:9092 stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [20:27:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:52] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [20:27:52] !log otto@deploy1001 scap-helm eventgate-analytics finished [20:27:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:49] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916 [20:29:09] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml --set main_app.kafka_broker_list=kafka-jumbo1006.eqiad.wmnet:9092 
stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [20:29:10] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [20:29:10] !log otto@deploy1001 scap-helm eventgate-analytics finished [20:29:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:27] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [20:29:28] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [20:29:28] !log otto@deploy1001 scap-helm eventgate-analytics finished [20:29:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:08] 10Operations, 10Cloud-VPS, 10Patch-For-Review, 10cloud-services-team (Kanban): convert cloud VPS projects from apache to httpd module - https://phabricator.wikimedia.org/T202574 (10GTirloni) [20:40:21] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 5.395 second response time https://phabricator.wikimedia.org/T174916 [20:43:22] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [20:43:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:24] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [20:43:24] !log otto@deploy1001 scap-helm eventgate-analytics finished [20:43:25] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:17] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916 [20:44:35] (03PS3) 10Herron: modsec directive @ipMatchFromFile broken for apache 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/498171 [20:45:12] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/deploy codfw dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T218336 (10Papaul) ` papaul@asw-a-codfw# run show interfaces xe-4/0/18 descriptions Interface Admin Link Description xe-4/0/18 up... [20:47:39] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/deploy codfw dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T218336 (10Papaul) [20:51:19] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/deploy codfw dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T218336 (10Papaul) @Marostegui @jcrespo all is set at my end RAID 0 for the 2 SSD's and RAID 6 for the 8 other disks don't know who to assign the t... [20:52:09] (03CR) 10Herron: [C: 03+2] modsec directive @ipMatchFromFile broken for apache 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/498171 (owner: 10Herron) [20:52:17] (03PS4) 10Herron: modsec directive @ipMatchFromFile broken for apache 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/498171 [20:52:36] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/deploy codfw dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T218336 (10Papaul) Also please don't forget to merge the DHCP and DNS changes. [20:52:46] (03PS1) 10CRusnov: Add sshguard to base module. 
[puppet] - 10https://gerrit.wikimedia.org/r/498231 [20:53:01] 10Operations, 10Cloud-Services, 10netops, 10cloud-services-team (Kanban): Allocate public v4 IPs for Neutron setup in eqiad - https://phabricator.wikimedia.org/T193496 (10GTirloni) [20:53:35] (03CR) 10jerkins-bot: [V: 04-1] Add sshguard to base module. [puppet] - 10https://gerrit.wikimedia.org/r/498231 (owner: 10CRusnov) [20:54:23] (03PS1) 10EBernhardson: [WIP] Switch mjolnir to rsyslog based structured logging [puppet] - 10https://gerrit.wikimedia.org/r/498232 (https://phabricator.wikimedia.org/T218833) [20:54:57] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Switch mjolnir to rsyslog based structured logging [puppet] - 10https://gerrit.wikimedia.org/r/498232 (https://phabricator.wikimedia.org/T218833) (owner: 10EBernhardson) [20:55:37] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [20:55:39] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [20:55:39] !log otto@deploy1001 scap-helm eventgate-analytics finished [20:55:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:21] (03PS2) 10CRusnov: Add sshguard to base module. [puppet] - 10https://gerrit.wikimedia.org/r/498231 [20:57:23] (03CR) 10jerkins-bot: [V: 04-1] Add sshguard to base module. 
[puppet] - 10https://gerrit.wikimedia.org/r/498231 (owner: 10CRusnov) [21:08:38] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [21:08:39] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 6.175 second response time https://phabricator.wikimedia.org/T174916 [21:08:39] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [21:08:39] !log otto@deploy1001 scap-helm eventgate-analytics finished [21:08:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:08:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:08:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:29] !log remove peering sessions to AS7385 on cr4-ulsfo [21:11:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:33] papaul: ^ [21:12:33] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916 [21:12:38] XioNoX: see it [21:14:12] (03PS3) 10CRusnov: Add sshguard to base module. [puppet] - 10https://gerrit.wikimedia.org/r/498231 [21:15:18] (03CR) 10jerkins-bot: [V: 04-1] Add sshguard to base module. [puppet] - 10https://gerrit.wikimedia.org/r/498231 (owner: 10CRusnov) [21:17:19] 10Operations: Puppet constantly trying to stop the already stopped puppetmaster process on Trusty - https://phabricator.wikimedia.org/T159536 (10GTirloni) Closing as Trusty is deprecated. [21:17:27] 10Operations: Puppet constantly trying to stop the already stopped puppetmaster process on Trusty - https://phabricator.wikimedia.org/T159536 (10GTirloni) 05Open→03Invalid [21:18:28] (03PS4) 10CRusnov: Add sshguard to base module. 
[puppet] - 10https://gerrit.wikimedia.org/r/498231 [21:19:06] argh annoying patch pinging me all the time :P [21:19:56] (03CR) 10jerkins-bot: [V: 04-1] Add sshguard to base module. [puppet] - 10https://gerrit.wikimedia.org/r/498231 (owner: 10CRusnov) [21:24:16] (03PS5) 10CRusnov: Add sshguard to base module. [puppet] - 10https://gerrit.wikimedia.org/r/498231 [21:25:46] (03CR) 10jerkins-bot: [V: 04-1] Add sshguard to base module. [puppet] - 10https://gerrit.wikimedia.org/r/498231 (owner: 10CRusnov) [21:27:34] (03PS1) 10ArielGlenn: use MediaWiki maintenance script to get db user and password [dumps] - 10https://gerrit.wikimedia.org/r/498245 (https://phabricator.wikimedia.org/T218923) [21:27:51] (03CR) 10jerkins-bot: [V: 04-1] use MediaWiki maintenance script to get db user and password [dumps] - 10https://gerrit.wikimedia.org/r/498245 (https://phabricator.wikimedia.org/T218923) (owner: 10ArielGlenn) [21:32:00] (03PS2) 10ArielGlenn: use MediaWiki maintenance script to get db user and password [dumps] - 10https://gerrit.wikimedia.org/r/498245 (https://phabricator.wikimedia.org/T218923) [21:34:13] (03PS3) 10Mholloway: Add WikimediaEditorTasks to extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496210 (https://phabricator.wikimedia.org/T218137) [21:34:15] (03PS5) 10Mholloway: WikimediaEditorTasks: Add config to InitializeSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496211 (https://phabricator.wikimedia.org/T218137) [21:34:17] (03PS5) 10Mholloway: WikimediaEditorTasks: Add Beta Cluster config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496212 (https://phabricator.wikimedia.org/T218137) [21:34:19] (03PS5) 10Mholloway: WikimediaEditorTasks: Load extension in CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496213 (https://phabricator.wikimedia.org/T218137) [21:36:09] (03CR) 10jerkins-bot: [V: 04-1] WikimediaEditorTasks: Add config to InitializeSettings.php [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/496211 (https://phabricator.wikimedia.org/T218137) (owner: 10Mholloway) [21:36:11] (03CR) 10jerkins-bot: [V: 04-1] WikimediaEditorTasks: Add Beta Cluster config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496212 (https://phabricator.wikimedia.org/T218137) (owner: 10Mholloway) [21:36:32] (03CR) 10jerkins-bot: [V: 04-1] WikimediaEditorTasks: Load extension in CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496213 (https://phabricator.wikimedia.org/T218137) (owner: 10Mholloway) [21:38:49] (03PS6) 10Mholloway: WikimediaEditorTasks: Add config to InitializeSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496211 (https://phabricator.wikimedia.org/T218137) [21:38:51] (03PS6) 10Mholloway: WikimediaEditorTasks: Add Beta Cluster config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496212 (https://phabricator.wikimedia.org/T218137) [21:38:53] (03PS6) 10Mholloway: WikimediaEditorTasks: Load extension in CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496213 (https://phabricator.wikimedia.org/T218137) [21:39:10] !log Ping offload - replace test IP with text-lb.codfw IP on cr1/2-codfw - T190090 [21:39:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:13] T190090: Offload pings to dedicated server - https://phabricator.wikimedia.org/T190090 [21:40:00] (03CR) 10jerkins-bot: [V: 04-1] WikimediaEditorTasks: Add config to InitializeSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496211 (https://phabricator.wikimedia.org/T218137) (owner: 10Mholloway) [21:40:02] (03CR) 10jerkins-bot: [V: 04-1] WikimediaEditorTasks: Add Beta Cluster config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496212 (https://phabricator.wikimedia.org/T218137) (owner: 10Mholloway) [21:40:15] (03CR) 10jerkins-bot: [V: 04-1] WikimediaEditorTasks: Load extension in CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496213 
(https://phabricator.wikimedia.org/T218137) (owner: Mholloway)
[21:40:19] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916
[21:41:13] (PS6) CRusnov: Add sshguard to base module. [puppet] - https://gerrit.wikimedia.org/r/498231 (https://phabricator.wikimedia.org/T218956)
[21:41:27] (PS7) Mholloway: WikimediaEditorTasks: Add config to InitializeSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/496211 (https://phabricator.wikimedia.org/T218137)
[21:41:29] (PS7) Mholloway: WikimediaEditorTasks: Add Beta Cluster config [mediawiki-config] - https://gerrit.wikimedia.org/r/496212 (https://phabricator.wikimedia.org/T218137)
[21:41:31] (PS7) Mholloway: WikimediaEditorTasks: Load extension in CommonSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/496213 (https://phabricator.wikimedia.org/T218137)
[21:42:34] (CR) jerkins-bot: [V: -1] Add sshguard to base module.
[puppet] - https://gerrit.wikimedia.org/r/498231 (https://phabricator.wikimedia.org/T218956) (owner: CRusnov)
[21:43:09] (CR) jerkins-bot: [V: -1] WikimediaEditorTasks: Load extension in CommonSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/496213 (https://phabricator.wikimedia.org/T218137) (owner: Mholloway)
[21:44:02] (PS8) Mholloway: WikimediaEditorTasks: Load extension in CommonSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/496213 (https://phabricator.wikimedia.org/T218137)
[21:45:22] !log gehel@cumin1001 START - Cookbook sre.elasticsearch.e6-upgrade
[21:45:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:46:06] (PS2) Esanders: VE section editing: Enable mobile AB test on remaining target wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/498084 (https://phabricator.wikimedia.org/T218851)
[21:46:14] (PS3) Esanders: VE section editing: Enable mobile AB test on remaining target wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/498084 (https://phabricator.wikimedia.org/T218851)
[21:47:51] (CR) Mholloway: [C: +2] Add WikimediaEditorTasks to extension-list [mediawiki-config] - https://gerrit.wikimedia.org/r/496210 (https://phabricator.wikimedia.org/T218137) (owner: Mholloway)
[21:48:03] (CR) Mholloway: [C: +2] WikimediaEditorTasks: Add config to InitializeSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/496211 (https://phabricator.wikimedia.org/T218137) (owner: Mholloway)
[21:48:06] (CR) Mholloway: [C: +2] WikimediaEditorTasks: Add Beta Cluster config [mediawiki-config] - https://gerrit.wikimedia.org/r/496212 (https://phabricator.wikimedia.org/T218137) (owner: Mholloway)
[21:48:13] (CR) Mholloway: [C: +2] WikimediaEditorTasks: Load extension in CommonSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/496213 (https://phabricator.wikimedia.org/T218137) (owner: Mholloway)
[21:48:50] (Merged)
jenkins-bot: Add WikimediaEditorTasks to extension-list [mediawiki-config] - https://gerrit.wikimedia.org/r/496210 (https://phabricator.wikimedia.org/T218137) (owner: Mholloway)
[21:49:00] (Merged) jenkins-bot: WikimediaEditorTasks: Add config to InitializeSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/496211 (https://phabricator.wikimedia.org/T218137) (owner: Mholloway)
[21:49:16] (Merged) jenkins-bot: WikimediaEditorTasks: Add Beta Cluster config [mediawiki-config] - https://gerrit.wikimedia.org/r/496212 (https://phabricator.wikimedia.org/T218137) (owner: Mholloway)
[21:49:22] (Merged) jenkins-bot: WikimediaEditorTasks: Load extension in CommonSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/496213 (https://phabricator.wikimedia.org/T218137) (owner: Mholloway)
[21:52:41] !log mholloway-shell@deploy1001 Synchronized wmf-config/extension-list: Add WikimediaEditorTasks to extension-list (duration: 00m 50s)
[21:52:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:53:15] !log Restarting pdfrender on scb1004
[21:53:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:54:21] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time https://phabricator.wikimedia.org/T174916
[21:54:40] !log mholloway-shell@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Add WikimediaEditorTasks default config to InitializeSettings.php (duration: 00m 49s)
[21:54:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:56:42] !log mholloway-shell@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: Add WikimediaEditorTasks labs config to InitializeSettings-labs.php (duration: 00m 47s)
[21:56:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:02:22] !log mholloway-shell@deploy1001 Synchronized wmf-config/CommonSettings.php: Enable
WikimediaEditorTasks on the Beta Cluster (duration: 00m 49s)
[22:02:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:14:24] !log gehel@cumin1001 END (PASS) - Cookbook sre.elasticsearch.e6-upgrade (exit_code=0)
[22:14:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:14:51] Operations, ops-codfw, DBA, Patch-For-Review: rack/setup/deploy codfw dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T218336 (jcrespo) a:Papaul→None
[22:26:35] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging]
[22:26:36] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed
[22:26:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:26:36] !log otto@deploy1001 scap-helm eventgate-analytics finished
[22:26:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:26:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:28:39] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 7.562 second response time https://phabricator.wikimedia.org/T174916
[22:30:40] (PS1) Bstorm: cloudvps: Fix error in lookup for diamond::remove [puppet] - https://gerrit.wikimedia.org/r/498255 (https://phabricator.wikimedia.org/T218959)
[22:32:33] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916
[22:33:02] !log Restarting pdfrender on scb1003
[22:33:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:33:41] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.021 second response time https://phabricator.wikimedia.org/T174916
[22:37:12] (CR) Bstorm: [C: +2] cloudvps: Fix error in lookup for
diamond::remove [puppet] - https://gerrit.wikimedia.org/r/498255 (https://phabricator.wikimedia.org/T218959) (owner: Bstorm)
[22:38:40] /77/
[22:52:16] jouncebot: now
[22:52:16] No deployments scheduled for the next 0 hour(s) and 7 minute(s)
[22:52:29] Platonides: ?
[22:52:45] * bd808 will wait until after swat
[22:52:47] I was trying to type /77
[22:53:03] E_TOOMANYCHANNELS ;)
[22:53:20] ypu :)
[22:53:25] yup :)
[22:59:24] I'll SWAT.
[22:59:27] (CR) Cwhite: "Looking good. Suggestions inline." (4 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/496496 (owner: Jbond)
[22:59:38] Operations, Analytics, Analytics-Kanban, EventBus, and 3 others: eventgate-analytics k8s pods occasionally can't produce to kafka - https://phabricator.wikimedia.org/T218268 (Ottomata) I don't know much more, but I have a lot more data! Here is a staging pod with trace logging enabled reproducin...
[22:59:48] bd808: Want to go now? CI will take a while. :-)
[23:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear deployers, time to do the Evening SWAT (Max 6 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190321T2300).
[23:00:04] Amir1: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[23:00:12] o/
[23:00:14] James_F: thanks, but I have some more stuff to stage :)
[23:00:21] OK, no worries.
[23:00:44] Amir1: I'm merging your patch now. The CX one also needs a full i18n scap, so no worries. Can we do the config one first?
[23:00:57] (PS2) BryanDavis: striker: let uwsgi container and app logs flow to stdout/stderr [puppet] - https://gerrit.wikimedia.org/r/498214 (https://phabricator.wikimedia.org/T217932)
[23:01:05] James_F: sure, thanks
[23:01:10] (CR) Jforrester: [C: +2] Add wikimaniawiki to another special group in Wikibase client [mediawiki-config] - https://gerrit.wikimedia.org/r/498140 (https://phabricator.wikimedia.org/T217730) (owner: Ladsgroup)
[23:02:50] (PS2) Jforrester: Add wikimaniawiki to another special group in Wikibase client [mediawiki-config] - https://gerrit.wikimedia.org/r/498140 (https://phabricator.wikimedia.org/T217730) (owner: Ladsgroup)
[23:02:56] (CR) Jforrester: [C: +2] "…" [mediawiki-config] - https://gerrit.wikimedia.org/r/498140 (https://phabricator.wikimedia.org/T217730) (owner: Ladsgroup)
[23:04:04] (Merged) jenkins-bot: Add wikimaniawiki to another special group in Wikibase client [mediawiki-config] - https://gerrit.wikimedia.org/r/498140 (https://phabricator.wikimedia.org/T217730) (owner: Ladsgroup)
[23:04:19] (PS3) BryanDavis: striker: let uwsgi container and app logs flow to stdout/stderr [puppet] - https://gerrit.wikimedia.org/r/498214 (https://phabricator.wikimedia.org/T217932)
[23:04:39] Amir1: Live on mwdebug1002. Testable?
[23:04:48] yeah
[23:06:39] James_F: It doesn't seem to be working but it might be cache too
[23:06:52] Hmm. Want me to sync it anyway?
[23:07:54] yeah
[23:07:59] it's needed anyway
[23:08:00] Okie-dokie.
[23:08:51] !log jforrester@deploy1001 Synchronized wmf-config/Wikibase.php: SWAT T217730 Add wikimaniawiki to another special group in Wikibase client (duration: 00m 49s)
[23:08:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:08:55] T217730: Connect wikimaniawiki to Wikidata - https://phabricator.wikimedia.org/T217730
[23:09:04] Yay for a working SAL again.
[23:09:08] * James_F waits for CI.
[23:09:23] Thanks
[23:15:45] (PS4) BryanDavis: striker: let uwsgi container and app logs flow to stdout/stderr [puppet] - https://gerrit.wikimedia.org/r/498214 (https://phabricator.wikimedia.org/T217932)
[23:15:47] (PS1) BryanDavis: striker: Disable developer account creation [puppet] - https://gerrit.wikimedia.org/r/498262
[23:16:09] * James_F paints "go faster" stripes on the side of jenkins.
[23:20:26] (CR) Bstorm: [C: +2] striker: Disable developer account creation [puppet] - https://gerrit.wikimedia.org/r/498262 (owner: BryanDavis)
[23:21:07] (PS1) Ayounsi: Add Icinga alert to ping-offload dashboard alerts [puppet] - https://gerrit.wikimedia.org/r/498264 (https://phabricator.wikimedia.org/T190090)
[23:21:31] (CR) BryanDavis: striker: let uwsgi container and app logs flow to stdout/stderr (1 comment) [puppet] - https://gerrit.wikimedia.org/r/498214 (https://phabricator.wikimedia.org/T217932) (owner: BryanDavis)
[23:21:51] (CR) jerkins-bot: [V: -1] Add Icinga alert to ping-offload dashboard alerts [puppet] - https://gerrit.wikimedia.org/r/498264 (https://phabricator.wikimedia.org/T190090) (owner: Ayounsi)
[23:24:45] Gah, flaky tests in merge.
[23:26:24] Amir1: OK, it's on mwdebug1002, but obviously I've not done the i18n rebuild yet. :-)
[23:26:44] so :D
[23:26:50] What can I do?
[23:27:18] (CR) Greg Grossmeier: "recheck" [puppet] - https://gerrit.wikimedia.org/r/498264 (https://phabricator.wikimedia.org/T190090) (owner: Ayounsi)
[23:27:18] Nothing, just waiting for the third patch so I can scap them all together.
[23:27:53] (CR) jerkins-bot: [V: -1] Add Icinga alert to ping-offload dashboard alerts [puppet] - https://gerrit.wikimedia.org/r/498264 (https://phabricator.wikimedia.org/T190090) (owner: Ayounsi)
[23:28:43] (CR) Smalyshev: [C: +1] Add curl to standard packages [puppet] - https://gerrit.wikimedia.org/r/498045 (https://phabricator.wikimedia.org/T213527) (owner: Muehlenhoff)
[23:29:57] PROBLEM - HHVM rendering on mwdebug1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[23:31:05] RECOVERY - HHVM rendering on mwdebug1002 is OK: HTTP OK: HTTP/1.1 200 OK - 76220 bytes in 0.177 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[23:32:24] !log jforrester@deploy1001 Synchronized php-1.33.0-wmf.22/extensions/ContentTranslation/api/ApiQueryContentTranslationSuggestions.php: SWAT T218902 CX: Return API error on anonymous suggestions queries (duration: 00m 51s)
[23:32:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:32:27] T218902: Log spam: "User account is not global" - https://phabricator.wikimedia.org/T218902
[23:34:05] jouncebot: refresh
[23:34:06] I refreshed my knowledge about deployments.
[23:38:40] Amir1: Really sorry about this.
:-(
[23:39:04] no no, don't worry
[23:39:07] it can wait
[23:42:05] (PS2) Ayounsi: Add Icinga alert to ping-offload dashboard alerts [puppet] - https://gerrit.wikimedia.org/r/498264 (https://phabricator.wikimedia.org/T190090)
[23:47:07] (PS3) DannyS712: Change '/r/p/' to '/r/' for gerrit links [puppet] - https://gerrit.wikimedia.org/r/498167
[23:50:01] (CR) Ayounsi: "https://puppet-compiler.wmflabs.org/compiler1002/15269/icinga1001.wikimedia.org/" [puppet] - https://gerrit.wikimedia.org/r/498264 (https://phabricator.wikimedia.org/T190090) (owner: Ayounsi)
[23:53:49] !log downtimed systemd check in labwen1001 (T210818)
[23:53:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:53:52] T210818: Move admin cron jobs to systemd timers - https://phabricator.wikimedia.org/T210818
[23:54:19] (PS1) CRusnov: Add synchronizing nodes to ganeti-netbox sync. [software/netbox-deploy] - https://gerrit.wikimedia.org/r/498268
[23:56:08] !log jforrester@deploy1001 Started scap: SWAT: Full scap for i18n rebuild for 498259 and 498113
[23:56:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:56:31] Amir1: ^^
[23:56:42] bd808: SWAT is going late, sorry. :-(
[23:56:42] Thanks!
[23:56:58] James_F: no worries. It takes the time it takes
[23:57:11] !log downtimed systemd check in labweb1001/1002 (T218935)
[23:57:13] * bd808 ran the train enough times to become zen about jerkins
[23:57:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:57:14] T218935: wikitech runJobs.php error - https://phabricator.wikimedia.org/T218935
[23:58:59] PROBLEM - HHVM rendering on mwdebug1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers