[00:22:19] PROBLEM - HHVM rendering on mw1340 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:23:35] RECOVERY - HHVM rendering on mw1340 is OK: HTTP OK: HTTP/1.1 200 OK - 79430 bytes in 0.188 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[01:26:25] PROBLEM - Disk space on prometheus1004 is CRITICAL: DISK CRITICAL - free space: /srv/prometheus/ops 19053 MB (3% inode=67%)
[01:40:39] PROBLEM - Disk space on prometheus1004 is CRITICAL: DISK CRITICAL - free space: /srv/prometheus/ops 19154 MB (3% inode=67%)
[02:11:39] PROBLEM - Disk space on prometheus1004 is CRITICAL: DISK CRITICAL - free space: /srv/prometheus/ops 19123 MB (3% inode=67%)
[02:20:49] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 17707920 and 0 seconds
[02:22:07] RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 71 seconds
[02:35:45] PROBLEM - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is CRITICAL: 138.9 ge 130 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen
[02:40:07] PROBLEM - puppet last run on dns2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:06:33] RECOVERY - puppet last run on dns2002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[03:21:15] (CR) Andrew Bogott: [C: -1] "Minor thing about rearranging comments, otherwise looks good" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/501587 (https://phabricator.wikimedia.org/T171188) (owner: Alex Monk)
[03:35:17] PROBLEM - puppet last run on labvirt1001 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[03:35:55] PROBLEM - puppet last run on mx1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:38:11] PROBLEM - Disk space on prometheus1004 is CRITICAL: DISK CRITICAL - free space: /srv/prometheus/ops 19041 MB (3% inode=67%)
[03:42:07] PROBLEM - Disk space on prometheus1004 is CRITICAL: DISK CRITICAL - free space: /srv/prometheus/ops 19058 MB (3% inode=67%)
[03:46:01] PROBLEM - Disk space on prometheus1004 is CRITICAL: DISK CRITICAL - free space: /srv/prometheus/ops 18923 MB (3% inode=67%)
[03:54:59] PROBLEM - Disk space on prometheus1004 is CRITICAL: DISK CRITICAL - free space: /srv/prometheus/ops 18783 MB (3% inode=67%)
[04:01:43] RECOVERY - puppet last run on labvirt1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:07:39] RECOVERY - puppet last run on mx1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[04:20:15] PROBLEM - puppet last run on mw1254 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[04:20:49] PROBLEM - Disk space on prometheus1004 is CRITICAL: DISK CRITICAL - free space: /srv/prometheus/ops 18568 MB (3% inode=67%)
[04:27:11] PROBLEM - Disk space on prometheus1004 is CRITICAL: DISK CRITICAL - free space: /srv/prometheus/ops 18834 MB (3% inode=67%)
[04:36:03] RECOVERY - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is OK: (C)130 ge (W)110 ge 109.7 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen
[04:44:05] PROBLEM - Disk space on prometheus1004 is CRITICAL: DISK CRITICAL - free space: /srv/prometheus/ops 18972 MB (3% inode=67%)
[04:46:41] RECOVERY - puppet last run on mw1254 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:50:33] PROBLEM - Disk space on prometheus1004 is CRITICAL: DISK CRITICAL - free space: /srv/prometheus/ops 18822 MB (3% inode=67%)
[05:00:47] PROBLEM - Disk space on prometheus1004 is CRITICAL: DISK CRITICAL - free space: /srv/prometheus/ops 19085 MB (3% inode=67%)
[05:06:05] I am checking that
[05:12:48] Operations, monitoring: prometheus1004 /srv/prometheus/ops almost full - https://phabricator.wikimedia.org/T220326 (Marostegui)
[05:13:15] (PS2) Marostegui: db-eqiad.php: Depool all x1 slaves [mediawiki-config] - https://gerrit.wikimedia.org/r/501147 (https://phabricator.wikimedia.org/T143763)
[05:15:00] (CR) Marostegui: [C: +2] db-eqiad.php: Depool all x1 slaves [mediawiki-config] - https://gerrit.wikimedia.org/r/501147 (https://phabricator.wikimedia.org/T143763) (owner: Marostegui)
[05:16:09] (Merged) jenkins-bot: db-eqiad.php: Depool all x1 slaves [mediawiki-config] - https://gerrit.wikimedia.org/r/501147 (https://phabricator.wikimedia.org/T143763) (owner: Marostegui)
[05:18:01] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool all slaves in x1 T219777 T143763 (duration: 01m 30s)
[05:18:06] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:18:07] T143763: Remove unused bundling DB fields - https://phabricator.wikimedia.org/T143763
[05:18:07] T219777: DBA review of UrlShortener - https://phabricator.wikimedia.org/T219777
[05:18:55] PROBLEM - Disk space on prometheus1004 is CRITICAL: DISK CRITICAL - free space: /srv/prometheus/ops 18952 MB (3% inode=67%)
[05:20:02] !log Deploy schema change on x1 master with replication, there will be lag on x1 slaves T143763
[05:20:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:23:05] (CR) jenkins-bot: db-eqiad.php: Depool all x1 slaves [mediawiki-config] - https://gerrit.wikimedia.org/r/501147 (https://phabricator.wikimedia.org/T143763) (owner: Marostegui)
[05:27:07] PROBLEM - puppet last run on db1095 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:27:19] (PS2) Marostegui: db-eqiad.php: Promote db1075 to master [mediawiki-config] - https://gerrit.wikimedia.org/r/501481 (https://phabricator.wikimedia.org/T219115)
[05:34:17] PROBLEM - Disk space on prometheus1004 is CRITICAL: DISK CRITICAL - free space: /srv/prometheus/ops 18501 MB (3% inode=67%)
[05:40:51] PROBLEM - Disk space on prometheus1004 is CRITICAL: DISK CRITICAL - free space: /srv/prometheus/ops 18789 MB (3% inode=67%)
[05:42:45] Operations, monitoring: prometheus1004 /srv/prometheus/ops almost full - https://phabricator.wikimedia.org/T220326 (Joe) p:Triage→Unbreak!
[05:43:29] Operations, monitoring: prometheus1004 /srv/prometheus/ops almost full - https://phabricator.wikimedia.org/T220326 (Joe) Triaged to UBN! as by my estimation the partition serving /srv/prometheus/ops will fill up in the next 2-3 days.
[05:44:45] PROBLEM - Disk space on prometheus1004 is CRITICAL: DISK CRITICAL - free space: /srv/prometheus/ops 18787 MB (3% inode=67%)
[05:53:35] RECOVERY - puppet last run on db1095 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:55:05] PROBLEM - Disk space on prometheus1004 is CRITICAL: DISK CRITICAL - free space: /srv/prometheus/ops 19116 MB (3% inode=67%)
[06:01:29] PROBLEM - Disk space on prometheus1004 is CRITICAL: DISK CRITICAL - free space: /srv/prometheus/ops 18365 MB (3% inode=67%)
[06:19:45] PROBLEM - Disk space on prometheus1004 is CRITICAL: DISK CRITICAL - free space: /srv/prometheus/ops 19008 MB (3% inode=67%)
[06:28:15] PROBLEM - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:28:53] PROBLEM - netbox HTTPS on netmon1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 547 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/Netbox
[06:32:15] <_joe_> is anyone looking into netbox?
[06:32:56] <_joe_> ok looking into it
[06:33:27] RECOVERY - Check systemd state on netmon1002 is OK: OK - running: The system is fully operational
[06:34:05] RECOVERY - netbox HTTPS on netmon1002 is OK: HTTP OK: HTTP/1.1 302 Found - 348 bytes in 0.519 second response time https://wikitech.wikimedia.org/wiki/Netbox
[06:34:06] <_joe_> !log restarted netbox, SIGSEGV on HUP-induced reload
[06:34:06] is it the logrotate issue?
[06:34:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:34:14] ah yes
[06:34:15] <_joe_> elukey: is this known?
[06:34:40] _joe_ yep https://phabricator.wikimedia.org/T212697
[06:35:05] started a while a go
[06:35:07] *ago
[06:35:38] maybe we should prioritize it a bit
[06:36:18] Operations: uwsgi's logsocket_plugin.so causes segfaults during log rotation - https://phabricator.wikimedia.org/T212697 (Joe) FWIW, this happened again today.
Can we please try to get to the bottom of this?
[06:36:25] <_joe_> it's at high priority already
[06:36:30] <_joe_> how can we prioritize more?
[06:36:33] PROBLEM - Disk space on prometheus1004 is CRITICAL: DISK CRITICAL - free space: /srv/prometheus/ops 18475 MB (3% inode=67%)
[06:36:48] <_joe_> ok I'm going to ack the prometheus disk space issue
[06:40:27] <_joe_> uhm
[06:40:37] <_joe_> why can't I see this alert in the icinga UI?
[06:41:22] <_joe_> because it's back to warning
[06:42:13] caused it is moving between critical and warn
[06:42:17] yep
[06:43:01] <_joe_> I did ack it, with a sticky ack
[06:43:08] <_joe_> let's see if it works as intended
[06:51:43] (PS1) Marostegui: Revert "db-eqiad.php: Depool all x1 slaves" [mediawiki-config] - https://gerrit.wikimedia.org/r/502152
[06:59:39] (CR) Marostegui: [C: +2] Revert "db-eqiad.php: Depool all x1 slaves" [mediawiki-config] - https://gerrit.wikimedia.org/r/502152 (owner: Marostegui)
[07:00:50] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool all x1 slaves" [mediawiki-config] - https://gerrit.wikimedia.org/r/502152 (owner: Marostegui)
[07:02:35] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool all slaves in x1 T143763 (duration: 00m 58s)
[07:02:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:02:39] T143763: Remove unused bundling DB fields - https://phabricator.wikimedia.org/T143763
[07:02:53] !log installing wget security updates
[07:02:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:31] (CR) jenkins-bot: Revert "db-eqiad.php: Depool all x1 slaves" [mediawiki-config] - https://gerrit.wikimedia.org/r/502152 (owner: Marostegui)
[07:13:15] (PS1) Marostegui: db-eqiad.php: Depool all x1 slaves [mediawiki-config] - https://gerrit.wikimedia.org/r/502154 (https://phabricator.wikimedia.org/T217453)
[07:16:20] (CR) Marostegui: [C: +2] db-eqiad.php: Depool all x1 slaves
[mediawiki-config] - https://gerrit.wikimedia.org/r/502154 (https://phabricator.wikimedia.org/T217453) (owner: Marostegui)
[07:17:41] (Merged) jenkins-bot: db-eqiad.php: Depool all x1 slaves [mediawiki-config] - https://gerrit.wikimedia.org/r/502154 (https://phabricator.wikimedia.org/T217453) (owner: Marostegui)
[07:18:57] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool all slaves in x1 T217453 (duration: 00m 59s)
[07:19:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:01] T217453: Remove etp_user from echo_target_page in production - https://phabricator.wikimedia.org/T217453
[07:19:08] !log Deploy schema change on the first 10 wikis - T217453
[07:19:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:32] (CR) jenkins-bot: db-eqiad.php: Depool all x1 slaves [mediawiki-config] - https://gerrit.wikimedia.org/r/502154 (https://phabricator.wikimedia.org/T217453) (owner: Marostegui)
[07:19:51] (PS1) Marostegui: Revert "db-eqiad.php: Depool all x1 slaves" [mediawiki-config] - https://gerrit.wikimedia.org/r/502155
[07:21:50] (CR) Marostegui: [C: +2] Revert "db-eqiad.php: Depool all x1 slaves" [mediawiki-config] - https://gerrit.wikimedia.org/r/502155 (owner: Marostegui)
[07:23:33] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool all x1 slaves" [mediawiki-config] - https://gerrit.wikimedia.org/r/502155 (owner: Marostegui)
[07:24:34] (PS2) Marostegui: db-eqiad.php: Change parsercache key [mediawiki-config] - https://gerrit.wikimedia.org/r/499737 (https://phabricator.wikimedia.org/T210725)
[07:24:42] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool all slaves in x1 T217453 (duration: 00m 58s)
[07:24:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:24:46] T217453: Remove etp_user from echo_target_page in production - https://phabricator.wikimedia.org/T217453
[07:27:14] (CR)
Muehlenhoff: [C: +1] admin: create analytics-deployers group [puppet] - https://gerrit.wikimedia.org/r/501578 (https://phabricator.wikimedia.org/T220175) (owner: Elukey)
[07:29:21] RECOVERY - Disk space on prometheus1004 is OK: DISK OK
[07:30:02] (CR) Dzahn: [C: +1] admin: create analytics-deployers group [puppet] - https://gerrit.wikimedia.org/r/501578 (https://phabricator.wikimedia.org/T220175) (owner: Elukey)
[07:31:08] (CR) jenkins-bot: Revert "db-eqiad.php: Depool all x1 slaves" [mediawiki-config] - https://gerrit.wikimedia.org/r/502155 (owner: Marostegui)
[07:31:31] Operations, monitoring: prometheus1004 /srv/prometheus/ops almost full - https://phabricator.wikimedia.org/T220326 (fgiunchedi) p:Unbreak!→Normal I've cleaned up the snapshots used for the migration and added 300G to the `ops` instance filesystem (matching prometheus1003). Lowering to normal as t...
[07:31:38] (PS1) Dzahn: bird: add Icinga notes_url [puppet] - https://gerrit.wikimedia.org/r/502156
[07:32:21] (PS2) Dzahn: bird: add Icinga notes_url [puppet] - https://gerrit.wikimedia.org/r/502156
[07:33:20] (CR) Dzahn: [C: +2] bird: add Icinga notes_url [puppet] - https://gerrit.wikimedia.org/r/502156 (owner: Dzahn)
[07:33:28] Operations, ops-codfw, Cloud-VPS, DC-Ops, and 2 others: labtestnet2003.codfw.wmnet: rename to cloudnet2003-dev.codfw.wmnet and reimage to stretch - https://phabricator.wikimedia.org/T219776 (MoritzMuehlenhoff) Resolved→Open labtestnet2003 is still in puppetdb: ` jmm@cumin2001:~$ sudo cu...
[07:38:44] Operations, monitoring: prometheus1004 /srv/prometheus/ops almost full - https://phabricator.wikimedia.org/T220326 (Marostegui) Can we attach a `notes_url` parameter to that alert so we know how to proceed in case this happens again? :)
[07:47:13] Operations, monitoring: prometheus1004 /srv/prometheus/ops almost full - https://phabricator.wikimedia.org/T220326 (fgiunchedi) >>!
In T220326#5091992, @Marostegui wrote: > Can we attach a `notes_url` parameter to that alert so we know how to proceed in case this happens again? :) We definitely should!...
[07:47:59] (CR) Dzahn: [V: +1 C: +1] "lgtm. confirmed one is in row B and one in row C as requeted in ticket. IPs look fine. just think this needs to be coordinated with runnin" [dns] - https://gerrit.wikimedia.org/r/501686 (owner: Papaul)
[07:49:51] (CR) Dzahn: [C: -2] "don't merge yet because bast2002 is waiting for a router ACL change to be allowed to talk to mgmt network and not just servers" [puppet] - https://gerrit.wikimedia.org/r/499449 (https://phabricator.wikimedia.org/T219492) (owner: Dzahn)
[07:50:40] (CR) Dzahn: [C: -2] "stalled - needs router ACL change to be allowed to talk to mgmt network" [puppet] - https://gerrit.wikimedia.org/r/499740 (https://phabricator.wikimedia.org/T219492) (owner: Dzahn)
[07:54:25] (PS3) Filippo Giunchedi: grafana: remove frack datasources [puppet] - https://gerrit.wikimedia.org/r/501519 (https://phabricator.wikimedia.org/T219825)
[07:55:22] (CR) Filippo Giunchedi: [C: +2] grafana: remove frack datasources [puppet] - https://gerrit.wikimedia.org/r/501519 (https://phabricator.wikimedia.org/T219825) (owner: Filippo Giunchedi)
[07:57:39] (PS5) Dzahn: wikiba.se: add Apache rewrites for www to naked domain [puppet] - https://gerrit.wikimedia.org/r/500695 (https://phabricator.wikimedia.org/T99531)
[07:57:52] (CR) Dzahn: wikiba.se: add Apache rewrites for www to naked domain (1 comment) [puppet] - https://gerrit.wikimedia.org/r/500695 (https://phabricator.wikimedia.org/T99531) (owner: Dzahn)
[07:57:59] (CR) Marostegui: [C: +1] mariadb-snapshots: Setup full daily snapshots for all codfw sections [puppet] - https://gerrit.wikimedia.org/r/500980 (https://phabricator.wikimedia.org/T206203) (owner: Jcrespo)
[07:59:13] !log upgrading mw1266-mw1255 to HHVM 3.18.5+dfsg-1+wmf8+deb9u2 and wikidiff
1.8.1 (T203069)
[07:59:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:59:17] T203069: Deploy wikidiff2 v1.8.1 with changed signature - https://phabricator.wikimedia.org/T203069
[07:59:23] !log upgrading mw1266-mw1275 to HHVM 3.18.5+dfsg-1+wmf8+deb9u2 and wikidiff 1.8.1 (T203069)
[07:59:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:01:19] (PS2) Dzahn: add Icinga notes_url to various NRPE monitor checks, pt 3 [puppet] - https://gerrit.wikimedia.org/r/501568
[08:01:28] !log bounce grafana after https://gerrit.wikimedia.org/r/c/operations/puppet/+/501519
[08:01:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:39] (PS1) Filippo Giunchedi: grafana: add frack to deleteDatasources [puppet] - https://gerrit.wikimedia.org/r/502158
[08:12:33] (PS2) Filippo Giunchedi: grafana: add frack to deleteDatasources [puppet] - https://gerrit.wikimedia.org/r/502158 (https://phabricator.wikimedia.org/T219825)
[08:12:37] (CR) jerkins-bot: [V: -1] grafana: add frack to deleteDatasources [puppet] - https://gerrit.wikimedia.org/r/502158 (https://phabricator.wikimedia.org/T219825) (owner: Filippo Giunchedi)
[08:13:04] (CR) Filippo Giunchedi: "Didn't work out of the box, however Ief0f8a1e858b5 did" [puppet] - https://gerrit.wikimedia.org/r/501519 (https://phabricator.wikimedia.org/T219825) (owner: Filippo Giunchedi)
[08:13:36] (CR) Filippo Giunchedi: [C: +2] grafana: add frack to deleteDatasources [puppet] - https://gerrit.wikimedia.org/r/502158 (https://phabricator.wikimedia.org/T219825) (owner: Filippo Giunchedi)
[08:13:46] (CR) Dzahn: [C: +2] add Icinga notes_url to various NRPE monitor checks, pt 3 [puppet] - https://gerrit.wikimedia.org/r/501568 (owner: Dzahn)
[08:13:56] (PS3) Dzahn: add Icinga notes_url to various NRPE monitor checks, pt 3 [puppet] - https://gerrit.wikimedia.org/r/501568
[08:16:32] (PS5) Dzahn:
varnish/trafficserver: add regex to cover www.wikiba.se as well [puppet] - https://gerrit.wikimedia.org/r/500715 (https://phabricator.wikimedia.org/T99531)
[08:16:50] (CR) Dzahn: [C: -2] "stalled" [puppet] - https://gerrit.wikimedia.org/r/499224 (https://phabricator.wikimedia.org/T219492) (owner: Dzahn)
[08:17:26] !log delete fundraising folder from public grafana - T219825
[08:17:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:17:30] T219825: Update dashboards to node-exporter 0.16+ metric names - https://phabricator.wikimedia.org/T219825
[08:19:41] (CR) Dzahn: [C: -2] "stalled by https://phabricator.wikimedia.org/T215335 (dcops)" [puppet] - https://gerrit.wikimedia.org/r/496119 (https://phabricator.wikimedia.org/T215335) (owner: Dzahn)
[08:20:49] (PS2) Dzahn: mediawiki::cgroup: base::service_unit -> systemd::service [puppet] - https://gerrit.wikimedia.org/r/448778 (https://phabricator.wikimedia.org/T194724)
[08:22:30] Operations, Core Platform Team, MediaWiki-General-or-Unknown, PHP 7.2 support: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (Joe) I looked into the option of reverting the fixes in php 7.2, and it would mean bas...
[08:25:17] (CR) Dzahn: [C: +1] "https://puppet-compiler.wmflabs.org/compiler1002/15633/mwdebug1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/448778 (https://phabricator.wikimedia.org/T194724) (owner: Dzahn)
[08:27:45] (CR) Muehlenhoff: [C: +1] "Looks good, there are no further transition packages in use. Also, see comment inline."
(1 comment) [puppet] - https://gerrit.wikimedia.org/r/501608 (https://phabricator.wikimedia.org/T148843) (owner: Elukey)
[08:31:18] !log akosiaris@deploy1001 scap-helm zotero upgrade -f zotero-values-codfw.yaml production stable/zotero [namespace: zotero, clusters: codfw]
[08:31:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:26] !log akosiaris@deploy1001 scap-helm zotero cluster codfw completed
[08:31:26] !log akosiaris@deploy1001 scap-helm zotero finished
[08:31:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:12] !log lower CPU, memory limits for zotero pods. Set 1 cpu, 700Mi. This should help the pods to recover faster in some cases. The old memory leak issues we used to have seem to be no longer present
[08:32:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:35] !log akosiaris@deploy1001 scap-helm zotero upgrade -f zotero-values-eqiad.yaml production stable/zotero [namespace: zotero, clusters: eqiad]
[08:32:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:43] !log akosiaris@deploy1001 scap-helm zotero cluster eqiad completed
[08:32:43] !log akosiaris@deploy1001 scap-helm zotero finished
[08:32:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:50] Operations, Core Platform Team, MediaWiki-General-or-Unknown, PHP 7.2 support: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (jcrespo) So, I think Joe's proposal it to patch/revert the behaviour on PHP7 source co...
[08:34:32] !log akosiaris@deploy1001 scap-helm zotero upgrade -f zotero-values-staging.yaml --reset-values staging stable/zotero [namespace: zotero, clusters: staging]
[08:34:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:39] !log akosiaris@deploy1001 scap-helm zotero cluster staging completed
[08:34:39] !log akosiaris@deploy1001 scap-helm zotero finished
[08:34:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:38:15] Operations, Wikimedia-Logstash, service-runner, Core Platform Team (Security, stability, performance and scalability (TEC1)), and 2 others: Move graphoid logging to new logging pipeline - https://phabricator.wikimedia.org/T219923 (akosiaris) >>! In T219923#5086476, @Pchelolo wrote: > Apparently `...
[08:38:46] PROBLEM - Request latencies on acrux is CRITICAL: instance=10.192.0.93:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[08:41:18] RECOVERY - Request latencies on acrux is OK: All metrics within thresholds.
https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[08:45:26] !log upgrading API servers mw1221-mw1235 to HHVM 3.18.5+dfsg-1+wmf8+deb9u2 and wikidiff 1.8.1 (T203069)
[08:45:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:45:29] T203069: Deploy wikidiff2 v1.8.1 with changed signature - https://phabricator.wikimedia.org/T203069
[08:50:26] PROBLEM - HHVM rendering on mw1224 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[08:50:28] PROBLEM - Apache HTTP on mw1222 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[08:50:28] PROBLEM - Nginx local proxy to apache on mw1223 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[08:50:30] PROBLEM - HHVM rendering on mw1221 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[08:50:50] PROBLEM - Nginx local proxy to apache on mw1224 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[08:50:58] PROBLEM - Apache HTTP on mw1223 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[08:51:02] PROBLEM - Apache HTTP on mw1224 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[08:51:02] PROBLEM - Nginx local proxy to apache on mw1222 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.009 second response time
https://wikitech.wikimedia.org/wiki/Application_servers
[08:51:10] PROBLEM - HHVM rendering on mw1222 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[08:51:54] PROBLEM - HHVM processes on mw1222 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm https://wikitech.wikimedia.org/wiki/Application_servers
[08:51:59] ^ that's me, all fine
[08:52:02] silencing
[08:52:04] PROBLEM - HHVM processes on mw1221 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm https://wikitech.wikimedia.org/wiki/Application_servers
[08:52:10] PROBLEM - Check systemd state on mw1224 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:52:42] PROBLEM - Check systemd state on mw1223 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:52:42] PROBLEM - HHVM processes on mw1223 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm https://wikitech.wikimedia.org/wiki/Application_servers
[08:53:00] (CR) Volans: [C: +1] "LGTM" (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/501556 (owner: Gehel)
[08:53:32] RECOVERY - Nginx local proxy to apache on mw1223 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 1.833 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[08:53:32] RECOVERY - HHVM rendering on mw1221 is OK: HTTP OK: HTTP/1.1 200 OK - 79420 bytes in 0.932 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[08:53:42] RECOVERY - Check systemd state on mw1223 is OK: OK - running: The system is fully operational
[08:53:44] RECOVERY - HHVM processes on mw1223 is OK: PROCS OK: 6 processes with command name hhvm https://wikitech.wikimedia.org/wiki/Application_servers
[08:53:54] RECOVERY - Nginx local proxy to apache on mw1224 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes
in 2.264 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[08:53:56] RECOVERY - HHVM processes on mw1222 is OK: PROCS OK: 6 processes with command name hhvm https://wikitech.wikimedia.org/wiki/Application_servers
[08:54:02] RECOVERY - Apache HTTP on mw1223 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.072 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[08:54:06] RECOVERY - Apache HTTP on mw1224 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.104 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[08:54:06] RECOVERY - Nginx local proxy to apache on mw1222 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 0.178 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[08:54:06] RECOVERY - HHVM processes on mw1221 is OK: PROCS OK: 6 processes with command name hhvm https://wikitech.wikimedia.org/wiki/Application_servers
[08:54:14] RECOVERY - Check systemd state on mw1224 is OK: OK - running: The system is fully operational
[08:54:14] RECOVERY - HHVM rendering on mw1222 is OK: HTTP OK: HTTP/1.1 200 OK - 79439 bytes in 0.303 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[08:54:34] RECOVERY - Apache HTTP on mw1222 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[08:54:36] RECOVERY - HHVM rendering on mw1224 is OK: HTTP OK: HTTP/1.1 200 OK - 79441 bytes in 5.124 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[08:55:22] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[08:55:28] PROBLEM - PyBal backends health
check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1005.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:56:18] PROBLEM - PyBal IPVS diff check on lvs1006 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([druid1006.eqiad.wmnet, druid1005.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal
[08:57:01] <_joe_> whoa
[08:57:28] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([druid1006.eqiad.wmnet, druid1005.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal
[08:57:42] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:57:57] (PS5) Alaa Sarhan: Add wgWikibaseMusicalNotationLineWidthInches to config [mediawiki-config] - https://gerrit.wikimedia.org/r/498661 (https://phabricator.wikimedia.org/T218191)
[08:58:15] (PS9) Alaa Sarhan: Add wgWikibaseMusicalNotationLineWidthInches to labs config [mediawiki-config] - https://gerrit.wikimedia.org/r/498660 (https://phabricator.wikimedia.org/T218191)
[08:58:48] PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[09:00:56] RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[09:00:56] RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[09:01:10] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1006.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[09:05:21] (CR) Filippo Giunchedi: [C: +1] admin: create analytics-deployers group [puppet] - https://gerrit.wikimedia.org/r/501578 (https://phabricator.wikimedia.org/T220175) (owner: Elukey)
[09:06:37] PROBLEM - LVS HTTP IPv4 on druid-public-broker.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[09:08:31] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:08:47] PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[09:09:59] RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[09:10:23] RECOVERY - LVS HTTP IPv4 on druid-public-broker.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1423 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[09:12:06] RECOVERY - PyBal IPVS diff check on lvs1006 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[09:13:14] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[09:14:32] Operations, SRE-Access-Requests, Patch-For-Review, User-Elukey: Requesting ability to scap-deploy on stat1007 for gilles - https://phabricator.wikimedia.org/T220175 (fgiunchedi) >>! In T220175#5088610, @elukey wrote: >>>! In T220175#5088605, @RobH wrote: >> Does this new group have sudo? If they...
[09:18:45] 10Operations, 10Core Platform Team, 10MediaWiki-General-or-Unknown, 10PHP 7.2 support: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (10Joe) >>! In T219279#5092155, @jcrespo wrote: > So, I think Joe's proposal it to patch/... [09:19:47] 10Operations, 10Core Platform Team, 10MediaWiki-General-or-Unknown, 10PHP 7.2 support: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (10jcrespo) > I'd like to hear the opinion of someone with more experience with MediaWiki... [09:19:58] !log restarting icinga on icinga1001 - T196336 [09:20:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:02] T196336: Icinga passive checks go awol and downtime stops working - https://phabricator.wikimedia.org/T196336 [09:21:47] (03PS2) 10Gehel: elasticsearch: cleanup logging during shard allocation [software/spicerack] - 10https://gerrit.wikimedia.org/r/501556 [09:30:37] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:30:38] !log T219776 puppet node clean labtestnet2003.codfw.wmnet [09:30:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:42] T219776: labtestnet2003.codfw.wmnet: rename to cloudnet2003-dev.codfw.wmnet and reimage to stretch - https://phabricator.wikimedia.org/T219776 [09:31:04] (03CR) 10Alexandros Kosiaris: [C: 04-1] "This is a setting that is generally meant to be set on master per https://www.postgresql.org/docs/9.4/runtime-config-replication.html, is " [puppet] - 10https://gerrit.wikimedia.org/r/501384 (https://phabricator.wikimedia.org/T219652) (owner: 10Bstorm) 
[09:31:21] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1006.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [09:32:00] 10Operations, 10Icinga, 10monitoring, 10Patch-For-Review: Icinga passive checks go awol and downtime stops working - https://phabricator.wikimedia.org/T196336 (10Volans) The log from today makes me thing that there is some sort of race-condition when we reload icinga (triggered by puppet usually) and the p... [09:32:09] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:34:09] PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:34:09] PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:34:11] PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:34:25] PROBLEM - PyBal IPVS diff check on lvs1006 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([druid1006.eqiad.wmnet, 
druid1005.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [09:35:05] RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:35:14] (03PS2) 10Alexandros Kosiaris: waf: Move to httpd::conf instead of httpd::site [puppet] - 10https://gerrit.wikimedia.org/r/501639 [09:35:25] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([druid1006.eqiad.wmnet, druid1005.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [09:35:47] RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:35:53] (03CR) 10Volans: [C: 03+2] elasticsearch: cleanup logging during shard allocation (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/501556 (owner: 10Gehel) [09:35:56] 10Operations, 10Commons, 10Multimedia, 10Thumbor: Only one thumbor server (thumbor1002) upgraded to librsvg 2.40.20-3 - https://phabricator.wikimedia.org/T220342 (10MoritzMuehlenhoff) [09:36:06] volans: thanks! [09:36:09] RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:36:13] gehel: yw :) [09:36:18] 10Operations, 10Commons, 10Multimedia, 10Thumbor, 10serviceops: Only one thumbor server (thumbor1002) upgraded to librsvg 2.40.20-3 - https://phabricator.wikimedia.org/T220342 (10jijiki) [09:36:58] (03PS3) 10Alexandros Kosiaris: waf: Move to httpd::conf instead of httpd::site [puppet] - 10https://gerrit.wikimedia.org/r/501639 [09:37:00] (03PS2) 10Alexandros Kosiaris: waf: Remove realm if guards [puppet] - 10https://gerrit.wikimedia.org/r/501640 [09:37:15] RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:37:22] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] "thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/501639 (owner: 10Alexandros Kosiaris) [09:37:35] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] waf: Remove realm if guards [puppet] - 10https://gerrit.wikimedia.org/r/501640 (owner: 10Alexandros Kosiaris) [09:39:31] RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:39:46] (03Merged) 10jenkins-bot: elasticsearch: cleanup logging during shard allocation [software/spicerack] - 10https://gerrit.wikimedia.org/r/501556 (owner: 10Gehel) [09:40:49] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1005.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [09:40:52] (03CR) 10jenkins-bot: elasticsearch: cleanup logging during shard allocation [software/spicerack] - 10https://gerrit.wikimedia.org/r/501556 (owner: 10Gehel) [09:41:08] (03Restored) 10Dzahn: servermon: Add gunicorn.service systemd script [puppet] - 10https://gerrit.wikimedia.org/r/362455 (owner: 10Paladox) [09:41:11] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [09:42:54] (03CR) 10Muehlenhoff: "See https://phabricator.wikimedia.org/T198939#5085532" [puppet] - 10https://gerrit.wikimedia.org/r/362455 (owner: 10Paladox) [09:43:20] !log force allocation of 3 unassigned shards on elasticsearch / cirrus / eqiad [09:43:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:57] (03Abandoned) 10Dzahn: servermon: Add gunicorn.service systemd script [puppet] - 10https://gerrit.wikimedia.org/r/362455 (owner: 10Paladox) [09:44:45] (03PS1) 10Alexandros Kosiaris: eqiad1::labweb: Remove realm ifguard for waf [puppet] - 10https://gerrit.wikimedia.org/r/502166 [09:44:59] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - CRITICAL - 
druid-public-broker_8082: Servers druid1004.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [09:45:15] (03CR) 10Alexandros Kosiaris: [C: 03+2] eqiad1::labweb: Remove realm ifguard for waf [puppet] - 10https://gerrit.wikimedia.org/r/502166 (owner: 10Alexandros Kosiaris) [09:46:40] (03PS1) 10Muehlenhoff: Remove now obsolete tools-checker-grid-start-trusty monitoring service [puppet] - 10https://gerrit.wikimedia.org/r/502167 [09:47:19] (03PS7) 10Alexandros Kosiaris: service::node: Set config-vars.yaml's mode to 0440 [puppet] - 10https://gerrit.wikimedia.org/r/469791 (https://phabricator.wikimedia.org/T207143) (owner: 10Mobrovac) [09:48:33] PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:52:03] (03CR) 10Alexandros Kosiaris: "LGTM. Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/469791 (https://phabricator.wikimedia.org/T207143) (owner: 10Mobrovac) [09:52:07] (03CR) 10Alexandros Kosiaris: [C: 03+2] service::node: Set config-vars.yaml's mode to 0440 [puppet] - 10https://gerrit.wikimedia.org/r/469791 (https://phabricator.wikimedia.org/T207143) (owner: 10Mobrovac) [09:52:11] RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:54:01] PROBLEM - LVS HTTP IPv4 on druid-public-broker.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [09:54:23] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:55:37] RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:56:25] 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestnet2003.codfw.wmnet: rename to cloudnet2003-dev.codfw.wmnet and reimage to stretch - https://phabricator.wikimedia.org/T219776 (10aborrero) Not sure if I should drop the concrete registers in the puppetdb, or if cleaning the node is eno... 
[09:57:41] RECOVERY - LVS HTTP IPv4 on druid-public-broker.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1423 bytes in 2.635 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [09:58:19] 10Operations: decom netmon1003 - https://phabricator.wikimedia.org/T220355 (10Dzahn) [09:59:55] PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [10:00:27] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [10:01:44] (03CR) 10Giuseppe Lavagetto: "> Patch Set 3:" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/501017 (owner: 10Giuseppe Lavagetto) [10:01:55] (03CR) 10Giuseppe Lavagetto: "> Patch Set 2: Code-Review+1" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/501183 (owner: 10Giuseppe Lavagetto) [10:01:57] (03PS1) 10Muehlenhoff: Add library hint for librsvg [puppet] - 10https://gerrit.wikimedia.org/r/502170 [10:02:20] (03PS2) 10Muehlenhoff: Add library hint for librsvg [puppet] - 10https://gerrit.wikimedia.org/r/502170 [10:02:31] (03CR) 10Arturo Borrero Gonzalez: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/493767 (https://phabricator.wikimedia.org/T151704) (owner: 10BryanDavis) [10:02:44] (03PS1) 10Dzahn: turn netmon1003 into a spare, delete servermon role [puppet] - 10https://gerrit.wikimedia.org/r/502171 (https://phabricator.wikimedia.org/T198939) [10:02:46] (03PS1) 10Dzahn: mariadb: revoke servermon grants [puppet] - 10https://gerrit.wikimedia.org/r/502172 (https://phabricator.wikimedia.org/T198939) [10:02:48] (03PS1) 10Dzahn: deployment_server: remove servermon 
[puppet] - 10https://gerrit.wikimedia.org/r/502173 (https://phabricator.wikimedia.org/T198939) [10:02:59] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [10:03:02] (03PS1) 10Dzahn: delete servermon module [puppet] - 10https://gerrit.wikimedia.org/r/502174 (https://phabricator.wikimedia.org/T198939) [10:03:04] (03PS1) 10Dzahn: puppetmaster: remove servermon report [puppet] - 10https://gerrit.wikimedia.org/r/502175 (https://phabricator.wikimedia.org/T198939) [10:03:05] PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [10:03:06] (03PS1) 10Dzahn: mariadb::ferm_misc: remove firewall rule for servermon [puppet] - 10https://gerrit.wikimedia.org/r/502176 (https://phabricator.wikimedia.org/T198939) [10:03:10] (03PS1) 10Dzahn: cache::text: remove netmon1003 backend [puppet] - 10https://gerrit.wikimedia.org/r/502177 (https://phabricator.wikimedia.org/T220355) [10:03:12] (03PS1) 10Dzahn: install_server: decom netmon1003 [puppet] - 10https://gerrit.wikimedia.org/r/502178 (https://phabricator.wikimedia.org/T220355) [10:03:47] (03CR) 10Muehlenhoff: [C: 03+2] Add library hint for librsvg [puppet] - 10https://gerrit.wikimedia.org/r/502170 (owner: 10Muehlenhoff) [10:04:15] RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [10:04:26] akosiaris: ok to merge your service::node patch along? 
[10:04:30] (03PS1) 10Effie Mouzeli: thumbor: Fix apt pinning for wikimedia-thumbor [puppet] - 10https://gerrit.wikimedia.org/r/502180 (https://phabricator.wikimedia.org/T220342) [10:05:07] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [10:05:53] moritzm: yeah, thanks [10:06:09] ack, doing that now [10:06:17] RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [10:06:25] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. [10:06:53] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [10:09:15] (03CR) 10Effie Mouzeli: "https://puppet-compiler.wmflabs.org/compiler1002/15636/thumbor2003.codfw.wmnet/ looks ok" [puppet] - 10https://gerrit.wikimedia.org/r/502180 (https://phabricator.wikimedia.org/T220342) (owner: 10Effie Mouzeli) [10:14:43] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([druid1004.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [10:14:49] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [10:14:50] (03PS1) 10Filippo Giunchedi: admin: add evanp to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/502186 (https://phabricator.wikimedia.org/T220226) [10:15:59] RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [10:18:54] PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: 
/analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [10:19:25] moritzm jbond42 PTAL https://gerrit.wikimedia.org/r/c/operations/puppet/+/502186 when you have a moment, seems straightforward [10:19:48] RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [10:20:05] looking [10:21:09] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, staff additions to the cn=wmf LDAP group can be merged rightaway." [puppet] - 10https://gerrit.wikimedia.org/r/502186 (https://phabricator.wikimedia.org/T220226) (owner: 10Filippo Giunchedi) [10:21:15] according to clinic duty guide there's that addition plus "modify-ldap-group wmf" [10:21:27] ack [10:21:33] thanks moritzm ! [10:21:35] (03CR) 10Filippo Giunchedi: [C: 03+2] admin: add evanp to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/502186 (https://phabricator.wikimedia.org/T220226) (owner: 10Filippo Giunchedi) [10:21:43] (03PS2) 10Filippo Giunchedi: admin: add evanp to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/502186 (https://phabricator.wikimedia.org/T220226) [10:23:00] PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [10:23:09] !log Running debdeploy to upgrade librsvg [10:23:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:54] RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [10:25:02] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: 
LDAP access to the wmf group for Evan Prodromou - https://phabricator.wikimedia.org/T220226 (10fgiunchedi) @EvanProdromou you should now have access to `wmf` ldap group, please confirm/verify! [10:27:37] 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestnet2003.codfw.wmnet: rename to cloudnet2003-dev.codfw.wmnet and reimage to stretch - https://phabricator.wikimedia.org/T219776 (10Dzahn) I think you need both "node clean" and "node deactivate". [10:27:56] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:30:05] jan_drewniak: Dear deployers, time to do the Wikimedia Portals Update deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190408T1030). [10:30:33] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, and 2 others: Package envoy 1.9.X for stretch and use it as redis proxy on docker registry - https://phabricator.wikimedia.org/T215810 (10fsero) [10:30:52] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, and 2 others: Package envoy 1.9.X for stretch and use it as redis proxy on docker registry - https://phabricator.wikimedia.org/T215810 (10fsero) Building 1.9.1 due to CVE [10:31:16] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1004.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:32:24] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:32:27] jouncebot, next [10:32:27] In 0 hour(s) and 27 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190408T1100) [10:33:02] (03PS1) 10Volans: netbox: set login banner [puppet] - 10https://gerrit.wikimedia.org/r/502189 [10:33:22] PROBLEM - netbox HTTPS on netmon1002 
is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 547 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Netbox [10:33:34] apergos: ^^^ [10:34:06] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [10:34:09] ops, that was me, sorry [10:34:22] PROBLEM - LVS HTTP IPv4 on druid-public-broker.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [10:34:32] RECOVERY - netbox HTTPS on netmon1002 is OK: HTTP OK: HTTP/1.1 302 Found - 346 bytes in 0.012 second response time https://wikitech.wikimedia.org/wiki/Netbox [10:34:36] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502190 (https://phabricator.wikimedia.org/T128546) [10:34:36] silencing the alarms [10:34:43] !log upgrading app servers in codfw to HHVM 3.18.5+dfsg-1+wmf8+deb9u2 and wikidiff 1.8.1 (T203069) [10:34:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:47] T203069: Deploy wikidiff2 v1.8.1 with changed signature - https://phabricator.wikimedia.org/T203069 [10:35:14] 10Operations, 10Core Platform Team, 10MediaWiki-General-or-Unknown, 10serviceops, 10PHP 7.2 support: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (10Joe) a:03Joe >>! In T219279#5092106, @Joe wrote: > I looked into the... 
[10:35:30] RECOVERY - LVS HTTP IPv4 on druid-public-broker.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1423 bytes in 5.428 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [10:35:58] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1006.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:36:19] (03CR) 10ArielGlenn: [C: 03+1] netbox: set login banner [puppet] - 10https://gerrit.wikimedia.org/r/502189 (owner: 10Volans) [10:36:22] awesome [10:36:48] elukey: the problem is that the pybal one is generic, so if we silence it we lose visibility on other issues [10:37:08] (03CR) 10Volans: [C: 03+2] netbox: set login banner [puppet] - 10https://gerrit.wikimedia.org/r/502189 (owner: 10Volans) [10:38:01] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502190 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:38:04] volans: mmm ok I thought there was one for each endpoint, my bad.
Going to remove it and keep aqs silenced [10:38:31] I'm referring to the PyBal backends health check on lvs1016 [10:38:32] one [10:38:44] (03CR) 10Vgutierrez: [C: 03+2] profile::cache::ssl::unified: Allow passing certs/certs_active by hiera [puppet] - 10https://gerrit.wikimedia.org/r/500631 (https://phabricator.wikimedia.org/T182927) (owner: 10Alex Monk) [10:38:56] (03PS5) 10Vgutierrez: profile::cache::ssl::unified: Allow passing certs/certs_active by hiera [puppet] - 10https://gerrit.wikimedia.org/r/500631 (https://phabricator.wikimedia.org/T182927) (owner: 10Alex Monk) [10:39:10] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502190 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:39:36] volans: ah sure, I didn't disable that one [10:39:41] ack then [10:40:12] PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [10:41:14] just done --^ :P [10:41:41] lol [10:41:44] (lunch) [10:41:49] !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:502190| Bumping portals to master (T128546)]] (duration: 00m 59s) [10:41:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:53] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [10:42:48] !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:502190| Bumping portals to master (T128546)]] (duration: 00m 58s) [10:42:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:04] (03CR) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502190 
(https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:49:52] (03PS13) 10Fsero: Enabling docker registry swift cross dc replication [puppet] - 10https://gerrit.wikimedia.org/r/490073 (https://phabricator.wikimedia.org/T214289) [10:51:00] 10Operations, 10Acme-chief, 10Traffic: Benefit from acme-chief features in acme-chief clients - https://phabricator.wikimedia.org/T220359 (10Vgutierrez) [10:51:14] 10Operations, 10Acme-chief, 10Traffic: Benefit from acme-chief features in acme-chief clients - https://phabricator.wikimedia.org/T220359 (10Vgutierrez) p:05Triage→03Normal [10:52:40] (03CR) 10Fsero: "still happy https://puppet-compiler.wmflabs.org/compiler1002/15637/" [puppet] - 10https://gerrit.wikimedia.org/r/490073 (https://phabricator.wikimedia.org/T214289) (owner: 10Fsero) [10:53:11] (03CR) 10Vgutierrez: [C: 03+1] apache redirects: remove wicipediacymraeg.org [puppet] - 10https://gerrit.wikimedia.org/r/501202 (https://phabricator.wikimedia.org/T219856) (owner: 10Dzahn) [10:53:48] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:55:38] (03PS1) 10Vgutierrez: get rid of wicipediacymraeg.org [dns] - 10https://gerrit.wikimedia.org/r/502193 (https://phabricator.wikimedia.org/T219856) [10:57:40] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1005.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:58:54] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor I � Unicode. All rise for European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190408T1100). 
[11:00:04] hoo, Urbanecm, and Ammarpad: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:11] * Urbanecm is here [11:01:46] hoo, Ammarpad: around for swat? [11:02:23] 10Operations: Audit our infrastructure for authenticated services - https://phabricator.wikimedia.org/T220361 (10MoritzMuehlenhoff) [11:02:46] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1004.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:02:54] (03CR) 10Giuseppe Lavagetto: Add an update action (031 comment) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/487793 (owner: 10Giuseppe Lavagetto) [11:02:56] 10Operations: Evaluate SSO solutions - https://phabricator.wikimedia.org/T220362 (10MoritzMuehlenhoff) [11:04:25] Urbanecm: ok, looks it's just the two of us [11:04:39] I'll ping you in a few minutes when your patch is ready for testing [11:04:43] sure [11:05:24] you may deploy Ammarpad's patch together with my patch, if you want, I don't feel comfortable with testing the other one. [11:06:25] Urbanecm: you _do_ or _don't_ feel comfortable testing it? [11:06:56] I don't feel comfortable with 501335, I do feel comfortable with 486103 [11:07:02] sorry for confusing you [11:07:12] Sorry, got distracted… am here now [11:08:53] hoo: you're a deployer, right? you're deploying your own patch? [11:08:59] yes [11:09:12] hoo: ok, I'll ping you when I'm done [11:09:19] thanks :) [11:09:47] hoo: or do you want to also deploy other one/two patches? in that case swat is yours :) [11:10:04] zeljkof (or hoo, in case he's doing SWAT), I guess my message above is clear to you now? :) [11:10:37] Urbanecm: you can test your and Ammarpad's patch, but not hoo's, right? 
[11:10:49] exactly [11:10:55] No worries about that :D [11:11:27] hoo: sorry, I'm not sure if I understood you, can you deploy all three patches? [11:11:36] or should I deploy the other two? [11:11:39] (03CR) 10Jcrespo: db-eqiad.php: Promote db1075 to master (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501481 (https://phabricator.wikimedia.org/T219115) (owner: 10Marostegui) [11:11:41] I can do all them [11:11:45] * all of them [11:11:49] hoo: great, in that case, swat is yours :) [11:11:55] (03PS1) 10Vgutierrez: netbox: Offer an ECDSA certificate along with the RSA one [puppet] - 10https://gerrit.wikimedia.org/r/502195 (https://phabricator.wikimedia.org/T220359) [11:11:56] !log upgrading envoy to 1.9.1 T215810 [11:12:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:00] T215810: Package envoy 1.9.X for stretch and use it as redis proxy on docker registry - https://phabricator.wikimedia.org/T215810 [11:12:02] !log Restarted thumbor services after librsvg upgrade [11:12:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:09] Urbanecm: hoo is in charge of swat [11:12:13] ack [11:12:47] (03CR) 10Jcrespo: [C: 03+1] maintain-views: Note explicit exclusion of `oathauth_users` from replicas [puppet] - 10https://gerrit.wikimedia.org/r/496063 (https://phabricator.wikimedia.org/T218165) (owner: 10MarcoAurelio) [11:12:55] Urbanecm: shall we start with https://gerrit.wikimedia.org/r/c/501363/? 
[11:13:06] up to you hoo [11:13:30] (03CR) 10Jcrespo: [C: 03+1] db-eqiad.php: Set s3 to read only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501480 (https://phabricator.wikimedia.org/T219115) (owner: 10Marostegui) [11:13:50] RECOVERY - PyBal IPVS diff check on lvs1006 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [11:14:05] 10Puppet, 10Cloud-Services, 10cloud-services-team (Kanban): Consider ways to make puppetmaster CA changes smoother on the puppet client end - https://phabricator.wikimedia.org/T220268 (10GMoney0305) {meme, src=alucombo} [11:17:16] (03PS2) 10Hoo man: Create uploader user group for thwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501363 (https://phabricator.wikimedia.org/T216615) (owner: 10Urbanecm) [11:18:00] (03CR) 10Hoo man: [C: 03+2] Create uploader user group for thwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501363 (https://phabricator.wikimedia.org/T216615) (owner: 10Urbanecm) [11:18:54] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:19:18] (03Merged) 10jenkins-bot: Create uploader user group for thwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501363 (https://phabricator.wikimedia.org/T216615) (owner: 10Urbanecm) [11:19:43] 10Operations, 10Core Platform Team, 10MediaWiki-General-or-Unknown, 10serviceops, 10PHP 7.2 support: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (10jcrespo) I wonder also if collation or other character-related updates... [11:20:54] 10Operations, 10Commons, 10Multimedia, 10Thumbor, and 2 others: Only one thumbor server (thumbor1002) upgraded to librsvg 2.40.20-3 - https://phabricator.wikimedia.org/T220342 (10jijiki) All servers have been upgraded to 2.40.20-3+wmf1+stretch1 [11:21:32] Urbanecm: Can you have a look at mwdebug1002? 
[11:21:41] hoo, sure [11:21:42] PROBLEM - PyBal IPVS diff check on lvs1006 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([druid1006.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [11:22:42] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1006.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:23:42] 10Operations, 10Core Platform Team, 10MediaWiki-General-or-Unknown, 10serviceops, 10PHP 7.2 support: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (10Joe) >>! In T219279#5092959, @jcrespo wrote: > I wonder also if collat... [11:24:12] hoo, works fine, thanks [11:24:26] Cool :) I'll go ahead then [11:25:13] (03CR) 10Filippo Giunchedi: "see nit inline, otherwise LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/490073 (https://phabricator.wikimedia.org/T214289) (owner: 10Fsero) [11:25:30] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/502180 (https://phabricator.wikimedia.org/T220342) (owner: 10Effie Mouzeli) [11:25:49] !log hoo@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Create uploader user group for thwiki (T216615) (duration: 00m 58s) [11:25:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:53] T216615: Create uploader user group on thwiki - https://phabricator.wikimedia.org/T216615 [11:26:58] (03CR) 10jenkins-bot: Create uploader user group for thwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501363 (https://phabricator.wikimedia.org/T216615) (owner: 10Urbanecm) [11:27:00] (03CR) 10Giuseppe Lavagetto: Add dependency chain when pruning images (031 comment) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/501185 (owner: 10Giuseppe Lavagetto) [11:27:40] Urbanecm: Shall we continue with 
https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/486103? [11:27:58] hoo, yes please [11:28:01] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review, 10User-Elukey: Requesting ability to scap-deploy on stat1007 for gilles - https://phabricator.wikimedia.org/T220175 (10elukey) >>! In T220175#5092375, @fgiunchedi wrote: >>>! In T220175#5088610, @elukey wrote: >>>>! In T220175#5088605, @RobH wrote: >... [11:28:29] (03PS11) 10Giuseppe Lavagetto: Add an update action [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/487793 [11:28:31] (03PS4) 10Giuseppe Lavagetto: Move pulling logic to us, away from the docker daemon [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/501017 [11:28:33] (03PS3) 10Giuseppe Lavagetto: Fix the nightly build behaviour [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/501183 [11:28:35] (03PS3) 10Giuseppe Lavagetto: Depend on docker-py 3.x [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/501184 [11:28:37] (03PS3) 10Giuseppe Lavagetto: Add dependency chain when pruning images [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/501185 [11:28:39] (03PS3) 10Giuseppe Lavagetto: Add changelog [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/501186 [11:29:54] (03CR) 10Hoo man: [C: 03+2] Enable blocking feature of AbuseFilter in zh.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486103 (https://phabricator.wikimedia.org/T210364) (owner: 10Ammarpad) [11:30:54] (03Merged) 10jenkins-bot: Enable blocking feature of AbuseFilter in zh.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486103 (https://phabricator.wikimedia.org/T210364) (owner: 10Ammarpad) [11:33:02] (03PS1) 10Filippo Giunchedi: admin: add wdoran to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/502199 (https://phabricator.wikimedia.org/T219898) [11:33:11] Urbanecm: Test on mwdebug1002 please [11:33:55] (03PS2) 10Effie Mouzeli: thumbor: Fix apt pinning for 
wikimedia-thumbor [puppet] - 10https://gerrit.wikimedia.org/r/502180 (https://phabricator.wikimedia.org/T220342) [11:34:12] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:34:15] (03CR) 10Effie Mouzeli: thumbor: Fix apt pinning for wikimedia-thumbor (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/502180 (https://phabricator.wikimedia.org/T220342) (owner: 10Effie Mouzeli) [11:34:26] hoo, sorry, I'm having problems with my ZNC host [11:34:32] is there anything I missed? [11:34:37] 10Operations, 10MediaWiki-extensions-UrlShortener, 10Traffic, 10Patch-For-Review, 10User-Ladsgroup: Make UrlShortener 404s cacheable - https://phabricator.wikimedia.org/T220190 (10ema) p:05Triage→03Normal [11:34:52] (03CR) 10Muehlenhoff: [C: 03+1] admin: add wdoran to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/502199 (https://phabricator.wikimedia.org/T219898) (owner: 10Filippo Giunchedi) [11:35:41] Urbanecm: hm, you should be able to see it [11:36:10] 10Operations, 10Performance-Team, 10Traffic: Some load.php requests failing due to "ERR_SPDY_PROTOCOL_ERROR 200" - https://phabricator.wikimedia.org/T220022 (10ema) p:05Triage→03Normal [11:36:31] (03CR) 10Effie Mouzeli: [C: 03+2] thumbor: Fix apt pinning for wikimedia-thumbor [puppet] - 10https://gerrit.wikimedia.org/r/502180 (https://phabricator.wikimedia.org/T220342) (owner: 10Effie Mouzeli) [11:36:41] (03PS3) 10Effie Mouzeli: thumbor: Fix apt pinning for wikimedia-thumbor [puppet] - 10https://gerrit.wikimedia.org/r/502180 (https://phabricator.wikimedia.org/T220342) [11:36:46] 10Operations, 10MediaWiki-extensions-UrlShortener, 10Traffic, 10Patch-For-Review: Shortened URLs won't redirect when there's data - https://phabricator.wikimedia.org/T219986 (10ema) p:05Triage→03Normal [11:36:58] PROBLEM - puppet last run on druid1002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [11:36:58] Ok, seems to work for me [11:37:03] Urbanecm: good to go? [11:37:51] hoo, I meant I was having problems with my IRC connection :-) [11:38:04] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1006.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:38:08] (03CR) 10jenkins-bot: Enable blocking feature of AbuseFilter in zh.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486103 (https://phabricator.wikimedia.org/T210364) (owner: 10Ammarpad) [11:38:52] Urbanecm: Oh, I see… the change looks good to me on mwdebug1002 [11:38:56] so continue? [11:39:15] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/502199 (https://phabricator.wikimedia.org/T219898) (owner: 10Filippo Giunchedi) [11:39:17] (03CR) 10Filippo Giunchedi: [C: 03+2] admin: add wdoran to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/502199 (https://phabricator.wikimedia.org/T219898) (owner: 10Filippo Giunchedi) [11:39:20] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:39:33] hoo, yes [11:40:34] (03PS4) 10Effie Mouzeli: thumbor: Fix apt pinning for wikimedia-thumbor [puppet] - 10https://gerrit.wikimedia.org/r/502180 (https://phabricator.wikimedia.org/T220342) [11:41:26] !log hoo@deploy1001 Synchronized wmf-config/abusefilter.php: Enable blocking feature of AbuseFilter in zh.wikipedia (T210364) (duration: 00m 58s) [11:41:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:30] T210364: Enable the blocking feature of AbuseFilter on zhwiki - https://phabricator.wikimedia.org/T210364 [11:42:02] (03PS5) 10Effie Mouzeli: thumbor: Fix apt pinning for wikimedia-thumbor [puppet] - 10https://gerrit.wikimedia.org/r/502180 (https://phabricator.wikimedia.org/T220342) 
[11:42:19] thanks for deploying the patches, hoo [11:42:27] You're welcome :) [11:42:38] (03CR) 10jerkins-bot: [V: 04-1] thumbor: Fix apt pinning for wikimedia-thumbor [puppet] - 10https://gerrit.wikimedia.org/r/502180 (https://phabricator.wikimedia.org/T220342) (owner: 10Effie Mouzeli) [11:43:03] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/502180 (https://phabricator.wikimedia.org/T220342) (owner: 10Effie Mouzeli) [11:43:12] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1005.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:44:11] (03PS6) 10Effie Mouzeli: thumbor: Fix apt pinning for wikimedia-thumbor [puppet] - 10https://gerrit.wikimedia.org/r/502180 (https://phabricator.wikimedia.org/T220342) [11:44:32] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:46:40] PROBLEM - HHVM rendering on mw2176 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.074 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:46:42] PROBLEM - Nginx local proxy to apache on mw2179 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.153 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:46:42] PROBLEM - Nginx local proxy to apache on mw2170 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.158 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:46:42] PROBLEM - HHVM rendering on mw2174 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.073 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:46:42] PROBLEM - HHVM rendering on mw2172 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.074 second response
time https://wikitech.wikimedia.org/wiki/Application_servers [11:46:42] PROBLEM - Apache HTTP on mw2172 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.073 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:46:46] PROBLEM - Nginx local proxy to apache on mw2175 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.154 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:46:52] PROBLEM - Nginx local proxy to apache on mw2172 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.154 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:46:54] PROBLEM - Nginx local proxy to apache on mw2178 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.155 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:47:02] PROBLEM - HHVM rendering on mw2179 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.074 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:47:06] PROBLEM - HHVM processes on mw2176 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm https://wikitech.wikimedia.org/wiki/Application_servers [11:47:08] PROBLEM - HHVM processes on mw2178 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm https://wikitech.wikimedia.org/wiki/Application_servers [11:47:08] PROBLEM - Apache HTTP on mw2176 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.073 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:47:10] PROBLEM - HHVM processes on mw2173 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm https://wikitech.wikimedia.org/wiki/Application_servers [11:47:16] PROBLEM - Nginx local proxy to apache on mw2177 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.154 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers [11:47:16] PROBLEM - HHVM processes on mw2171 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm https://wikitech.wikimedia.org/wiki/Application_servers [11:47:18] PROBLEM - Nginx local proxy to apache on mw2176 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.153 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:47:18] PROBLEM - Apache HTTP on mw2177 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.074 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:47:24] PROBLEM - HHVM rendering on mw2178 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.073 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:47:24] PROBLEM - Apache HTTP on mw2179 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.073 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:47:24] PROBLEM - Apache HTTP on mw2170 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.073 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:47:24] PROBLEM - HHVM rendering on mw2175 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.073 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:47:28] PROBLEM - Apache HTTP on mw2173 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.074 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:47:28] PROBLEM - Apache HTTP on mw2171 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.074 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:47:30] PROBLEM - HHVM rendering on mw2171 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.073 
second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:47:31] PROBLEM - HHVM rendering on mw2177 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.073 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:47:32] PROBLEM - Nginx local proxy to apache on mw2174 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.154 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:47:34] PROBLEM - HHVM rendering on mw2173 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.073 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:47:36] PROBLEM - Apache HTTP on mw2175 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.073 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:47:36] PROBLEM - Apache HTTP on mw2178 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.073 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:47:37] moritzm: this is you right? 
--^ [11:47:38] PROBLEM - HHVM processes on mw2179 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm https://wikitech.wikimedia.org/wiki/Application_servers [11:47:38] PROBLEM - HHVM rendering on mw2170 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.074 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:47:40] PROBLEM - HHVM processes on mw2172 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm https://wikitech.wikimedia.org/wiki/Application_servers [11:47:42] PROBLEM - Nginx local proxy to apache on mw2173 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.153 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:47:42] PROBLEM - Nginx local proxy to apache on mw2171 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.153 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:47:54] PROBLEM - HHVM processes on mw2174 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm https://wikitech.wikimedia.org/wiki/Application_servers [11:47:58] PROBLEM - Apache HTTP on mw2174 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.073 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:48:04] PROBLEM - HHVM processes on mw2175 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm https://wikitech.wikimedia.org/wiki/Application_servers [11:48:10] PROBLEM - HHVM processes on mw2170 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm https://wikitech.wikimedia.org/wiki/Application_servers [11:48:10] PROBLEM - HHVM processes on mw2177 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm https://wikitech.wikimedia.org/wiki/Application_servers [11:48:22] meh, it's depooled, just misses proper silenced icinga checks [11:48:26] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL 
CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1004.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:48:48] sorry for the noise [11:48:56] !log contint2001: stopping zuul-server , it is not meant to be running there [11:48:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:22] RECOVERY - HHVM processes on mw2175 is OK: PROCS OK: 6 processes with command name hhvm https://wikitech.wikimedia.org/wiki/Application_servers [11:49:24] RECOVERY - Nginx local proxy to apache on mw2175 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 1.786 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:49:44] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:49:58] RECOVERY - Apache HTTP on mw2170 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.148 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:50:00] RECOVERY - Apache HTTP on mw2179 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 2.336 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:50:08] RECOVERY - HHVM rendering on mw2177 is OK: HTTP OK: HTTP/1.1 200 OK - 79440 bytes in 4.203 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:50:08] RECOVERY - Nginx local proxy to apache on mw2174 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 1.936 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:50:12] (03CR) 10Hoo man: [C: 03+2] WikibaseClient: Conditionally enable mapframe support [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501335 (https://phabricator.wikimedia.org/T218051) (owner: 10Hoo man) [11:50:12] RECOVERY - Apache HTTP on mw2178 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.916 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers [11:50:12] RECOVERY - HHVM processes on mw2179 is OK: PROCS OK: 6 processes with command name hhvm https://wikitech.wikimedia.org/wiki/Application_servers [11:50:14] RECOVERY - HHVM processes on mw2172 is OK: PROCS OK: 6 processes with command name hhvm https://wikitech.wikimedia.org/wiki/Application_servers [11:50:16] RECOVERY - Nginx local proxy to apache on mw2173 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 0.278 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:50:18] RECOVERY - Nginx local proxy to apache on mw2171 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 2.301 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:50:22] RECOVERY - HHVM rendering on mw2170 is OK: HTTP OK: HTTP/1.1 200 OK - 79440 bytes in 9.435 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:50:28] RECOVERY - HHVM processes on mw2174 is OK: PROCS OK: 6 processes with command name hhvm https://wikitech.wikimedia.org/wiki/Application_servers [11:50:34] RECOVERY - Nginx local proxy to apache on mw2179 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.189 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:50:34] RECOVERY - Nginx local proxy to apache on mw2170 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 0.358 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:50:38] RECOVERY - Apache HTTP on mw2174 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 5.323 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:50:38] RECOVERY - Apache HTTP on mw2172 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 3.195 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:50:40] RECOVERY - HHVM rendering on mw2176 is OK: HTTP OK: HTTP/1.1 200 OK - 79440 bytes in 8.308 
second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:50:44] RECOVERY - HHVM rendering on mw2174 is OK: HTTP OK: HTTP/1.1 200 OK - 79440 bytes in 9.088 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:50:44] RECOVERY - HHVM rendering on mw2172 is OK: HTTP OK: HTTP/1.1 200 OK - 79440 bytes in 9.347 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:50:44] RECOVERY - Nginx local proxy to apache on mw2172 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.191 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:50:46] RECOVERY - HHVM processes on mw2170 is OK: PROCS OK: 6 processes with command name hhvm https://wikitech.wikimedia.org/wiki/Application_servers [11:50:46] RECOVERY - HHVM processes on mw2177 is OK: PROCS OK: 6 processes with command name hhvm https://wikitech.wikimedia.org/wiki/Application_servers [11:50:48] RECOVERY - Nginx local proxy to apache on mw2178 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 0.608 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:50:58] RECOVERY - HHVM rendering on mw2179 is OK: HTTP OK: HTTP/1.1 200 OK - 79440 bytes in 3.892 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:51:00] RECOVERY - HHVM processes on mw2176 is OK: PROCS OK: 6 processes with command name hhvm https://wikitech.wikimedia.org/wiki/Application_servers [11:51:00] RECOVERY - HHVM processes on mw2178 is OK: PROCS OK: 6 processes with command name hhvm https://wikitech.wikimedia.org/wiki/Application_servers [11:51:02] RECOVERY - Apache HTTP on mw2176 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.314 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:51:02] RECOVERY - HHVM processes on mw2173 is OK: PROCS OK: 6 processes with command name hhvm https://wikitech.wikimedia.org/wiki/Application_servers [11:51:08] 
RECOVERY - Nginx local proxy to apache on mw2177 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.245 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:51:08] RECOVERY - HHVM processes on mw2171 is OK: PROCS OK: 6 processes with command name hhvm https://wikitech.wikimedia.org/wiki/Application_servers [11:51:12] RECOVERY - Nginx local proxy to apache on mw2176 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.217 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:51:12] RECOVERY - Apache HTTP on mw2177 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.138 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:51:20] RECOVERY - Apache HTTP on mw2171 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.132 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:51:20] RECOVERY - Apache HTTP on mw2173 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.286 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:51:22] (03Merged) 10jenkins-bot: WikibaseClient: Conditionally enable mapframe support [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501335 (https://phabricator.wikimedia.org/T218051) (owner: 10Hoo man) [11:51:22] RECOVERY - HHVM rendering on mw2178 is OK: HTTP OK: HTTP/1.1 200 OK - 79440 bytes in 5.235 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:51:26] RECOVERY - HHVM rendering on mw2173 is OK: HTTP OK: HTTP/1.1 200 OK - 79438 bytes in 0.352 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:51:26] RECOVERY - HHVM rendering on mw2171 is OK: HTTP OK: HTTP/1.1 200 OK - 79440 bytes in 4.303 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:52:36] RECOVERY - HHVM rendering on mw2175 is OK: HTTP OK: HTTP/1.1 200 OK - 79440 bytes in 3.524 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers [11:52:44] RECOVERY - Apache HTTP on mw2175 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.107 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:53:34] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1006.eqiad.wmnet, druid1005.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:53:39] !log hoo@deploy1001 Synchronized wmf-config/Wikibase.php: WikibaseClient: Conditionally enable mapframe support (T218051) (duration: 00m 58s) [11:53:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:54] (03CR) 10jenkins-bot: WikibaseClient: Conditionally enable mapframe support [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501335 (https://phabricator.wikimedia.org/T218051) (owner: 10Hoo man) [12:00:00] !log contint1001 upgraded zuul to 2.5.1-wmf6 # T208426 [12:00:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:04] T208426: Deploy Zuul 2.5.1-wmf6 - https://phabricator.wikimedia.org/T208426 [12:00:14] 10Operations, 10Puppet, 10User-jijiki: Add require_package() variant with repository component to wmflib - https://phabricator.wikimedia.org/T178575 (10jijiki) [12:01:27] (03PS1) 10Arturo Borrero Gonzalez: openstack: serverpackages: mitaka: stretch: drop puppet cleanup code [puppet] - 10https://gerrit.wikimedia.org/r/502200 [12:03:04] !log T219776 puppet node deactivate labtestnet2003.codfw.wmnet [12:03:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:08] T219776: labtestnet2003.codfw.wmnet: rename to cloudnet2003-dev.codfw.wmnet and reimage to stretch - https://phabricator.wikimedia.org/T219776 [12:03:20] RECOVERY - puppet last run on druid1002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:03:35] (03CR) 10Arturo Borrero 
Gonzalez: [C: 03+2] openstack: serverpackages: mitaka: stretch: drop puppet cleanup code [puppet] - 10https://gerrit.wikimedia.org/r/502200 (owner: 10Arturo Borrero Gonzalez) [12:05:06] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:05:39] (03CR) 10Effie Mouzeli: [C: 03+2] thumbor: Fix apt pinning for wikimedia-thumbor [puppet] - 10https://gerrit.wikimedia.org/r/502180 (https://phabricator.wikimedia.org/T220342) (owner: 10Effie Mouzeli) [12:05:50] (03PS7) 10Effie Mouzeli: thumbor: Fix apt pinning for wikimedia-thumbor [puppet] - 10https://gerrit.wikimedia.org/r/502180 (https://phabricator.wikimedia.org/T220342) [12:06:32] (03CR) 10Vgutierrez: [C: 03+1] "LGTM! BTW consider at some point renaming the templates to *.erb" [puppet] - 10https://gerrit.wikimedia.org/r/499669 (https://phabricator.wikimedia.org/T102367) (owner: 10BryanDavis) [12:06:42] !log reboot cloudvirt1009 to clean some ACPI errors in dmesg [12:06:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:47] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/502195 (https://phabricator.wikimedia.org/T220359) (owner: 10Vgutierrez) [12:09:00] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1006.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:09:54] (03PS1) 10Jbond: facter3/puppet5: Introduce parameters to introduce facter and puppet [puppet] - 10https://gerrit.wikimedia.org/r/502201 (https://phabricator.wikimedia.org/T219803) [12:10:09] (03CR) 10Vgutierrez: [C: 03+2] netbox: Offer an ECDSA certificate along with the RSA one [puppet] - 10https://gerrit.wikimedia.org/r/502195 (https://phabricator.wikimedia.org/T220359) (owner: 10Vgutierrez) [12:10:18] (03PS2) 10Vgutierrez: netbox: Offer an ECDSA certificate along with the RSA one [puppet] - 
10https://gerrit.wikimedia.org/r/502195 (https://phabricator.wikimedia.org/T220359) [12:10:57] (03CR) 10jerkins-bot: [V: 04-1] facter3/puppet5: Introduce parameters to introduce facter and puppet [puppet] - 10https://gerrit.wikimedia.org/r/502201 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [12:11:18] (03PS1) 10Alex Monk: labs central puppetmaster: Allow cumin functionality to be disabled [puppet] - 10https://gerrit.wikimedia.org/r/502202 (https://phabricator.wikimedia.org/T219421) [12:13:24] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:13:53] (03PS1) 10Hashar: zuul: service was always set to running [puppet] - 10https://gerrit.wikimedia.org/r/502203 [12:14:10] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:14:16] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/502203 (owner: 10Hashar) [12:14:28] RECOVERY - PyBal IPVS diff check on lvs1006 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [12:15:20] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [12:17:40] PROBLEM - puppet last run on thumbor2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:17:51] ^ that is me [12:17:59] the thumbor puppet errors [12:20:13] 10Operations, 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Sunset Wikimetrics - https://phabricator.wikimedia.org/T211835 (10fgiunchedi) Removing mailing-lists tag since work there is done. 
[12:20:16] (03PS1) 10Effie Mouzeli: thumbor: Fix package declaration in init.pp [puppet] - 10https://gerrit.wikimedia.org/r/502204 [12:20:22] 10Operations: Create MoveCom mailing list for Movement communications group - https://phabricator.wikimedia.org/T218367 (10fgiunchedi) Removing mailing-lists tag since work there is done. [12:20:45] 10Operations, 10Wikimedia-Mailing-lists: Create MoveCom mailing list for Movement communications group - https://phabricator.wikimedia.org/T218367 (10fgiunchedi) [12:20:59] (03CR) 10Hashar: "I am not sure what is going on :-] I could use facts to be refreshed on the puppet compiler nodes since contint2001.wikimedia.org facts " [puppet] - 10https://gerrit.wikimedia.org/r/502203 (owner: 10Hashar) [12:21:14] PROBLEM - puppet last run on thumbor1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:21:33] (03CR) 10Muehlenhoff: [C: 03+1] thumbor: Fix package declaration in init.pp [puppet] - 10https://gerrit.wikimedia.org/r/502204 (owner: 10Effie Mouzeli) [12:21:54] 10Operations, 10Analytics: Terminate Wikimetrics - https://phabricator.wikimedia.org/T219446 (10fgiunchedi) Removing mailing-lists tag since work there is done. [12:22:02] (03CR) 10Effie Mouzeli: [C: 03+2] thumbor: Fix package declaration in init.pp [puppet] - 10https://gerrit.wikimedia.org/r/502204 (owner: 10Effie Mouzeli) [12:22:36] PROBLEM - puppet last run on thumbor1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:24:02] PROBLEM - puppet last run on thumbor2002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [12:25:23] !log upgrading API servers in codfw to HHVM 3.18.5+dfsg-1+wmf8+deb9u2 and wikidiff 1.8.1 (T203069) [12:25:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:27] T203069: Deploy wikidiff2 v1.8.1 with changed signature - https://phabricator.wikimedia.org/T203069 [12:25:49] 10Operations, 10Wikimedia-Logstash, 10service-runner, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 2 others: Move graphoid logging to new logging pipeline - https://phabricator.wikimedia.org/T219923 (10fgiunchedi) >>! In T219923#5092203, @akosiaris wrote: >>>! In T219... [12:26:18] RECOVERY - puppet last run on thumbor1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:26:44] PROBLEM - puppet last run on thumbor1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:27:42] RECOVERY - puppet last run on thumbor1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:28:21] (03PS1) 10Vgutierrez: librenms: Offer an ECDSA certificate along with the RSA one [puppet] - 10https://gerrit.wikimedia.org/r/502205 [12:28:50] (03PS2) 10Vgutierrez: librenms: Offer an ECDSA certificate along with the RSA one [puppet] - 10https://gerrit.wikimedia.org/r/502205 (https://phabricator.wikimedia.org/T220359) [12:30:46] PROBLEM - puppet last run on mc1027 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [12:30:49] (03PS1) 10Gilles: Treat temp containers as private [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/502206 (https://phabricator.wikimedia.org/T220265) [12:31:03] (03PS2) 10Jbond: facter3/puppet5: Introduce parameters to introduce facter and puppet [puppet] - 10https://gerrit.wikimedia.org/r/502201 (https://phabricator.wikimedia.org/T219803) [12:31:21] !log contint2001: upgraded python-pbr 0.8.2-1 -> 1.10.0-1 # T218559 [12:31:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:25] T218559: puppet broken on integration WMCS instances due to openstack Debian packages - https://phabricator.wikimedia.org/T218559 [12:31:52] RECOVERY - puppet last run on thumbor1004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:31:57] 10Operations, 10Acme-chief, 10Traffic: Provide an staging environment for acme-chief - https://phabricator.wikimedia.org/T220378 (10Vgutierrez) [12:32:17] 10Operations, 10Acme-chief, 10Traffic: Provide an staging environment for acme-chief - https://phabricator.wikimedia.org/T220378 (10Vgutierrez) p:05Triage→03Normal [12:32:58] RECOVERY - puppet last run on thumbor2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:33:39] (03PS1) 10Hashar: zuul: stop pinning python-pbr [puppet] - 10https://gerrit.wikimedia.org/r/502207 (https://phabricator.wikimedia.org/T218559) [12:34:18] RECOVERY - puppet last run on thumbor2002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:36:05] (03PS1) 10Ema: hieradata/labs: add profile::cache::ssl::wikibase settings [puppet] - 10https://gerrit.wikimedia.org/r/502208 (https://phabricator.wikimedia.org/T213705) [12:36:26] (03PS2) 10Ema: hieradata/labs: add profile::cache::ssl::wikibase settings [puppet] - 10https://gerrit.wikimedia.org/r/502208 (https://phabricator.wikimedia.org/T213705) [12:37:17] (03Abandoned) 
10Hashar: Revert "puppet_alert: Email projectadmins instead of members" [puppet] - 10https://gerrit.wikimedia.org/r/497595 (https://phabricator.wikimedia.org/T218559) (owner: 10Hashar) [12:38:49] (03CR) 10Vgutierrez: [C: 03+1] "Looks good to me, but I don't know if this could collide with deployment-prep environment" [puppet] - 10https://gerrit.wikimedia.org/r/502208 (https://phabricator.wikimedia.org/T213705) (owner: 10Ema) [12:39:14] PROBLEM - puppet last run on thumbor1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:40:18] (03PS2) 10Arturo Borrero Gonzalez: labs central puppetmaster: Allow cumin functionality to be disabled [puppet] - 10https://gerrit.wikimedia.org/r/502202 (https://phabricator.wikimedia.org/T219421) (owner: 10Alex Monk) [12:41:24] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] labs central puppetmaster: Allow cumin functionality to be disabled [puppet] - 10https://gerrit.wikimedia.org/r/502202 (https://phabricator.wikimedia.org/T219421) (owner: 10Alex Monk) [12:44:30] RECOVERY - puppet last run on thumbor1003 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [12:46:16] (03PS1) 10Hashar: contint: puppet-lint is no more needed [puppet] - 10https://gerrit.wikimedia.org/r/502211 [12:46:43] (03PS5) 10Jbond: puppet: Refactor of the base::puppet class [puppet] - 10https://gerrit.wikimedia.org/r/501617 (https://phabricator.wikimedia.org/T219803) [12:47:15] (03PS3) 10Elukey: admin: remove sudo permissions from gpu-testers and add users to it [puppet] - 10https://gerrit.wikimedia.org/r/501575 (https://phabricator.wikimedia.org/T148843) [12:48:03] (03CR) 10Jbond: "Thanks for the review cas, i think i have addresses all your comments accept the statsd one. Im not sure if that should remain or not. 
i" (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/501617 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [12:48:38] (03CR) 10Elukey: [C: 03+2] "Merging without SRE's meeting due to the following considerations:" [puppet] - 10https://gerrit.wikimedia.org/r/501575 (https://phabricator.wikimedia.org/T148843) (owner: 10Elukey) [12:50:45] (03PS2) 10Arturo Borrero Gonzalez: Remove now obsolete tools-checker-grid-start-trusty monitoring service [puppet] - 10https://gerrit.wikimedia.org/r/502167 (owner: 10Muehlenhoff) [12:51:23] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/502167 (owner: 10Muehlenhoff) [12:52:56] 10Operations, 10Continuous-Integration-Infrastructure: Upload Zuul 2.5.1-wmf6 package to apt.wikimedia.org - https://phabricator.wikimedia.org/T220380 (10hashar) [12:53:35] (03Abandoned) 10Jbond: facter3 and puppet5: add repositories for puppet5 and facter3 [puppet] - 10https://gerrit.wikimedia.org/r/501618 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [12:54:52] (03Abandoned) 10Arturo Borrero Gonzalez: Revert "wmnet: introduce cloudnet2003-dev.codfw.wmnet FQDNs" [dns] - 10https://gerrit.wikimedia.org/r/500426 (owner: 10Arturo Borrero Gonzalez) [12:55:25] (03PS2) 10Arturo Borrero Gonzalez: hieradata: openstack: fix path of wikitech phab API token [labs/private] - 10https://gerrit.wikimedia.org/r/501175 [12:56:22] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] hieradata: openstack: fix path of wikitech phab API token [labs/private] - 10https://gerrit.wikimedia.org/r/501175 (owner: 10Arturo Borrero Gonzalez) [12:57:04] RECOVERY - puppet last run on mc1027 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [13:00:57] (03PS3) 10Marostegui: db-eqiad.php: Promote db1075 to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501481 (https://phabricator.wikimedia.org/T219115) [13:03:16] 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 
others: labtestnet2003.codfw.wmnet: rename to cloudnet2003-dev.codfw.wmnet and reimage to stretch - https://phabricator.wikimedia.org/T219776 (10aborrero) 05Open→03Resolved ` aborrero@cumin1001:~ $ sudo cumin labtestnet2003* No hosts found that ma... [13:11:29] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, see inline" (031 comment) [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/502206 (https://phabricator.wikimedia.org/T220265) (owner: 10Gilles) [13:15:31] (03CR) 10Effie Mouzeli: [C: 03+1] Upgrade to 2.3 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/488060 (https://phabricator.wikimedia.org/T198370) (owner: 10Gilles) [13:18:41] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, 10User-Joe: publish 1.9.1 envoy docker image - https://phabricator.wikimedia.org/T220382 (10fsero) [13:18:46] (03CR) 10Filippo Giunchedi: [C: 03+2] DNS: Change production DNS for restbase2019 and restbase2020 [dns] - 10https://gerrit.wikimedia.org/r/501686 (owner: 10Papaul) [13:19:15] (03CR) 10Muehlenhoff: facter3/puppet5: Introduce parameters to introduce facter and puppet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/502201 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [13:23:32] (03PS1) 10Vgutierrez: add acmechief-test[12]001 instances [dns] - 10https://gerrit.wikimedia.org/r/502216 (https://phabricator.wikimedia.org/T220378) [13:23:59] 10Operations, 10ops-codfw, 10serviceops, 10User-jijiki: mw2206.codfw.wmnet memory issues - https://phabricator.wikimedia.org/T215415 (10MoritzMuehlenhoff) >>! In T215415#5032971, @Papaul wrote: > Also the error I have here is not telling me which memory row or channel it refers to so it's difficult to tell... [13:24:20] (03CR) 10Volans: [C: 03+1] "LGTM, a nit inline. The deployment of this must be coordinated with the change on the API side." 
(032 comments) [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/499245 (owner: 10CRusnov) [13:25:48] (03PS1) 10Fsero: Updating envoy to 1.9.1 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/502217 (https://phabricator.wikimedia.org/T220382) [13:25:52] (03PS3) 10Jbond: facter3/puppet5: Introduce parameters to introduce facter and puppet [puppet] - 10https://gerrit.wikimedia.org/r/502201 (https://phabricator.wikimedia.org/T219803) [13:26:06] (03CR) 10Jbond: facter3/puppet5: Introduce parameters to introduce facter and puppet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/502201 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [13:27:02] (03CR) 10Dzahn: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/502203 (owner: 10Hashar) [13:28:36] (03CR) 10Fsero: "Build it and tested it locally" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/502217 (https://phabricator.wikimedia.org/T220382) (owner: 10Fsero) [13:31:29] (03PS1) 10Gehel: elasticsearch: reset all indices to read/write [software/spicerack] - 10https://gerrit.wikimedia.org/r/502218 (https://phabricator.wikimedia.org/T219799) [13:31:52] (03CR) 10Dzahn: "code looks like it should disable it on 2001 though.. ehm..." [puppet] - 10https://gerrit.wikimedia.org/r/502203 (owner: 10Hashar) [13:32:57] Anyone around who knows anything about VIPS/VipsScaler? [13:33:27] (03CR) 10Dzahn: "well.. 
the service already IS currently stopped on contint2001 so i guess that matches" [puppet] - 10https://gerrit.wikimedia.org/r/502203 (owner: 10Hashar) [13:34:52] (03CR) 10Muehlenhoff: facter3/puppet5: Introduce parameters to introduce facter and puppet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/502201 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [13:34:54] (03CR) 10Volans: [C: 03+1] "LGTM, nit inline" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/502218 (https://phabricator.wikimedia.org/T219799) (owner: 10Gehel) [13:35:21] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: reset all indices to read/write [software/spicerack] - 10https://gerrit.wikimedia.org/r/502218 (https://phabricator.wikimedia.org/T219799) (owner: 10Gehel) [13:36:17] 10Operations, 10Traffic: Evaluate ATS TLS stack - https://phabricator.wikimedia.org/T220383 (10Vgutierrez) [13:36:32] 10Operations, 10Traffic: Evaluate ATS TLS stack - https://phabricator.wikimedia.org/T220383 (10Vgutierrez) p:05Triage→03Normal [13:36:35] 10Operations, 10Wikimedia-Mailing-lists: Create MoveCom mailing list for Movement communications group - https://phabricator.wikimedia.org/T218367 (10fgiunchedi) >>! In T218367#5026199, @Varnent wrote: > Actually, this will essentially be a replacement for the existing ComCom list. Perhaps for archive preserva... [13:37:06] 10Operations, 10ops-codfw, 10serviceops, 10User-jijiki: mw2206.codfw.wmnet memory issues - https://phabricator.wikimedia.org/T215415 (10Papaul) @MoritzMuehlenhoff yes we do have some. Will replace A2 once on site. 
[13:39:12] 10Operations, 10Traffic: Evaluate ATS TLS stack - https://phabricator.wikimedia.org/T220383 (10Vgutierrez) [13:41:18] (03CR) 10Vgutierrez: [C: 03+2] add acmechief-test[12]001 instances [dns] - 10https://gerrit.wikimedia.org/r/502216 (https://phabricator.wikimedia.org/T220378) (owner: 10Vgutierrez) [13:41:22] (03PS2) 10Vgutierrez: add acmechief-test[12]001 instances [dns] - 10https://gerrit.wikimedia.org/r/502216 (https://phabricator.wikimedia.org/T220378) [13:41:52] !log upgrading job runners in codfw to HHVM 3.18.5+dfsg-1+wmf8+deb9u2 and wikidiff 1.8.1 (T203069) [13:41:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:55] T203069: Deploy wikidiff2 v1.8.1 with changed signature - https://phabricator.wikimedia.org/T203069 [13:44:27] (03PS4) 10Jbond: facter3/puppet5: Introduce parameters to introduce facter and puppet [puppet] - 10https://gerrit.wikimedia.org/r/502201 (https://phabricator.wikimedia.org/T219803) [13:44:36] (03CR) 10Jbond: facter3/puppet5: Introduce parameters to introduce facter and puppet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/502201 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [13:45:02] (03CR) 10jerkins-bot: [V: 04-1] facter3/puppet5: Introduce parameters to introduce facter and puppet [puppet] - 10https://gerrit.wikimedia.org/r/502201 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [13:45:07] RECOVERY - Host restbase2019 is UP: PING OK - Packet loss = 0%, RTA = 36.18 ms [13:45:31] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1006.eqiad.wmnet, druid1005.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:46:31] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([druid1005.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [13:48:25] (03PS1) 10Alex Monk: labs cumin: Allow running 
nfs_hostlist script outside a puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/502219 (https://phabricator.wikimedia.org/T219421) [13:49:17] (03PS1) 10Gehel: elasticsearch: reset all indices to read/write [cookbooks] - 10https://gerrit.wikimedia.org/r/502220 (https://phabricator.wikimedia.org/T219799) [13:49:52] 10Operations, 10Patch-For-Review: Reallocate former image scalers - https://phabricator.wikimedia.org/T192457 (10MoritzMuehlenhoff) During the HHVM updates I noticed that mw2151 is in site.pp as a jobrunner, but not listed in conftool-data. [13:50:15] (03PS2) 10Gehel: elasticsearch: reset all indices to read/write [software/spicerack] - 10https://gerrit.wikimedia.org/r/502218 (https://phabricator.wikimedia.org/T219799) [13:50:20] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: reset all indices to read/write [cookbooks] - 10https://gerrit.wikimedia.org/r/502220 (https://phabricator.wikimedia.org/T219799) (owner: 10Gehel) [13:50:47] 10Operations, 10Traffic: Evaluate ATS TLS stack - https://phabricator.wikimedia.org/T220383 (10BBlack) * 0100-dynamic-tls-records.patch - I don't think we ever managed to prove a significant benefit from this on initial deploy, but it's just one of those things that seemed like a "good idea" so long as it rema... 
[13:51:50] (03PS1) 10Dzahn: cassandra: include passwords in instance for testing [puppet] - 10https://gerrit.wikimedia.org/r/502221 [13:53:33] (03CR) 10jerkins-bot: [V: 04-1] cassandra: include passwords in instance for testing [puppet] - 10https://gerrit.wikimedia.org/r/502221 (owner: 10Dzahn) [13:53:59] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1006.eqiad.wmnet, druid1004.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:54:07] (03CR) 10Alexandros Kosiaris: [C: 04-1] Add an update action (031 comment) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/487793 (owner: 10Giuseppe Lavagetto) [13:54:22] 10Operations, 10Commons, 10Multimedia, 10Thumbor, and 2 others: Only one thumbor server (thumbor1002) upgraded to librsvg 2.40.20-3 - https://phabricator.wikimedia.org/T220342 (10jijiki) 05Open→03Resolved [13:54:42] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/compiler1002/15634/authdns2001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/456317 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [13:54:44] (03PS2) 10BBlack: Add CNAME-variant langlist template [dns] - 10https://gerrit.wikimedia.org/r/501628 (https://phabricator.wikimedia.org/T208263) [13:54:46] (03PS2) 10BBlack: wiktionary: test with zone-local CNAME->DYNA [dns] - 10https://gerrit.wikimedia.org/r/501629 (https://phabricator.wikimedia.org/T208263) [13:54:59] (03CR) 10Volans: [C: 03+1] "LGTM, if someone else could verify the ES details it would be great." [software/spicerack] - 10https://gerrit.wikimedia.org/r/502218 (https://phabricator.wikimedia.org/T219799) (owner: 10Gehel) [13:55:23] 10Operations, 10Puppet, 10User-jijiki: Add require_package() variant with repository component to wmflib - https://phabricator.wikimedia.org/T178575 (10herron) Since today we have a mix of `package` and `require_package` this would be very nice indeed. 
Does it need to be homegrown? Seems worthwhile to weigh... [13:55:42] (03PS3) 10Elukey: ores::base: fix package requires for Debian Buster [puppet] - 10https://gerrit.wikimedia.org/r/501608 (https://phabricator.wikimedia.org/T148843) [13:55:45] (03CR) 10Dzahn: [C: 04-1] "this isn't it. https://puppet-compiler.wmflabs.org/compiler1002/15641/sessionstore1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/502221 (owner: 10Dzahn) [13:56:09] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:56:21] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/502220 (https://phabricator.wikimedia.org/T219799) (owner: 10Gehel) [13:57:25] 10Operations, 10Core Platform Team, 10MediaWiki-General-or-Unknown, 10serviceops, 10PHP 7.2 support: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (10Joe) Even when backporting the above patches, I tried to run a simple... [13:58:11] urandom: where does the default for super_username even come from ?:) [13:58:36] still trying to understand that issue.. 
[13:59:07] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:59:56] 10Operations, 10Analytics, 10EventBus, 10serviceops, and 5 others: Enabling api-request eventgate to group1 caused minor service disruptions - https://phabricator.wikimedia.org/T218255 (10Ottomata) [14:00:01] 10Operations, 10Analytics, 10EventBus, 10serviceops, and 5 others: Enabling api-request eventgate to group1 caused minor service disruptions - https://phabricator.wikimedia.org/T218255 (10Ottomata) 05Open→03Resolved [14:00:19] 10Operations, 10Traffic, 10HTTPS, 10Tracking-Neverending: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681 (10Vgutierrez) [14:00:22] 10Operations, 10Traffic, 10HTTPS: Make sure that services available for NDA-only users are using strong TLS ciphersuites - https://phabricator.wikimedia.org/T217002 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez [14:00:23] (03CR) 10Elukey: [C: 03+2] ores::base: fix package requires for Debian Buster [puppet] - 10https://gerrit.wikimedia.org/r/501608 (https://phabricator.wikimedia.org/T148843) (owner: 10Elukey) [14:01:47] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:02:16] (03PS2) 10Fsero: Updating envoy to 1.9.1 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/502217 (https://phabricator.wikimedia.org/T220382) [14:02:25] 10Operations, 10User-herron: Transition Kafka main ownership from Analytics Engineering to SRE - (2018-2019 Q4 SRE Goal Tracking Task) - https://phabricator.wikimedia.org/T220387 (10herron) p:05Triage→03Normal [14:02:31] (03PS2) 10Dzahn: cassandra: change superuser_password for testing [puppet] - 10https://gerrit.wikimedia.org/r/502221 [14:02:40] (03CR) 10Gehel: [C: 04-1] "A few comments already, a few more are on their way." 
(034 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/493138 (https://phabricator.wikimedia.org/T217072) (owner: 10CRusnov) [14:03:13] (03CR) 10jerkins-bot: [V: 04-1] cassandra: change superuser_password for testing [puppet] - 10https://gerrit.wikimedia.org/r/502221 (owner: 10Dzahn) [14:04:44] elukey: im mostly away this week (and last) and next... but those wancache thing is in the train that is currently being deployed, so I guess we will see it go out to group1 today? [14:05:52] 10Operations: Review current architecture/capacity and establish plan for Kafka main cluster upgrade/refresh to cover needs for next 2-3 years - https://phabricator.wikimedia.org/T220389 (10herron) [14:05:54] 10Operations: Audit existing Kafka main producers/consumers and document their configuration and use cases - https://phabricator.wikimedia.org/T220390 (10herron) [14:05:56] 10Operations: Establish guideline documentation for Kafka cluster use cases (main, jumbo, logging, etc.) - https://phabricator.wikimedia.org/T220391 (10herron) [14:06:04] !log Temporarily serve thumbor traffic on thumbor1001 via haproxy - T187765 [14:06:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:08] T187765: Replace the Nginx fronting Thumbor with a reverse proxy capable of queuing requests - https://phabricator.wikimedia.org/T187765 [14:06:36] addshore: ah group1 is today? Good! [14:06:39] 10Operations, 10User-herron: Transition Kafka main ownership from Analytics Engineering to SRE - (2018-2019 Q4 SRE Goal Tracking Task) - https://phabricator.wikimedia.org/T220387 (10herron) [14:06:57] elukey: i believe so, its the train from last week, as long as it is unblocked now [14:06:58] addshore: yeah I am going to track the change as it is deploy, will report back in the task [14:07:05] wmf.24 [14:07:08] *it is deployed [14:07:22] give me a ping if your here when it happens too and I'll be close by! 
[14:07:27] 10Operations: Establish guideline documentation for Kafka cluster use cases (main, jumbo, logging, etc.) - https://phabricator.wikimedia.org/T220391 (10herron) [14:08:02] 10Operations: Audit existing Kafka main producers/consumers and document their configuration and use cases - https://phabricator.wikimedia.org/T220390 (10herron) [14:08:04] 10Operations: Review current architecture/capacity and establish plan for Kafka main cluster upgrade/refresh to cover needs for next 2-3 years - https://phabricator.wikimedia.org/T220389 (10herron) [14:08:07] 10Operations, 10Analytics, 10EventBus, 10Core Platform Team (Modern Event Platform (TEC2)), and 3 others: Possibly expand Kafka main-{eqiad,codfw} clusters in Q4 2019. - https://phabricator.wikimedia.org/T217359 (10herron) [14:08:10] 10Operations, 10User-herron: Transition Kafka main ownership from Analytics Engineering to SRE - (2018-2019 Q4 SRE Goal Tracking Task) - https://phabricator.wikimedia.org/T220387 (10herron) [14:08:25] addshore: ack thanks! 
[14:09:37] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([druid1005.eqiad.wmnet, druid1004.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [14:09:54] (03PS3) 10Dzahn: cassandra: change superuser_password for testing [puppet] - 10https://gerrit.wikimedia.org/r/502221 [14:09:59] PROBLEM - PyBal IPVS diff check on lvs1006 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([druid1006.eqiad.wmnet, druid1004.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [14:10:03] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1006.eqiad.wmnet, druid1004.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:10:25] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1005.eqiad.wmnet, druid1004.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:10:51] (03CR) 10jerkins-bot: [V: 04-1] cassandra: change superuser_password for testing [puppet] - 10https://gerrit.wikimedia.org/r/502221 (owner: 10Dzahn) [14:11:38] (03CR) 10Dzahn: "even setting the default to 'foo' here is a noop. 
adding a syntax error proofs it fails on sessionstore1001 though" [puppet] - 10https://gerrit.wikimedia.org/r/502221 (owner: 10Dzahn) [14:12:57] 10Operations, 10Traffic, 10netops: Anycast recdns - https://phabricator.wikimedia.org/T186550 (10BBlack) [14:13:52] (03PS1) 10Anomie: Set ActorTableSchemaMigrationStage => write-both/read-new on test wikis & mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502226 (https://phabricator.wikimedia.org/T188327) [14:14:35] (03PS5) 10Jbond: facter3/puppet5: Introduce parameters to introduce facter and puppet [puppet] - 10https://gerrit.wikimedia.org/r/502201 (https://phabricator.wikimedia.org/T219803) [14:14:47] (03CR) 10Anomie: [C: 03+2] "Deploying planned config change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502226 (https://phabricator.wikimedia.org/T188327) (owner: 10Anomie) [14:15:49] (03Merged) 10jenkins-bot: Set ActorTableSchemaMigrationStage => write-both/read-new on test wikis & mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502226 (https://phabricator.wikimedia.org/T188327) (owner: 10Anomie) [14:17:11] RECOVERY - Host restbase2020 is UP: PING OK - Packet loss = 0%, RTA = 36.21 ms [14:17:27] (03PS6) 10Jbond: facter3/puppet5: Introduce parameters to introduce facter and puppet [puppet] - 10https://gerrit.wikimedia.org/r/502201 (https://phabricator.wikimedia.org/T219803) [14:17:34] !log anomie@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Setting actor migration to write-both/read-new on test wikis and mediawikiwiki (T188327) (duration: 00m 59s) [14:17:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:38] T188327: Deploy refactored actor storage - https://phabricator.wikimedia.org/T188327 [14:17:49] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1012 with 10G interfaces - https://phabricator.wikimedia.org/T217346 (10Cmjohnson) @robh do you mean xe-2/0/27 and 2/028 ....both 
have link lights and are new out of the box cables. [14:19:10] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1018 with 10G interfaces - https://phabricator.wikimedia.org/T217347 (10Cmjohnson) @robh or @andrewbogott I don't know if a change was made but I see link lights on both ports for cloudvirt1018 [14:20:13] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1024 with 10G interfaces - https://phabricator.wikimedia.org/T216724 (10Cmjohnson) I replaced the cables on 1024 but still do not get a link. I see 1018 is working now...was a change there made? [14:21:57] 10Operations, 10Goal: TEC6: Database Automation - https://phabricator.wikimedia.org/T220395 (10CDanis) [14:22:03] (03CR) 10Herron: [C: 03+1] Add default SPF record for canonical domains [dns] - 10https://gerrit.wikimedia.org/r/499255 (https://phabricator.wikimedia.org/T193408) (owner: 10Vgutierrez) [14:22:15] \o/ [14:22:42] (03CR) 10Muehlenhoff: facter3/puppet5: Introduce parameters to introduce facter and puppet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/502201 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [14:22:52] 10Operations, 10Goal: TEC6: Database Automation - https://phabricator.wikimedia.org/T220395 (10CDanis) [14:22:54] 10Operations, 10DBA, 10MediaWiki-Configuration, 10Patch-For-Review, and 2 others: Create tool to handle the state of database configuration in MediaWiki in etcd - https://phabricator.wikimedia.org/T197126 (10CDanis) [14:23:44] (03CR) 10jenkins-bot: Set ActorTableSchemaMigrationStage => write-both/read-new on test wikis & mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502226 (https://phabricator.wikimedia.org/T188327) (owner: 10Anomie) [14:24:10] 10Operations, 10Recommendation-API, 10Wikimedia-Logstash, 10service-runner, and 3 others: Move recommendation-api logging to new logging pipeline - https://phabricator.wikimedia.org/T219926 
(10bmansurov) [14:24:17] (03CR) 10Jbond: facter3/puppet5: Introduce parameters to introduce facter and puppet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/502201 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [14:25:07] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/502205 (https://phabricator.wikimedia.org/T220359) (owner: 10Vgutierrez) [14:26:26] 10Operations, 10Recommendation-API, 10Wikimedia-Logstash, 10service-runner, and 3 others: Move recommendation-api logging to new logging pipeline - https://phabricator.wikimedia.org/T219926 (10bmansurov) @Pchelolo, once the patch is merged, is it safe to deploy or are we waiting on something else? [14:27:07] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:27:21] (03PS1) 10Vgutierrez: install_server: bare minimum puppetization to install acmechief-test[12]001 [puppet] - 10https://gerrit.wikimedia.org/r/502229 (https://phabricator.wikimedia.org/T220378) [14:28:34] (03CR) 10Vgutierrez: [C: 03+2] Add default SPF record for canonical domains [dns] - 10https://gerrit.wikimedia.org/r/499255 (https://phabricator.wikimedia.org/T193408) (owner: 10Vgutierrez) [14:28:39] (03PS2) 10Vgutierrez: Add default SPF record for canonical domains [dns] - 10https://gerrit.wikimedia.org/r/499255 (https://phabricator.wikimedia.org/T193408) [14:30:05] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:30:25] RECOVERY - PyBal IPVS diff check on lvs1006 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:30:36] !log otto@deploy1001 Started deploy [analytics/refinery@7fa6fb7]: deploying oozie article recommender for baho [14:30:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:12] 10Operations, 10Mail, 10Patch-For-Review: SPF record 
for canonical domains - https://phabricator.wikimedia.org/T193408 (10Vgutierrez) [14:33:11] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:33:20] (03PS1) 10Alexandros Kosiaris: kubernetes: Update to default 1.11 default admission plugins [puppet] - 10https://gerrit.wikimedia.org/r/502230 [14:36:03] 10Operations, 10Prod-Kubernetes, 10Release Pipeline, 10Documentation: TEC3:O6:O:6.1:Q4: Deployment Pipeline Documentation - https://phabricator.wikimedia.org/T220397 (10akosiaris) [14:37:42] 10Operations, 10Analytics, 10EventBus, 10Patch-For-Review, 10Services (watching): Discovery for Kafka cluster brokers - https://phabricator.wikimedia.org/T213561 (10akosiaris) 05Open→03Stalled Stalling until we have some sane solution. [14:38:01] (03CR) 10Vgutierrez: [C: 03+2] librenms: Offer an ECDSA certificate along with the RSA one [puppet] - 10https://gerrit.wikimedia.org/r/502205 (https://phabricator.wikimedia.org/T220359) (owner: 10Vgutierrez) [14:38:09] (03PS3) 10Vgutierrez: librenms: Offer an ECDSA certificate along with the RSA one [puppet] - 10https://gerrit.wikimedia.org/r/502205 (https://phabricator.wikimedia.org/T220359) [14:38:12] 10Operations, 10User-herron: Transition Kafka main ownership from Analytics Engineering to SRE - (2018-2019 Q4 SRE Goal Tracking Task) - https://phabricator.wikimedia.org/T220387 (10elukey) One thing that we didn't discuss for this goal is Zookeeper. At the moment multiple things are using conf100[4-6] hosts:... [14:40:17] 10Operations, 10Core Platform Team, 10Release Pipeline, 10Release-Engineering-Team, and 2 others: TEC3:O3:O3.1:Q4 Goal - Move cpjobqueue, Wikidata Termbox SSR (new service), Kask (session storage service) and ORES (partially) through the production CD Pipeline - https://phabricator.wikimedia.org/T220398 (10... 
[14:40:33] 10Operations, 10Core Platform Team, 10Release Pipeline, 10Release-Engineering-Team, and 2 others: TEC3:O3:O3.1:Q4 Goal - Move cpjobqueue, Wikidata Termbox SSR (new service), Kask (session storage service) and ORES (partially) through the production CD Pipeline - https://phabricator.wikimedia.org/T220398 (10... [14:40:37] 10Operations, 10Prod-Kubernetes, 10Release Pipeline, 10Documentation: TEC3:O6:O:6.1:Q4: Deployment Pipeline Documentation - https://phabricator.wikimedia.org/T220397 (10akosiaris) p:05Triage→03Normal [14:41:41] 10Operations, 10Core Platform Team, 10Release Pipeline, 10Release-Engineering-Team, and 2 others: Migrate cpjobqueue to kubernetes - https://phabricator.wikimedia.org/T220399 (10akosiaris) [14:41:46] 10Operations, 10Core Platform Team, 10Release Pipeline, 10Release-Engineering-Team, and 2 others: Migrate cpjobqueue to kubernetes - https://phabricator.wikimedia.org/T220399 (10akosiaris) p:05Triage→03Normal [14:41:53] (03PS1) 10Andrew Bogott: cloudvirt1024: update nic name [puppet] - 10https://gerrit.wikimedia.org/r/502232 (https://phabricator.wikimedia.org/T216724) [14:41:58] 10Operations, 10Core Platform Team, 10Release Pipeline, 10Release-Engineering-Team, and 2 others: TEC3:O3:O3.1:Q4 Goal - Move cpjobqueue, Wikidata Termbox SSR (new service), Kask (session storage service) and ORES (partially) through the production CD Pipeline - https://phabricator.wikimedia.org/T220398 (10... 
[14:42:16] (03PS2) 10Andrew Bogott: cloudvirt1024: update nic name [puppet] - 10https://gerrit.wikimedia.org/r/502232 (https://phabricator.wikimedia.org/T216724) [14:42:23] 10Operations, 10Core Platform Team, 10Release Pipeline, 10Release-Engineering-Team, and 2 others: Migrate ORES to kubernetes - https://phabricator.wikimedia.org/T220400 (10akosiaris) [14:42:54] 10Operations, 10Core Platform Team, 10Release Pipeline, 10Release-Engineering-Team, and 2 others: Introduce kask session storage service to kubernetes - https://phabricator.wikimedia.org/T220401 (10akosiaris) [14:43:04] 10Operations, 10Core Platform Team, 10Release Pipeline, 10Release-Engineering-Team, and 2 others: Introduce kask session storage service to kubernetes - https://phabricator.wikimedia.org/T220401 (10akosiaris) p:05Triage→03Normal [14:43:28] (03CR) 10Andrew Bogott: [C: 03+2] cloudvirt1024: update nic name [puppet] - 10https://gerrit.wikimedia.org/r/502232 (https://phabricator.wikimedia.org/T216724) (owner: 10Andrew Bogott) [14:43:49] (03CR) 10Muehlenhoff: [C: 03+1] "One comment inline, looks good to me!" 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/502201 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [14:43:52] 10Operations, 10Core Platform Team, 10Release Pipeline, 10Release-Engineering-Team, and 2 others: Introduce wikidata termbox SSR to kubernetes - https://phabricator.wikimedia.org/T220402 (10akosiaris) [14:44:01] 10Operations, 10Core Platform Team, 10Release Pipeline, 10Release-Engineering-Team, and 2 others: Introduce wikidata termbox SSR to kubernetes - https://phabricator.wikimedia.org/T220402 (10akosiaris) p:05Triage→03Normal [14:44:06] 10Operations, 10Core Platform Team, 10Release Pipeline, 10Release-Engineering-Team, and 2 others: Migrate ORES to kubernetes - https://phabricator.wikimedia.org/T220400 (10akosiaris) p:05Triage→03Normal [14:44:37] PROBLEM - proton endpoints health on proton1001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/pro [14:45:26] (03PS1) 10Elukey: Rely only on ores::base for common packages deployed to Analytics misc [puppet] - 10https://gerrit.wikimedia.org/r/502233 (https://phabricator.wikimedia.org/T148843) [14:45:29] !log powering down elastic2048 for disk replacement [14:45:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:41] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - proton_24766: Servers proton1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:45:48] 10Operations, 10serviceops: TEC3:Q4 Tracking task - https://phabricator.wikimedia.org/T220403 (10akosiaris) [14:46:05] 10Operations, 10Prod-Kubernetes, 10Release Pipeline, 10Documentation: 
TEC3:O6:O:6.1:Q4: Deployment Pipeline Documentation - https://phabricator.wikimedia.org/T220397 (10akosiaris) [14:46:08] 10Operations, 10serviceops: TEC3:Q4 Tracking task - https://phabricator.wikimedia.org/T220403 (10akosiaris) [14:46:11] 10Operations, 10Core Platform Team, 10Release Pipeline, 10Release-Engineering-Team, and 2 others: TEC3:O3:O3.1:Q4 Goal - Move cpjobqueue, Wikidata Termbox SSR (new service), Kask (session storage service) and ORES (partially) through the production CD Pipeline - https://phabricator.wikimedia.org/T220398 (10... [14:46:19] 10Operations, 10serviceops: TEC3:Q4 Tracking task - https://phabricator.wikimedia.org/T220403 (10akosiaris) p:05Triage→03Normal [14:46:28] (03CR) 10Elukey: "Does it make sense or is there any reason not to do it? :)" [puppet] - 10https://gerrit.wikimedia.org/r/502233 (https://phabricator.wikimedia.org/T148843) (owner: 10Elukey) [14:46:53] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:47:31] 10Operations, 10serviceops: TEC3:05:05.1:Q4 Services and the deployment pipeline are hosted on production-level infrastructure - https://phabricator.wikimedia.org/T220405 (10akosiaris) [14:47:35] PROBLEM - etcd request latencies on argon is CRITICAL: instance=10.64.32.133:6443 operation={compareAndSwap,create,get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:47:39] 10Operations, 10serviceops: TEC3:05:05.1:Q4 Services and the deployment pipeline are hosted on production-level infrastructure - https://phabricator.wikimedia.org/T220405 (10akosiaris) p:05Triage→03Normal [14:47:43] PROBLEM - Request latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb={GET,LIST,PATCH,PUT} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:48:08] xm, let's look at argon [14:48:09] RECOVERY - proton endpoints health on proton1001 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [14:48:21] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb={GET,LIST} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:48:27] PROBLEM - etcd request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 operation={compareAndSwap,get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:48:41] PROBLEM - etcd request latencies on neon is CRITICAL: instance=10.64.0.40:6443 operation={compareAndSwap,get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:48:47] wow etcd request latencies have increased a lot [14:48:52] 1.3s for a create? [14:48:55] what is going on? [14:49:11] cool replacing the DIMM in 5 mins [14:49:12] !log otto@deploy1001 Finished deploy [analytics/refinery@7fa6fb7]: deploying oozie article recommender for baho (duration: 18m 35s) [14:49:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:16] (03PS6) 10BBlack: Shortener VCL fixups [puppet] - 10https://gerrit.wikimedia.org/r/501032 (https://phabricator.wikimedia.org/T219986) (owner: 10Ladsgroup) [14:49:27] moritzm: ^ [14:50:28] (03PS1) 10Muehlenhoff: Don't install facter 2.4 in buster installs [puppet] - 10https://gerrit.wikimedia.org/r/502234 (https://phabricator.wikimedia.org/T219803) [14:50:30] papaul: ack, thx [14:51:14] (03CR) 10BryanDavis: [C: 03+1] Remove now obsolete tools-checker-grid-start-trusty monitoring service [puppet] - 10https://gerrit.wikimedia.org/r/502167 (owner: 10Muehlenhoff) [14:52:01] (03PS1) 10Alex Monk: profile::puppetmaster::frontend: Allow getting allow_from from hiera [puppet] - 10https://gerrit.wikimedia.org/r/502235 (https://phabricator.wikimedia.org/T171188) [14:52:23] (03CR) 10BBlack: [C: 03+2] Shortener VCL fixups [puppet] - 10https://gerrit.wikimedia.org/r/501032 (https://phabricator.wikimedia.org/T219986) (owner: 10Ladsgroup) [14:52:25] RECOVERY - Request latencies on chlorine is OK: All 
metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:52:27] (03PS2) 10Vgutierrez: install_server: bare minimum puppetization to install acmechief-test[12]001 [puppet] - 10https://gerrit.wikimedia.org/r/502229 (https://phabricator.wikimedia.org/T220378) [14:52:31] RECOVERY - etcd request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:52:36] it is recovering but unclear yet what the issue was [14:52:43] RECOVERY - etcd request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:52:51] 10Operations, 10ops-codfw: Degraded RAID on elastic2048 - https://phabricator.wikimedia.org/T220038 (10Papaul) a:05Papaul→03Gehel @Gehel Disk replaced. Let me know how the reimage goes before i send back the bad disk. Thanks [14:53:37] RECOVERY - etcd request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:53:46] ah, there we go [14:53:47] RECOVERY - Request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:53:48] (03CR) 10Legoktm: Shortener VCL fixups (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/501032 (https://phabricator.wikimedia.org/T219986) (owner: 10Ladsgroup) [14:53:58] etcd1003 suffered from CPU utilization [14:54:00] (03CR) 10Legoktm: "Couldn't -1 in time :/" [puppet] - 10https://gerrit.wikimedia.org/r/501032 (https://phabricator.wikimedia.org/T219986) (owner: 10Ladsgroup) [14:54:03] _joe_: ^ [14:54:12] https://grafana.wikimedia.org/d/000000377/host-overview?refresh=5m&panelId=3&fullscreen&orgId=1&var-server=etcd1003&var-datasource=eqiad%20prometheus%2Fops&var-cluster=kubernetes&from=now-1h&to=now [14:54:15] legoktm: ? 
[14:54:25] (03CR) 10Vgutierrez: [C: 03+2] install_server: bare minimum puppetization to install acmechief-test[12]001 [puppet] - 10https://gerrit.wikimedia.org/r/502229 (https://phabricator.wikimedia.org/T220378) (owner: 10Vgutierrez) [14:54:26] it has recovered, but it was kind enough to warn us [14:54:33] (03PS3) 10Vgutierrez: install_server: bare minimum puppetization to install acmechief-test[12]001 [puppet] - 10https://gerrit.wikimedia.org/r/502229 (https://phabricator.wikimedia.org/T220378) [14:54:36] actually, moving the etcd talk into #-serviceops [14:54:43] bblack: I think we should be keeping the shortcode validation in the VCL [14:55:00] I'm actually about to leave for school, I'll try and go back through our old discussions as to why we added it [14:55:04] !log powering down mw2206 for DIMM replacement [14:55:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:07] I do remember that it was there for a specific reason [14:55:08] legoktm: we can for now, but in the long run we really shouldn't have to [14:55:18] (if there's a bug, fix it in MW, basically) [14:55:50] but yeah I'm sure that's not easy with the current rewrite scheme [14:55:57] (03CR) 10Dzahn: "you might want to consider "profile::base::notifications: disabled" in Hiera since these are test servers" [puppet] - 10https://gerrit.wikimedia.org/r/502229 (https://phabricator.wikimedia.org/T220378) (owner: 10Vgutierrez) [14:56:10] (the really-right answer is MW should have its own URL routing layer and understand w.wiki URLs natively without the rewrite to meta) [14:56:44] (03CR) 10Vgutierrez: [C: 03+2] "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/502229 (https://phabricator.wikimedia.org/T220378) (owner: 10Vgutierrez) [14:56:51] the dream :) [14:57:21] thx for the comment mutante :) [15:00:29] PROBLEM - Host mw2206.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:00:55] (03PS2) 10Filippo Giunchedi: Add restbase2019 / restbase2020 instances 
[dns] - 10https://gerrit.wikimedia.org/r/501525 (https://phabricator.wikimedia.org/T217368) [15:01:02] (03PS1) 10BBlack: Shortener: add back charset validation [puppet] - 10https://gerrit.wikimedia.org/r/502238 (https://phabricator.wikimedia.org/T219986) [15:01:14] gerrit seems slow-ish today [15:02:09] (03CR) 10BBlack: [C: 03+2] Shortener: add back charset validation [puppet] - 10https://gerrit.wikimedia.org/r/502238 (https://phabricator.wikimedia.org/T219986) (owner: 10BBlack) [15:03:19] (03CR) 10Bstorm: [C: 03+1] "Seems legit. I'll wait a bit in case there's something I'm missing before merging." [puppet] - 10https://gerrit.wikimedia.org/r/499669 (https://phabricator.wikimedia.org/T102367) (owner: 10BryanDavis) [15:05:08] (03PS2) 10Filippo Giunchedi: restbase: add restbase2019 / restbase2020 [puppet] - 10https://gerrit.wikimedia.org/r/501526 (https://phabricator.wikimedia.org/T217368) [15:05:28] 10Operations, 10ops-codfw, 10serviceops, 10User-jijiki: mw2206.codfw.wmnet memory issues - https://phabricator.wikimedia.org/T215415 (10Papaul) a:05Papaul→03jijiki DIMM_A2 replaced [15:06:29] RECOVERY - Host mw2206.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.87 ms [15:06:32] (03CR) 10BryanDavis: "> Just for completeness. 
Have we considered running the proxy inside" [puppet] - 10https://gerrit.wikimedia.org/r/493767 (https://phabricator.wikimedia.org/T151704) (owner: 10BryanDavis) [15:07:39] (03PS1) 10Dzahn: cassandra: set super_user, super_password explicitly [puppet] - 10https://gerrit.wikimedia.org/r/502240 (https://phabricator.wikimedia.org/T219560) [15:08:10] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/deploy restbase2019 and restbase2020 - https://phabricator.wikimedia.org/T217368 (10Papaul) a:05Papaul→03fgiunchedi [15:08:45] (03CR) 10jerkins-bot: [V: 04-1] cassandra: set super_user, super_password explicitly [puppet] - 10https://gerrit.wikimedia.org/r/502240 (https://phabricator.wikimedia.org/T219560) (owner: 10Dzahn) [15:08:54] 10Operations, 10ops-codfw, 10serviceops, 10User-jijiki: mw2206.codfw.wmnet memory issues - https://phabricator.wikimedia.org/T215415 (10jijiki) @Papaul thank you! Pooling ... [15:09:03] !log Pool mw2206 - T215415 [15:09:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:06] T215415: mw2206.codfw.wmnet memory issues - https://phabricator.wikimedia.org/T215415 [15:10:14] 10Operations, 10ops-codfw, 10serviceops, 10User-jijiki: mw2206.codfw.wmnet memory issues - https://phabricator.wikimedia.org/T215415 (10jijiki) 05Open→03Resolved Will reopen if there are issues. Thank you all! 
[15:10:37] legoktm: all fixed now [15:10:41] 10Operations, 10Fundraising-Backlog, 10Mail, 10fundraising-tech-ops: Identify appropriate SPF record for domain wikimediafoundation.org - https://phabricator.wikimedia.org/T220412 (10herron) p:05Triage→03Normal [15:11:15] (03PS7) 10Jbond: facter3/puppet5: Introduce parameters to introduce facter and puppet [puppet] - 10https://gerrit.wikimedia.org/r/502201 (https://phabricator.wikimedia.org/T219803) [15:13:51] (03CR) 10Muehlenhoff: [C: 03+1] facter3/puppet5: Introduce parameters to introduce facter and puppet [puppet] - 10https://gerrit.wikimedia.org/r/502201 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [15:14:17] (03CR) 10Jbond: facter3/puppet5: Introduce parameters to introduce facter and puppet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/502201 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [15:14:19] (03CR) 10Jbond: [C: 03+2] facter3/puppet5: Introduce parameters to introduce facter and puppet [puppet] - 10https://gerrit.wikimedia.org/r/502201 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [15:14:22] mutante: modules/cassandra/manifests/init.pp:137: $super_password = 'cassandra', [15:14:28] (03PS8) 10Jbond: facter3/puppet5: Introduce parameters to introduce facter and puppet [puppet] - 10https://gerrit.wikimedia.org/r/502201 (https://phabricator.wikimedia.org/T219803) [15:14:56] (03CR) 10Alexandros Kosiaris: [C: 03+2] kubernetes: Update to default 1.11 default admission plugins [puppet] - 10https://gerrit.wikimedia.org/r/502230 (owner: 10Alexandros Kosiaris) [15:15:03] (03PS2) 10Alexandros Kosiaris: kubernetes: Update to default 1.11 default admission plugins [puppet] - 10https://gerrit.wikimedia.org/r/502230 [15:15:07] (03PS9) 10CRusnov: Break report into parts and adjust the way devices are filtered [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/499245 [15:16:43] (03CR) 10CRusnov: "> Patch Set 8: Code-Review+1" (031 comment) 
[software/netbox-reports] - 10https://gerrit.wikimedia.org/r/499245 (owner: 10CRusnov) [15:16:53] (03PS8) 10Herron: ores: ship to logstash via the kafka logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/497614 (https://phabricator.wikimedia.org/T213899) [15:17:45] (03CR) 10Alexandros Kosiaris: [C: 03+1] confd: base::service_unit -> systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/456317 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [15:18:07] (03CR) 10Alexandros Kosiaris: [C: 03+1] apertium: base::service_unit -> systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/456316 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [15:19:38] (03CR) 10Eevans: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/501525 (https://phabricator.wikimedia.org/T217368) (owner: 10Filippo Giunchedi) [15:19:48] (03CR) 10Eevans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/501526 (https://phabricator.wikimedia.org/T217368) (owner: 10Filippo Giunchedi) [15:19:57] !log switching ores to logstash kafka logging pipeline (via temporary puppet disable and rolling puppet agent runs) [15:19:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:13] (03CR) 10Filippo Giunchedi: [C: 03+2] Add restbase2019 / restbase2020 instances [dns] - 10https://gerrit.wikimedia.org/r/501525 (https://phabricator.wikimedia.org/T217368) (owner: 10Filippo Giunchedi) [15:22:07] (03CR) 10Herron: [C: 03+2] ores: ship to logstash via the kafka logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/497614 (https://phabricator.wikimedia.org/T213899) (owner: 10Herron) [15:25:38] (03PS3) 10Alexandros Kosiaris: kubernetes: Update to default 1.11 default admission plugins [puppet] - 10https://gerrit.wikimedia.org/r/502230 [15:25:41] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] kubernetes: Update to default 1.11 default admission plugins [puppet] - 10https://gerrit.wikimedia.org/r/502230 (owner: 10Alexandros 
Kosiaris) [15:26:37] PROBLEM - puppet last run on labvirt1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:26:50] (03PS3) 10Filippo Giunchedi: restbase: add restbase2019 / restbase2020 [puppet] - 10https://gerrit.wikimedia.org/r/501526 (https://phabricator.wikimedia.org/T217368) [15:27:31] (03CR) 10Filippo Giunchedi: [C: 03+2] restbase: add restbase2019 / restbase2020 [puppet] - 10https://gerrit.wikimedia.org/r/501526 (https://phabricator.wikimedia.org/T217368) (owner: 10Filippo Giunchedi) [15:27:45] PROBLEM - puppet last run on labvirt1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:28:05] (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/499245 (owner: 10CRusnov) [15:29:17] PROBLEM - puppet last run on labservices1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:32:14] (03CR) 10CRusnov: [C: 03+2] puppetdb_microservice: Redo how it returns values [puppet] - 10https://gerrit.wikimedia.org/r/501104 (owner: 10CRusnov) [15:32:24] (03PS4) 10CRusnov: puppetdb_microservice: Redo how it returns values [puppet] - 10https://gerrit.wikimedia.org/r/501104 [15:32:53] PROBLEM - puppet last run on labvirt1002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:33:16] (03CR) 10CRusnov: [C: 03+2] Break report into parts and adjust the way devices are filtered [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/499245 (owner: 10CRusnov) [15:33:40] (03PS1) 10Herron: Revert "ores: ship to logstash via the kafka logging pipeline" [puppet] - 10https://gerrit.wikimedia.org/r/502256 [15:34:37] (03CR) 10jerkins-bot: [V: 04-1] Revert "ores: ship to logstash via the kafka logging pipeline" [puppet] - 10https://gerrit.wikimedia.org/r/502256 (owner: 10Herron) [15:35:03] PROBLEM - puppet last run on labservices1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:35:30] !log aborting ores to logstash kafka logging pipeline switchover for now. puppet applied only to ores2009, reverting now [15:35:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:02] (03PS2) 10Herron: Revert "ores: ship to logstash via the kafka logging pipeline" [puppet] - 10https://gerrit.wikimedia.org/r/502256 [15:36:11] PROBLEM - puppet last run on labvirt1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:37:10] (03PS3) 10Herron: Revert "ores: ship to logstash via the kafka logging pipeline" [puppet] - 10https://gerrit.wikimedia.org/r/502256 [15:37:48] 10Operations, 10ops-codfw: Degraded RAID on elastic2048 - https://phabricator.wikimedia.org/T220038 (10Papaul) Return information {F28595454} [15:38:10] ^ jbond42: this fails on the remaining trustys in the validation of an apt:repository parameter for the mitaka repository [15:38:15] (03CR) 10Herron: [C: 03+2] Revert "ores: ship to logstash via the kafka logging pipeline" [puppet] - 10https://gerrit.wikimedia.org/r/502256 (owner: 10Herron) [15:38:39] Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Apt::Repository[ubuntucloud]: parameter 'keyfile' expects a match for Stdlib::Unixpath = Pattern[/^\/([^\/\0]+\/*)*$/], got 'puppet:///modules/openstack/serverpackages/mitaka/trusty/ubuntu-cloud.key' at /etc/puppet/modules/openstack/manifests/serverpackages/mitaka/trusty. [15:38:40] pp:7 on node labvirt1002.eqiad.wmnet [15:39:09] PROBLEM - puppet last run on labcontrol1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:39:43] it contains a puppet:// URI instead of a unix path [15:40:26] 10Operations, 10Core Platform Team, 10MediaWiki-General-or-Unknown, 10serviceops, 10PHP 7.2 support: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (10Joe) I tried to write a simplistic implementation of such a function:... [15:41:19] moritzm puppet is also failing for me with a similar error to that. 
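The catalog failures above come down to a pattern check: the `keyfile` value is a `puppet:///` URI, while the `Stdlib::Unixpath` pattern quoted in the error only accepts strings that begin with `/`. Below is a minimal sketch of that check, assuming Python's `re` module treats the quoted pattern the same way Puppet does for these inputs. The function name `is_unixpath` and the sample local path are illustrative and do not appear in the log.

```python
import re

# Pattern copied from the error message:
#   Stdlib::Unixpath = Pattern[/^\/([^\/\0]+\/*)*$/]
UNIXPATH = re.compile(r"^\/([^\/\0]+\/*)*$")


def is_unixpath(value: str) -> bool:
    """Return True if value matches the Unixpath pattern quoted in the log."""
    return UNIXPATH.match(value) is not None


# The value Puppet rejected on labvirt1002 (taken verbatim from the error).
keyfile = "puppet:///modules/openstack/serverpackages/mitaka/trusty/ubuntu-cloud.key"

# Hypothetical absolute path, included only for contrast.
local_path = "/etc/apt/ubuntu-cloud.key"

print(is_unixpath(keyfile))     # False: the URI starts with "puppet:", not "/"
print(is_unixpath(local_path))  # True: an absolute path satisfies the pattern
```

This is consistent with the suggestion in the channel to relax the parameter type to `Optional[String]`, which would accept both forms. Whether the merged change did exactly that is not shown in the log.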
[15:42:09] moritzm: thanks looking [15:42:12] jbond42: I'd say let's switch to Optional[String] instead [15:42:31] yes agree [15:42:34] (03CR) 10Bstorm: "> Patch Set 1: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/501384 (https://phabricator.wikimedia.org/T219652) (owner: 10Bstorm) [15:43:07] (03PS1) 10Vgutierrez: mirrors: Offer an ECDSA certificate along with the RSA one [puppet] - 10https://gerrit.wikimedia.org/r/502260 (https://phabricator.wikimedia.org/T220359) [15:43:54] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: rack/setup/install db1139|db1140.eqiad.wmnet (2 dump slaves) - https://phabricator.wikimedia.org/T218985 (10Marostegui) Those two hosts were added to site.pp and have been added to the install recipe a few days ago. [15:44:54] (03PS1) 10Jbond: apt::repository: fix paramater validation [puppet] - 10https://gerrit.wikimedia.org/r/502261 [15:45:10] moritzm: ^^ [15:45:13] (03PS1) 10Vgutierrez: tendril: Offer an ECDSA certificate along with the RSA one [puppet] - 10https://gerrit.wikimedia.org/r/502263 (https://phabricator.wikimedia.org/T220359) [15:45:25] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/502261 (owner: 10Jbond) [15:45:41] PROBLEM - puppet last run on labnet1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:46:17] (03CR) 10Jbond: [C: 03+2] apt::repository: fix paramater validation [puppet] - 10https://gerrit.wikimedia.org/r/502261 (owner: 10Jbond) [15:47:30] 10Operations, 10Core Platform Team, 10MediaWiki-General-or-Unknown, 10serviceops, 10PHP 7.2 support: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (10Joe) It should be noted that if this seems too slow, we can still "fix... [15:48:27] PROBLEM - puppet last run on labvirt1004 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:48:33] PROBLEM - puppet last run on labvirt1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:49:27] RECOVERY - puppet last run on labcontrol1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:49:32] paladox: should be fixed with next puppet run [15:49:40] thanks moritzm! [15:50:41] RECOVERY - puppet last run on labnet1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:52:45] RECOVERY - puppet last run on labvirt1007 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [15:53:15] PROBLEM - puppet last run on labcontrol1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:53:22] 10Operations, 10Wikimedia-Mailing-lists: Reset password for wikimedia-gh mailing list - https://phabricator.wikimedia.org/T220416 (10Nkansahrexford) [15:53:41] RECOVERY - puppet last run on labvirt1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:54:19] (03PS6) 10Bstorm: maintain-views: Note explicit exclusion of `oathauth_users` from replicas [puppet] - 10https://gerrit.wikimedia.org/r/496063 (https://phabricator.wikimedia.org/T218165) (owner: 10MarcoAurelio) [15:55:11] RECOVERY - puppet last run on labservices1002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [15:55:22] (03PS1) 10CRusnov: Minor fixes [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/502264 [15:55:44] (03CR) 10Bstorm: [C: 03+2] maintain-views: Note explicit exclusion of `oathauth_users` from replicas [puppet] - 10https://gerrit.wikimedia.org/r/496063 (https://phabricator.wikimedia.org/T218165) (owner: 10MarcoAurelio) [15:57:43] PROBLEM - puppet last run on labvirt1005 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:58:39] RECOVERY - puppet last run on labvirt1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:59:21] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Cool, thanks for researching that!" [puppet] - 10https://gerrit.wikimedia.org/r/501384 (https://phabricator.wikimedia.org/T219652) (owner: 10Bstorm) [15:59:31] PROBLEM - puppet last run on labnet1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:00:49] RECOVERY - puppet last run on labservices1001 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [16:00:56] (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/502264 (owner: 10CRusnov) [16:01:38] (03CR) 10CRusnov: [C: 03+2] Minor fixes [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/502264 (owner: 10CRusnov) [16:04:19] RECOVERY - puppet last run on labvirt1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:13:19] RECOVERY - puppet last run on labvirt1005 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:14:07] RECOVERY - puppet last run on labcontrol1002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:14:29] RECOVERY - puppet last run on labvirt1006 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:15:13] RECOVERY - puppet last run on labnet1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:18:37] (03CR) 10Lucas Werkmeister (WMDE): "Apparently the whole $wmg… + $wg… variable dance is not necessary for extensions that use extension registration (i. e. 
extension.json), w" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500692 (https://phabricator.wikimedia.org/T218767) (owner: 10Greta WMDE) [16:19:22] RECOVERY - puppet last run on labvirt1004 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:26:01] (03PS1) 10Lucas Werkmeister (WMDE): Stop using wmg variables for Score extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502269 [16:26:19] (03CR) 10Lucas Werkmeister (WMDE): "> I’ll upload a separate change to convert the other Score settings for consistency." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500692 (https://phabricator.wikimedia.org/T218767) (owner: 10Greta WMDE) [16:27:28] (03CR) 10Lucas Werkmeister (WMDE): "I think I can test this on mwdebug1002 with" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502269 (owner: 10Lucas Werkmeister (WMDE)) [16:31:12] !log bootstrapping cassandra-a, restbase2019 -- T208087 [16:31:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:16] T208087: Replace remaining Samsung SSDs - https://phabricator.wikimedia.org/T208087 [16:34:13] (03PS1) 10Mobrovac: service::node: Allow the world to read the config [puppet] - 10https://gerrit.wikimedia.org/r/502270 (https://phabricator.wikimedia.org/T207143) [16:35:39] (03CR) 10Arturo Borrero Gonzalez: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/493767 (https://phabricator.wikimedia.org/T151704) (owner: 10BryanDavis) [16:35:53] (03PS2) 10Mobrovac: service::node: Allow the world to read the config [puppet] - 10https://gerrit.wikimedia.org/r/502270 (https://phabricator.wikimedia.org/T207143) [16:41:12] (03CR) 10Alexandros Kosiaris: [C: 03+2] service::node: Allow the world to read the config [puppet] - 10https://gerrit.wikimedia.org/r/502270 (https://phabricator.wikimedia.org/T207143) (owner: 10Mobrovac) [16:44:32] 10Operations, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Kanban): labtestmetal2001.codfw.wmnet: 
rename to clouddb2001-dev and reimage to stretch in codfw1dev - https://phabricator.wikimedia.org/T220129 (10aborrero) [16:45:09] 10Operations, 10Wikidata, 10Wikidata-Termbox-Hike, 10serviceops, and 4 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (10akosiaris) @WMDE-leszek Hi, sorry for not answering any sooner, last few weeks have been crazy indeed. Q4/Q2 started We can start work on... [16:45:43] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install (5) codfw dedicated dump slaves - https://phabricator.wikimedia.org/T219463 (10jcrespo) [16:47:05] (03Abandoned) 10CRusnov: Take a code-quality pass [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/499515 (owner: 10CRusnov) [16:47:51] (03PS3) 10Mholloway: WikimediaEditorTasks: Replace needed Beta Cluster config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501845 (https://phabricator.wikimedia.org/T220153) [16:51:26] (03CR) 10Mholloway: [C: 03+2] WikimediaEditorTasks: Replace needed Beta Cluster config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501845 (https://phabricator.wikimedia.org/T220153) (owner: 10Mholloway) [16:52:10] 10Operations, 10ops-eqiad, 10DBA: rack/setup/deploy eqiad dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T219399 (10jcrespo) Just to be clear, @Cmjohnson, of what we discussed on the meeting, for eqiad, this one (dbprov) is more important for us than those that arrived recen... 
[16:52:26] (03Merged) 10jenkins-bot: WikimediaEditorTasks: Replace needed Beta Cluster config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501845 (https://phabricator.wikimedia.org/T220153) (owner: 10Mholloway) [16:54:26] 10Operations, 10Core Platform Team, 10Release Pipeline, 10Release-Engineering-Team, and 3 others: Introduce kask session storage service to kubernetes - https://phabricator.wikimedia.org/T220401 (10Eevans) [16:55:54] !log mholloway-shell@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: Replace needed WikimediaEditorTasks Beta Cluster config (T220153) (duration: 00m 58s) [16:55:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:57] T220153: Beta Wikidata: Error: 1146 Table 'wikishared.wikimedia_editor_tasks_keys' doesn't exist - https://phabricator.wikimedia.org/T220153 [16:56:56] (03PS6) 10Jbond: puppet: Refactor of the base::puppet class [puppet] - 10https://gerrit.wikimedia.org/r/501617 (https://phabricator.wikimedia.org/T219803) [16:57:43] 10Operations, 10Core Platform Team, 10Release Pipeline, 10Release-Engineering-Team, and 4 others: Introduce wikidata termbox SSR to kubernetes - https://phabricator.wikimedia.org/T220402 (10Lydia_Pintscher) [16:59:04] 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team (Kanban): cloudnet2002-dev: repurpose as cloudweb2001-dev.wikimedia.org - https://phabricator.wikimedia.org/T220426 (10aborrero) [16:59:33] (03CR) 10jenkins-bot: WikimediaEditorTasks: Replace needed Beta Cluster config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501845 (https://phabricator.wikimedia.org/T220153) (owner: 10Mholloway) [16:59:35] !log mobrovac@deploy1001 Started deploy [mobileapps/deploy@64f09a0]: Force-deploy to scb1001 to test the config perms [16:59:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:50] !log mobrovac@deploy1001 Finished deploy [mobileapps/deploy@64f09a0]: Force-deploy to scb1001 to test 
the config perms (duration: 00m 16s) [16:59:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:04] gehel and onimisionipe: My dear minions, it's time we take the moon! Just kidding. Time for Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190408T1700). [17:00:18] here here [17:02:09] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: rack/setup/install db1139|db1140.eqiad.wmnet (2 dump slaves) - https://phabricator.wikimedia.org/T218985 (10jcrespo) [17:03:10] !log onimisionipe@deploy1001 Started deploy [wdqs/wdqs@c30a540]: GUI updates, Updater with redirect fix and Blazegraph with XSS fix [17:03:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:38] (03CR) 10Gehel: [C: 04-1] "A few more minor comment. Overall, this looks pretty good, this is mostly style." (036 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/493138 (https://phabricator.wikimedia.org/T217072) (owner: 10CRusnov) [17:05:15] ACKNOWLEDGEMENT - cassandra-a CQL 10.192.16.98:9042 on restbase2019 is CRITICAL: connect to address 10.192.16.98 and port 9042: Connection refused eevans Bootstrapping instances. https://phabricator.wikimedia.org/T93886 [17:05:15] ACKNOWLEDGEMENT - cassandra-b CQL 10.192.16.99:9042 on restbase2019 is CRITICAL: connect to address 10.192.16.99 and port 9042: Connection refused eevans Bootstrapping instances. https://phabricator.wikimedia.org/T93886 [17:05:15] ACKNOWLEDGEMENT - cassandra-b SSL 10.192.16.99:7001 on restbase2019 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused eevans Bootstrapping instances. https://phabricator.wikimedia.org/T120662 [17:05:15] ACKNOWLEDGEMENT - cassandra-b service on restbase2019 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed eevans Bootstrapping instances. 
[17:05:15] ACKNOWLEDGEMENT - cassandra-c CQL 10.192.16.100:9042 on restbase2019 is CRITICAL: connect to address 10.192.16.100 and port 9042: Connection refused eevans Bootstrapping instances. https://phabricator.wikimedia.org/T93886 [17:05:15] ACKNOWLEDGEMENT - cassandra-c SSL 10.192.16.100:7001 on restbase2019 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused eevans Bootstrapping instances. https://phabricator.wikimedia.org/T120662 [17:05:15] ACKNOWLEDGEMENT - cassandra-c service on restbase2019 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed eevans Bootstrapping instances. [17:06:19] (03PS1) 10Mobrovac: service::node: Ensure the deployment user can read and write the config [puppet] - 10https://gerrit.wikimedia.org/r/502273 (https://phabricator.wikimedia.org/T207143) [17:07:10] (03CR) 10Alexandros Kosiaris: [C: 03+2] service::node: Ensure the deployment user can read and write the config [puppet] - 10https://gerrit.wikimedia.org/r/502273 (https://phabricator.wikimedia.org/T207143) (owner: 10Mobrovac) [17:08:26] (03PS1) 10Arturo Borrero Gonzalez: labtestmetal2001: repurpose as clouddb2001-dev.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/502274 (https://phabricator.wikimedia.org/T220129) [17:12:18] (03PS3) 10BBlack: Add CNAME-variant langlist template [dns] - 10https://gerrit.wikimedia.org/r/501628 (https://phabricator.wikimedia.org/T208263) [17:12:20] (03PS3) 10BBlack: wiktionary: test with zone-local CNAME->DYNA [dns] - 10https://gerrit.wikimedia.org/r/501629 (https://phabricator.wikimedia.org/T208263) [17:14:27] !log onimisionipe@deploy1001 Finished deploy [wdqs/wdqs@c30a540]: GUI updates, Updater with redirect fix and Blazegraph with XSS fix (duration: 11m 17s) [17:14:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:38] (03CR) 10CRusnov: "Thank you as always for the review!" 
(033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/493138 (https://phabricator.wikimedia.org/T217072) (owner: 10CRusnov) [17:15:59] (03CR) 10CRusnov: "I think for the time being, seeing as this does what's expected, too much style lawyering is unnecessary. I don't see a really clean way t" [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/498268 (owner: 10CRusnov) [17:20:04] (03CR) 10CRusnov: "> Patch Set 12:" (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/493138 (https://phabricator.wikimedia.org/T217072) (owner: 10CRusnov) [17:22:35] (03CR) 10Jforrester: Stop using wmg variables for Score extension (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502269 (owner: 10Lucas Werkmeister (WMDE)) [17:23:21] (03PS2) 10Bstorm: cloudstore: A bit more cleanup [puppet] - 10https://gerrit.wikimedia.org/r/501446 (https://phabricator.wikimedia.org/T209527) [17:30:12] (03CR) 10Lucas Werkmeister (WMDE): Stop using wmg variables for Score extension (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502269 (owner: 10Lucas Werkmeister (WMDE)) [17:30:24] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Refactor public-facing DYNA scheme for primary project hostnames in our DNS - https://phabricator.wikimedia.org/T208263 (10BBlack) The wiktionary CNAME experiment is going out today, and I'm intending to keep it running for at least a... 
[17:30:45] (03CR) 10BBlack: [C: 03+2] Add CNAME-variant langlist template [dns] - 10https://gerrit.wikimedia.org/r/501628 (https://phabricator.wikimedia.org/T208263) (owner: 10BBlack) [17:30:50] (03PS1) 10Arturo Borrero Gonzalez: labtestmetal2001: rename to clouddb2001-dev.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/502283 (https://phabricator.wikimedia.org/T220129) [17:32:01] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] labtestmetal2001: rename to clouddb2001-dev.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/502283 (https://phabricator.wikimedia.org/T220129) (owner: 10Arturo Borrero Gonzalez) [17:32:11] (03PS2) 10Arturo Borrero Gonzalez: labtestmetal2001: rename to clouddb2001-dev.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/502283 (https://phabricator.wikimedia.org/T220129) [17:33:24] (03PS2) 10Bstorm: labs cumin: Allow running nfs_hostlist script outside a puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/502219 (https://phabricator.wikimedia.org/T219421) (owner: 10Alex Monk) [17:33:31] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] labtestmetal2001: repurpose as clouddb2001-dev.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/502274 (https://phabricator.wikimedia.org/T220129) (owner: 10Arturo Borrero Gonzalez) [17:33:56] 10Operations, 10ops-codfw, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Kanban): labtestmetal2001.codfw.wmnet: rename to clouddb2001-dev and reimage to stretch in codfw1dev - https://phabricator.wikimedia.org/T220129 (10aborrero) [17:34:37] (03PS3) 10Bstorm: labs cumin: Allow running nfs_hostlist script outside a puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/502219 (https://phabricator.wikimedia.org/T219421) (owner: 10Alex Monk) [17:35:40] (03CR) 10Bstorm: [C: 03+2] labs cumin: Allow running nfs_hostlist script outside a puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/502219 (https://phabricator.wikimedia.org/T219421) (owner: 10Alex Monk) [17:40:28] (03PS1) 10Arturo Borrero Gonzalez: 
clouddb2001-dev: use a non-cloudvirt-specific partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/502285 (https://phabricator.wikimedia.org/T220129) [17:41:02] (03PS12) 10Bstorm: labstore: Adapt nfs-exportd to be used on more than one cluster [puppet] - 10https://gerrit.wikimedia.org/r/501694 (https://phabricator.wikimedia.org/T209527) [17:41:25] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] clouddb2001-dev: use a non-cloudvirt-specific partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/502285 (https://phabricator.wikimedia.org/T220129) (owner: 10Arturo Borrero Gonzalez) [17:42:04] (03CR) 10jerkins-bot: [V: 04-1] labstore: Adapt nfs-exportd to be used on more than one cluster [puppet] - 10https://gerrit.wikimedia.org/r/501694 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [17:42:39] !log add swift term to cr1/2-eqiad - T220081 [17:42:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:43] T220081: Allow swift https access from analytics to prod - https://phabricator.wikimedia.org/T220081 [17:43:30] (03CR) 10Lucas Werkmeister (WMDE): [C: 04-1] "I just noticed InitialiseSettings-labs.php needs to be updated as well." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502269 (owner: 10Lucas Werkmeister (WMDE)) [17:43:49] 10Operations, 10Analytics, 10netops: Allow swift https access from analytics to prod - https://phabricator.wikimedia.org/T220081 (10ayounsi) 05Open→03Resolved Done, please reopen if any issue. 
[17:43:52] 10Operations, 10Analytics, 10Discovery, 10Research: Make hadoop cluster able to push to swift - https://phabricator.wikimedia.org/T219544 (10ayounsi) [17:43:59] (03PS13) 10Bstorm: labstore: Adapt nfs-exportd to be used on more than one cluster [puppet] - 10https://gerrit.wikimedia.org/r/501694 (https://phabricator.wikimedia.org/T209527) [17:44:57] (03CR) 10jerkins-bot: [V: 04-1] labstore: Adapt nfs-exportd to be used on more than one cluster [puppet] - 10https://gerrit.wikimedia.org/r/501694 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [17:45:41] PROBLEM - Check systemd state on restbase2019 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:48:22] (03CR) 10Lucas Werkmeister (WMDE): [C: 04-1] Stop using wmg variables for Score extension (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502269 (owner: 10Lucas Werkmeister (WMDE)) [17:49:13] (03PS14) 10Bstorm: labstore: Adapt nfs-exportd to be used on more than one cluster [puppet] - 10https://gerrit.wikimedia.org/r/501694 (https://phabricator.wikimedia.org/T209527) [17:49:21] 10Operations, 10ops-codfw, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Kanban): labtestmetal2001.codfw.wmnet: rename to clouddb2001-dev and reimage to stretch in codfw1dev - https://phabricator.wikimedia.org/T220129 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by aborrero on cumin... 
[17:50:18] !log T220129 renaming labtestmetal2001.codfw.wmnet to clouddb2001-dev.codfw.wmnet [17:50:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:21] T220129: labtestmetal2001.codfw.wmnet: rename to clouddb2001-dev and reimage to stretch in codfw1dev - https://phabricator.wikimedia.org/T220129 [17:54:43] (03PS2) 10Lucas Werkmeister (WMDE): Stop using wmg variables for Score extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502269 [17:55:04] (03CR) 10BBlack: [C: 03+2] wiktionary: test with zone-local CNAME->DYNA [dns] - 10https://gerrit.wikimedia.org/r/501629 (https://phabricator.wikimedia.org/T208263) (owner: 10BBlack) [17:55:23] (03PS4) 10BBlack: wiktionary: test with zone-local CNAME->DYNA [dns] - 10https://gerrit.wikimedia.org/r/501629 (https://phabricator.wikimedia.org/T208263) [17:59:07] (03PS1) 10Ottomata: Enable eventgate-analytics api-request logging for group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502292 (https://phabricator.wikimedia.org/T214080) [17:59:18] (03CR) 10Bstorm: [C: 03+2] labstore: Adapt nfs-exportd to be used on more than one cluster [puppet] - 10https://gerrit.wikimedia.org/r/501694 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [17:59:51] (03PS7) 10CRusnov: Add basic Ganeti RAPI module and tests [software/spicerack] - 10https://gerrit.wikimedia.org/r/499032 [18:00:04] Deploy window Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190408T1800) [18:00:04] Lucas_WMDE: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[18:00:37] o/ [18:00:41] I think my patch is still under discussion [18:00:54] (https://gerrit.wikimedia.org/r/502269) [18:03:54] (03CR) 10CRusnov: "> Patch Set 6: Code-Review-1" (035 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/499032 (owner: 10CRusnov) [18:04:00] (03PS1) 10Ottomata: eventgate-analytics - precache /test/event/0.0.3 schema [deployment-charts] - 10https://gerrit.wikimedia.org/r/502293 [18:04:24] (03CR) 10Ottomata: [V: 03+2 C: 03+2] eventgate-analytics - precache /test/event/0.0.3 schema [deployment-charts] - 10https://gerrit.wikimedia.org/r/502293 (owner: 10Ottomata) [18:05:27] (03CR) 10Volans: [C: 04-1] "Was this tested on the af-netbox test instance? I don't think it runs." (033 comments) [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/498268 (owner: 10CRusnov) [18:06:25] !log mobrovac@deploy1001 Started deploy [restbase/deploy@9cf5364]: Lower AQS rate limits and fix recommendation-api spec - T219910 T220221 [18:06:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:30] (03CR) 10CRusnov: "> Patch Set 4: Code-Review-1" (033 comments) [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/498268 (owner: 10CRusnov) [18:06:31] T219910: AQS alerts due to big queries issued to Druid for the edit API - https://phabricator.wikimedia.org/T219910 [18:06:31] T220221: Recommendation API end point has disappeared after the upgrade - https://phabricator.wikimedia.org/T220221 [18:09:16] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml --reset-values stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [18:09:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:19] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [18:09:19] !log otto@deploy1001 scap-helm eventgate-analytics finished [18:09:20] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:45] just out of interest, is any SWATer around anyways? [18:10:24] because I don’t need to wait for comments on my patch if the SWAT isn’t happening [18:10:51] !log otto@deploy1001 scap-helm eventgate-analytics upgrade production -f eventgate-analytics-codfw-values.yaml --reset-values stable/eventgate-analytics [namespace: eventgate-analytics, clusters: codfw] [18:10:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:15] (03PS1) 10Arturo Borrero Gonzalez: clouddb2001-dev: cleanup FQDNs [dns] - 10https://gerrit.wikimedia.org/r/502295 (https://phabricator.wikimedia.org/T220129) [18:12:38] !log otto@deploy1001 scap-helm eventgate-analytics upgrade production -f eventgate-analytics-codfw-values.yaml --reset-values stable/eventgate-analytics [namespace: eventgate-analytics, clusters: codfw] [18:12:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:42] !log otto@deploy1001 scap-helm eventgate-analytics cluster codfw completed [18:12:42] !log otto@deploy1001 scap-helm eventgate-analytics finished [18:12:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:59] !log otto@deploy1001 scap-helm eventgate-analytics upgrade production -f eventgate-analytics-eqiad-values.yaml --reset-values stable/eventgate-analytics [namespace: eventgate-analytics, clusters: eqiad] [18:13:00] !log otto@deploy1001 scap-helm eventgate-analytics cluster eqiad completed [18:13:00] !log otto@deploy1001 scap-helm eventgate-analytics finished [18:13:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:04] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:26] anybody swatting? [18:18:30] I have one to do [18:18:36] haven't added it yet [18:18:40] ottomata: welcome to the club :P [18:19:08] i have swat powers, but i haven't done the full procedure for others before [18:19:15] Lucas_WMDE: if it is just you and me and they are just config changes [18:19:16] i could do [18:19:32] ottomata: I could deploy my own config change too, that’s not the problem [18:19:39] but in my case the patch is still under discussion anyways [18:19:43] oh ok [18:19:46] ok, i'm going to do mine then. [18:19:52] okay [18:20:01] I’ll wait around a bit longer, see if anyone responds on Gerrit [18:20:40] (03CR) 10Ottomata: [C: 03+2] Enable eventgate-analytics api-request logging for group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502292 (https://phabricator.wikimedia.org/T214080) (owner: 10Ottomata) [18:21:11] PROBLEM - pdfrender on scb2001 is CRITICAL: connect to address 10.192.32.132 and port 5252: Connection refused https://phabricator.wikimedia.org/T174916 [18:22:09] PROBLEM - Host clouddb2001-dev is DOWN: PING CRITICAL - Packet loss = 100% [18:23:35] RECOVERY - Host clouddb2001-dev is UP: PING OK - Packet loss = 0%, RTA = 36.18 ms [18:24:30] !log restart pdfrender on scb2001 - T174916 [18:24:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:35] T174916: electron/pdfrender hangs - https://phabricator.wikimedia.org/T174916 [18:25:17] RECOVERY - pdfrender on scb2001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.074 second response time https://phabricator.wikimedia.org/T174916 [18:27:07] 10Operations, 10ops-codfw, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Kanban): labtestmetal2001.codfw.wmnet: rename to clouddb2001-dev and reimage to stretch in codfw1dev - https://phabricator.wikimedia.org/T220129 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['clouddb2001-dev.codf... 
[18:27:07] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable eventgate-analytics api-request logging for group0 wikis - T214080 (duration: 00m 56s) [18:27:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:11] T214080: Rewrite Avro schemas (ApiAction, CirrusSearchRequestSet) as JSONSchema and produce to EventGate - https://phabricator.wikimedia.org/T214080 [18:27:40] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@9cf5364]: Lower AQS rate limits and fix recommendation-api spec - T219910 T220221 (duration: 21m 14s) [18:27:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:44] T219910: AQS alerts due to big queries issued to Druid for the edit API - https://phabricator.wikimedia.org/T219910 [18:27:45] T220221: Recommendation API end point has disappeared after the upgrade - https://phabricator.wikimedia.org/T220221 [18:28:20] (03CR) 10jenkins-bot: Enable eventgate-analytics api-request logging for group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502292 (https://phabricator.wikimedia.org/T214080) (owner: 10Ottomata) [18:29:26] 10Operations, 10ops-codfw, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Kanban): labtestmetal2001.codfw.wmnet: rename to clouddb2001-dev and reimage to stretch in codfw1dev - https://phabricator.wikimedia.org/T220129 (10aborrero) [18:31:08] !log deploying wiktionary CNAME experiment - https://phabricator.wikimedia.org/T208263#5094712 [18:31:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:18] (03CR) 10BryanDavis: "> Here is a diagram with an alternative approach for you to consider:" [puppet] - 10https://gerrit.wikimedia.org/r/493767 (https://phabricator.wikimedia.org/T151704) (owner: 10BryanDavis) [18:33:47] (03PS3) 10Bstorm: cloudstore: A bit more cleanup [puppet] - 10https://gerrit.wikimedia.org/r/501446 (https://phabricator.wikimedia.org/T209527) [18:34:17] (03CR) 
10Ladsgroup: Stop using wmg variables for Score extension (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502269 (owner: 10Lucas Werkmeister (WMDE)) [18:36:15] (03CR) 10Bstorm: [C: 03+2] cloudstore: A bit more cleanup [puppet] - 10https://gerrit.wikimedia.org/r/501446 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [18:37:10] (03CR) 10Alexandros Kosiaris: [C: 03+2] profile::puppetmaster::frontend: Allow getting allow_from from hiera [puppet] - 10https://gerrit.wikimedia.org/r/502235 (https://phabricator.wikimedia.org/T171188) (owner: 10Alex Monk) [18:37:18] (03PS2) 10Alexandros Kosiaris: profile::puppetmaster::frontend: Allow getting allow_from from hiera [puppet] - 10https://gerrit.wikimedia.org/r/502235 (https://phabricator.wikimedia.org/T171188) (owner: 10Alex Monk) [18:37:23] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] profile::puppetmaster::frontend: Allow getting allow_from from hiera [puppet] - 10https://gerrit.wikimedia.org/r/502235 (https://phabricator.wikimedia.org/T171188) (owner: 10Alex Monk) [18:38:01] (03CR) 10Lucas Werkmeister (WMDE): Stop using wmg variables for Score extension (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502269 (owner: 10Lucas Werkmeister (WMDE)) [18:42:07] (03PS3) 10Bstorm: dynamicproxy: Prevent STS header from non-TLS connections [puppet] - 10https://gerrit.wikimedia.org/r/499669 (https://phabricator.wikimedia.org/T102367) (owner: 10BryanDavis) [18:43:51] (03CR) 10Bstorm: [C: 03+2] dynamicproxy: Prevent STS header from non-TLS connections [puppet] - 10https://gerrit.wikimedia.org/r/499669 (https://phabricator.wikimedia.org/T102367) (owner: 10BryanDavis) [18:45:22] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10serviceops, and 4 others: Introduce kask session storage service to kubernetes - https://phabricator.wikimedia.org/T220401 (10mobrovac) [18:45:39] I don’t think my config change is going to happen tonight [18:45:44] ottomata: you’re done, right? 
[18:45:48] yup am done [18:45:50] thanks [18:45:52] okay [18:45:56] !log Morning SWAT done [18:45:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:17] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10serviceops, and 2 others: TEC3:O3:O3.1:Q4 Goal - Move cpjobqueue, Wikidata Termbox SSR (new service), Kask (session storage service) and ORES (partially) through the production CD Pipeline - https://phabricator.wikimedia.org/T220398 (10mobrovac) [18:46:41] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10serviceops: Migrate ORES to kubernetes - https://phabricator.wikimedia.org/T220400 (10mobrovac) [18:47:34] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10serviceops, and 3 others: Migrate cpjobqueue to kubernetes - https://phabricator.wikimedia.org/T220399 (10mobrovac) [18:49:16] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10Wikidata, and 4 others: Introduce wikidata termbox SSR to kubernetes - https://phabricator.wikimedia.org/T220402 (10mobrovac) [18:51:33] (03CR) 10Ladsgroup: Stop using wmg variables for Score extension (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502269 (owner: 10Lucas Werkmeister (WMDE)) [18:52:22] 10Operations, 10ChangeProp, 10Release Pipeline, 10Release-Engineering-Team, and 5 others: Migrate cpjobqueue to kubernetes - https://phabricator.wikimedia.org/T220399 (10mobrovac) [18:53:36] (03PS1) 10Bstorm: Revert "dynamicproxy: Prevent STS header from non-TLS connections" [puppet] - 10https://gerrit.wikimedia.org/r/502308 [18:54:12] (03CR) 10Bstorm: [V: 03+2 C: 03+2] "Quick revert needed" [puppet] - 10https://gerrit.wikimedia.org/r/502308 (owner: 10Bstorm) [18:54:44] PROBLEM - toolschecker: tools homepage -admin tool- on tools.wmflabs.org is CRITICAL: connect to address tools.wmflabs.org and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [18:55:27] 10Operations, 
10Wikimedia-Logstash, 10service-runner, 10Core Platform Team Kanban (Done with CPT), and 2 others: Move service-runner to new logging infrastructure - https://phabricator.wikimedia.org/T211125 (10mobrovac) [18:55:30] 10Operations, 10Wikimedia-Logstash, 10service-runner, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 2 others: Move graphoid logging to new logging pipeline - https://phabricator.wikimedia.org/T219923 (10mobrovac) 05Open→03Stalled Stalling as per Alex' and Filippo's... [18:55:54] PROBLEM - HTTPS-wmflabs on tools.wmflabs.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/tag/toolforge/ [18:57:58] RECOVERY - toolschecker: tools homepage -admin tool- on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 1043 bytes in 0.003 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [18:58:06] RECOVERY - HTTPS-wmflabs on tools.wmflabs.org is OK: SSL OK - Certificate *.wmflabs.org valid until 2019-11-16 15:41:05 +0000 (expires in 221 days) https://phabricator.wikimedia.org/tag/toolforge/ [19:03:43] 10Operations, 10Analytics, 10netops: Allow swift https access from analytics to prod - https://phabricator.wikimedia.org/T220081 (10dr0ptp4kt) Thanks. Confirmed it works. [19:07:37] (03PS1) 10Bstorm: kube2proxy: fix type in the apt::repo bit [puppet] - 10https://gerrit.wikimedia.org/r/502315 [19:09:37] 10Operations, 10Analytics, 10Discovery, 10Research: Make hadoop cluster able to push to swift - https://phabricator.wikimedia.org/T219544 (10CDanis) So it sounds like the firewall work is done (thanks Arzhel!) Seems like the next thing is to create a Swift container for this usage -- and maybe one just fo... 
[19:12:45] (03PS2) 10Bstorm: kube2proxy: fix typo in the apt::repo bit [puppet] - 10https://gerrit.wikimedia.org/r/502315 [19:13:46] (03CR) 10Bstorm: [C: 03+2] kube2proxy: fix typo in the apt::repo bit [puppet] - 10https://gerrit.wikimedia.org/r/502315 (owner: 10Bstorm) [19:14:52] (03PS1) 10Thcipriani: Train: scap clean, feature flag prune branches [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502316 (https://phabricator.wikimedia.org/T218783) [19:20:49] (03CR) 10Andrew Bogott: "I turn out to need something like this for cloudvirt1024 as well:" [puppet] - 10https://gerrit.wikimedia.org/r/474272 (https://phabricator.wikimedia.org/T209707) (owner: 10Vgutierrez) [19:23:52] (03CR) 10Bstorm: "Currently has a rebase conflict. I'll need to rebase locally to get it to merge." [puppet] - 10https://gerrit.wikimedia.org/r/499029 (https://phabricator.wikimedia.org/T162570) (owner: 10BryanDavis) [19:26:58] (03PS2) 10Bstorm: toolforge: add python-bs4 package [puppet] - 10https://gerrit.wikimedia.org/r/499029 (https://phabricator.wikimedia.org/T162570) (owner: 10BryanDavis) [19:31:58] (03CR) 10Bstorm: [C: 03+2] toolforge: add python-bs4 package [puppet] - 10https://gerrit.wikimedia.org/r/499029 (https://phabricator.wikimedia.org/T162570) (owner: 10BryanDavis) [19:34:54] (03PS1) 10Andrew Bogott: cloudvirts: pool cloudvirt1008 and 1009 [puppet] - 10https://gerrit.wikimedia.org/r/502324 [19:35:19] !log starting promotion of 1.33.0-wmf.24 to group1 [19:35:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:50] (03PS1) 10Dduvall: group1 wikis to 1.33.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502328 [19:36:52] (03CR) 10Dduvall: [C: 03+2] group1 wikis to 1.33.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502328 (owner: 10Dduvall) [19:37:57] (03Merged) 10jenkins-bot: group1 wikis to 1.33.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502328 (owner: 10Dduvall) [19:38:11] (03CR) 10Andrew Bogott: 
[C: 03+2] cloudvirts: pool cloudvirt1008 and 1009 [puppet] - 10https://gerrit.wikimedia.org/r/502324 (owner: 10Andrew Bogott) [19:39:53] !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.33.0-wmf.24 [19:40:57] dduvall@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [19:41:15] !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.33.0-wmf.2 [19:41:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:22] !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.33.0-wmf.24 [19:41:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:41] !log dduvall@deploy1001 Synchronized php: group1 wikis to 1.33.0-wmf.24 (duration: 01m 46s) [19:41:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:26] (03CR) 10jenkins-bot: group1 wikis to 1.33.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502328 (owner: 10Dduvall) [19:47:36] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1008 with 10G interfaces - https://phabricator.wikimedia.org/T216661 (10Andrew) [19:48:18] !log promoting 1.33.0-wmf.24 to all wikis. 
cc: T220037, T206678 [19:48:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:23] T206678: 1.33.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T206678 [19:48:24] T220037: RefreshLinksJob::runForTitle: transaction round 'RefreshLinksJob::run' already started on commons - https://phabricator.wikimedia.org/T220037 [19:48:37] (03PS1) 10Dduvall: all wikis to 1.33.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502331 [19:48:39] (03CR) 10Dduvall: [C: 03+2] all wikis to 1.33.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502331 (owner: 10Dduvall) [19:49:23] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): Update label and switch to rename labvirt1008 to cloudvirt1008 - https://phabricator.wikimedia.org/T220443 (10Andrew) [19:49:48] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): Update label and switch to rename labvirt1008 to cloudvirt1008 - https://phabricator.wikimedia.org/T220443 (10Andrew) a:05RobH→03Cmjohnson [19:49:51] (03Merged) 10jenkins-bot: all wikis to 1.33.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502331 (owner: 10Dduvall) [19:50:06] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1008 with 10G interfaces - https://phabricator.wikimedia.org/T216661 (10Andrew) [19:51:20] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10Andrew) [19:51:23] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1009 with 10G interfaces - https://phabricator.wikimedia.org/T216324 (10Andrew) 05Open→03Resolved fixed, pooled, working! 
[19:51:33] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1012 with 10G interfaces - https://phabricator.wikimedia.org/T217346 (10Andrew) 05Open→03Resolved [19:51:38] 10Operations, 10ops-eqiad, 10DC-Ops: relocate/reimage cloudvirt1015 with 10G interfaces - https://phabricator.wikimedia.org/T217140 (10Andrew) 05Open→03Resolved [19:51:43] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10Andrew) [19:51:44] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [19:51:50] !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.33.0-wmf.24 [19:51:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:02] 10Operations, 10MediaWiki-General-or-Unknown, 10serviceops, 10Core Platform Team (PHP7 (TEC4)), 10PHP 7.2 support: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (10kchapman) [19:52:06] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1018 with 10G interfaces - https://phabricator.wikimedia.org/T217347 (10Andrew) 05Open→03Resolved a:03Andrew [19:52:09] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10Andrew) [19:52:33] 10Operations, 10Core Platform Team Kanban, 10MediaWiki-General-or-Unknown, 10serviceops, and 2 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (10kchapman) [19:52:54] marxarelli: btw i'm around if there are more jobqueue issues that arise during the train 
[19:53:12] mobrovac: right on. thanks! so far so good [19:53:13] 10Operations, 10Core Platform Team Backlog, 10MediaWiki-General-or-Unknown, 10serviceops, and 2 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (10kchapman) [19:53:21] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10Andrew) [19:53:34] 10Operations, 10MediaWiki-General-or-Unknown, 10serviceops, 10Core Platform Team (PHP7 (TEC4)), and 2 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (10kchapman) [19:53:41] marxarelli: All looking good? *loads* of timeouts in Title.php (which is surprising) but overall it's quietish. [19:53:46] PROBLEM - Apache HTTP on mw1341 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:54:06] 10Operations, 10MediaWiki-General-or-Unknown, 10serviceops, 10Core Platform Team (PHP7 (TEC4)), and 2 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (10kchapman) @Joe did you get what you needed from CPT in IRC? If... [19:54:32] PROBLEM - HHVM rendering on mw1341 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:54:53] James_F: afaik, timeouts can happen anywhere? beyond ensuring they dissipate after ~ 10 mins, i more or less ignore them [19:55:00] PROBLEM - Nginx local proxy to apache on mw1341 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:55:56] marxarelli: They "can", but they previously almost always happened in LuaSandbox or the Parser (which have deep loops). 
[19:56:58] RECOVERY - HHVM rendering on mw1341 is OK: HTTP OK: HTTP/1.1 200 OK - 79515 bytes in 0.131 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:57:25] marxarelli: Let's declare it done. [19:57:28] RECOVERY - Nginx local proxy to apache on mw1341 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 0.173 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:57:30] RECOVERY - Apache HTTP on mw1341 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.055 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:57:32] hmm, i wan't aware of that. i had assumed they were due to hhvm bytecode cache, and would happen during JIT compilation [19:57:37] (03CR) 10jenkins-bot: all wikis to 1.33.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502331 (owner: 10Dduvall) [19:57:44] James_F: yeah, seems good to me [19:57:49] Yay. [19:59:26] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [19:59:29] !log promotion of 1.33.0-wmf.24 to all wikis completed. error rates nominal aside from usual timeouts. cc: T206678, T220037 [19:59:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:36] T206678: 1.33.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T206678 [19:59:37] T220037: RefreshLinksJob::runForTitle: transaction round 'RefreshLinksJob::run' already started on commons - https://phabricator.wikimedia.org/T220037 [20:00:04] cscott, arlolra, subbu, bearND, halfak, and Amir1: (Dis)respected human, time to deploy Services – Parsoid / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190408T2000). Please do the needful. 
[20:05:49] !log bsitzmann@deploy1001 Started deploy [mobileapps/deploy@c7fa522]: Update mobileapps to cdb9928 (T220045 T219411 T219667) [20:05:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:03] T219411: External links no longer have arrow icon in Light theme. - https://phabricator.wikimedia.org/T219411 [20:06:03] T219667: Add wikibase entity id for image files to media endpoint - https://phabricator.wikimedia.org/T219667 [20:06:04] T220045: Stop getting base CSS from live ResourceLoader requests - https://phabricator.wikimedia.org/T220045 [20:07:59] !log bsitzmann@deploy1001 Finished deploy [mobileapps/deploy@c7fa522]: Update mobileapps to cdb9928 (T220045 T219411 T219667) (duration: 02m 10s) [20:08:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:58] !log mobileapps deploy failed on canary (Check 'endpoints' failed). Rolled back canary. [20:08:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:20] 10Operations, 10Fundraising-Backlog, 10Mail, 10fundraising-tech-ops: Identify appropriate SPF record for domain wikimediafoundation.org - https://phabricator.wikimedia.org/T220412 (10Jgreen) As far as I know, fundraising does not send mail using this domain but only from wikimedia.org, so I don't think our... [20:10:42] no parsoid deploy today [20:24:50] !log bootstrapping cassandra-b, restbase2019 -- T208087 [20:24:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:58] T208087: Replace remaining Samsung SSDs - https://phabricator.wikimedia.org/T208087 [20:29:15] PROBLEM - puppet last run on an-worker1080 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [20:37:39] (03PS2) 10Bstorm: postgresql: set max_wal_senders on slave conf [puppet] - 10https://gerrit.wikimedia.org/r/501384 (https://phabricator.wikimedia.org/T219652) [20:41:07] !log bsitzmann@deploy1001 Started deploy [mobileapps/deploy@c7fa522]: Update mobileapps to cdb9928 (T220045 T219411 T219667) [20:41:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:13] T219411: External links no longer have arrow icon in Light theme. - https://phabricator.wikimedia.org/T219411 [20:41:13] T219667: Add wikibase entity id for image files to media endpoint - https://phabricator.wikimedia.org/T219667 [20:41:14] T220045: Stop getting base CSS from live ResourceLoader requests - https://phabricator.wikimedia.org/T220045 [20:43:51] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/data/css/mobile/base (Get base CSS) is CRITICAL: Could not fetch url http://10.192.32.132:8888/en.wikipedia.org/v1/data/css/mobile/base: Generic connection error: HTTPConnectionPool(host=u10.192.32.132, port=8888): Max retries exceeded with url: /en.wikipedia.org/v1/data/css/mobile/base (Caused by ProtocolError(Connection aborted., BadStatusLine(,))) https [20:43:51] edia.org/wiki/Services/Monitoring/mobileapps [20:49:01] !log bsitzmann@deploy1001 Finished deploy [mobileapps/deploy@c7fa522]: Update mobileapps to cdb9928 (T220045 T219411 T219667) (duration: 07m 55s) [20:49:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:08] T219411: External links no longer have arrow icon in Light theme. 
- https://phabricator.wikimedia.org/T219411 [20:49:08] T219667: Add wikibase entity id for image files to media endpoint - https://phabricator.wikimedia.org/T219667 [20:49:08] T220045: Stop getting base CSS from live ResourceLoader requests - https://phabricator.wikimedia.org/T220045 [20:50:17] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [20:51:09] PROBLEM - HHVM jobrunner on mw1302 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [20:51:59] PROBLEM - HHVM jobrunner on mw1306 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [20:52:27] RECOVERY - HHVM jobrunner on mw1302 is OK: HTTP OK: HTTP/1.1 200 OK - 270 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [20:53:15] RECOVERY - HHVM jobrunner on mw1306 is OK: HTTP OK: HTTP/1.1 200 OK - 270 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [20:55:29] RECOVERY - puppet last run on an-worker1080 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:59:41] (03PS1) 10Kosta Harlan: GrowthExperiments: Enable Homepage logging on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502338 [21:00:05] bawolff and Reedy: #bothumor I � Unicode. All rise for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190408T2100). 
[21:14:01] (03PS1) 10Dr0ptp4kt: Add dr0ptp4kt to gpu-testers [puppet] - 10https://gerrit.wikimedia.org/r/502341 (https://phabricator.wikimedia.org/T148843) [21:15:05] (03CR) 10jerkins-bot: [V: 04-1] Add dr0ptp4kt to gpu-testers [puppet] - 10https://gerrit.wikimedia.org/r/502341 (https://phabricator.wikimedia.org/T148843) (owner: 10Dr0ptp4kt) [21:20:19] (03PS2) 10Dr0ptp4kt: Add dr0ptp4kt to gpu-testers [puppet] - 10https://gerrit.wikimedia.org/r/502341 (https://phabricator.wikimedia.org/T148843) [21:25:18] (03CR) 10Catrope: [C: 04-1] "Why is this needed? It's already enabled for all wikis in InitialiseSettings.php, that should carry over to beta labs." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502338 (owner: 10Kosta Harlan) [21:26:28] 10Operations, 10Analytics, 10Research-management, 10Patch-For-Review, 10User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (10dr0ptp4kt) Hi, I'm requesting access to gpu-testers as well in order t... [21:37:18] (03PS1) 10Bstorm: cloudstore: deploy maps/scratch cluster as nfs::secondary [puppet] - 10https://gerrit.wikimedia.org/r/502342 (https://phabricator.wikimedia.org/T209527) [21:38:17] (03CR) 10jerkins-bot: [V: 04-1] cloudstore: deploy maps/scratch cluster as nfs::secondary [puppet] - 10https://gerrit.wikimedia.org/r/502342 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [21:38:19] (03PS5) 10CRusnov: Add synchronizing nodes to ganeti-netbox sync. 
[software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/498268 [21:38:32] (03PS2) 10Bstorm: cloudstore: deploy maps/scratch cluster as nfs::secondary [puppet] - 10https://gerrit.wikimedia.org/r/502342 (https://phabricator.wikimedia.org/T209527) [21:39:10] (03CR) 10jerkins-bot: [V: 04-1] cloudstore: deploy maps/scratch cluster as nfs::secondary [puppet] - 10https://gerrit.wikimedia.org/r/502342 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [21:41:00] (03CR) 10CRusnov: "Thanks for catching those. This has been tested on af-netbox01 now." (033 comments) [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/498268 (owner: 10CRusnov) [21:52:35] (03PS3) 10Bstorm: cloudstore: deploy maps/scratch cluster as nfs::secondary [puppet] - 10https://gerrit.wikimedia.org/r/502342 (https://phabricator.wikimedia.org/T209527) [21:53:48] 10Operations, 10fundraising-tech-ops, 10netops, 10Patch-For-Review: Revoke production prometheus fundraising access - https://phabricator.wikimedia.org/T217355 (10cwdent) [21:54:48] 10Operations, 10fundraising-tech-ops, 10netops, 10Patch-For-Review: Revoke production prometheus fundraising access - https://phabricator.wikimedia.org/T217355 (10cwdent) @ayounsi there is a new config at -1554758904, removing prod prometheus and grafana access to pay-lvs servers. 
[21:57:10] (03PS1) 10Bstorm: labstore: a touch more cleanup of the secondary modules [puppet] - 10https://gerrit.wikimedia.org/r/502344 (https://phabricator.wikimedia.org/T209527) [21:59:05] (03CR) 10Bstorm: [C: 03+2] labstore: a touch more cleanup of the secondary modules [puppet] - 10https://gerrit.wikimedia.org/r/502344 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [22:00:03] (03PS1) 10Jgreen: flip payments.wikimedia.org back to codfw cluster [dns] - 10https://gerrit.wikimedia.org/r/502345 [22:00:59] !log pfw firewall rules update - T217355 [22:01:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:01:04] T217355: Revoke production prometheus fundraising access - https://phabricator.wikimedia.org/T217355 [22:01:25] PROBLEM - HHVM jobrunner on mw1310 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [22:02:39] RECOVERY - HHVM jobrunner on mw1310 is OK: HTTP OK: HTTP/1.1 200 OK - 270 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [22:04:57] !log jforrester@deploy1001 Synchronized php-1.33.0-wmf.24/extensions/WikibaseMediaInfo/src/WikibaseMediaInfoHooks.php: WBMI T220277 (duration: 00m 57s) [22:05:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:05:01] T220277: SDC file page JS doesn't see existing captions in some circumstances - https://phabricator.wikimedia.org/T220277 [22:06:51] (03CR) 10Jgreen: [C: 03+2] flip payments.wikimedia.org back to codfw cluster [dns] - 10https://gerrit.wikimedia.org/r/502345 (owner: 10Jgreen) [22:18:41] !log enable sampling on eqiad Telia transit link [22:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:29:08] !log enable sampling on cr2-eqiad external links [22:29:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:38:34] (03CR) 10Gehel: [C: 04-1] Netbox 
module for Spicerack (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/493138 (https://phabricator.wikimedia.org/T217072) (owner: 10CRusnov) [22:41:49] PROBLEM - Host elastic2048 is DOWN: PING CRITICAL - Packet loss = 100% [22:44:59] :S [22:45:14] !log rollback enable sampling on cr2-eqiad external links [22:45:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:46:46] oh right, 2048 was the one that lost it's disk and had it replaced. Thought it would have been silenced [22:52:59] (03PS13) 10EBernhardson: Disable wbcs dispatching query builder on commons (1/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500070 (https://phabricator.wikimedia.org/T218954) [22:54:03] (03PS14) 10EBernhardson: Disable wbcs dispatching query builder on commons (1/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500070 (https://phabricator.wikimedia.org/T218954) [22:54:05] (03PS7) 10EBernhardson: Disable wbcs dispatching query builder on commons (2/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500777 (https://phabricator.wikimedia.org/T218954) [22:54:07] (03PS7) 10EBernhardson: Disable wbcs dispatching query builder on commons (3/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500778 (https://phabricator.wikimedia.org/T218954) [22:58:21] (03CR) 10Gehel: "minor style issues, but looks pretty good to me!" (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/499032 (owner: 10CRusnov) [23:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190408T2300). [23:00:04] ebernhardson: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:04:33] just me, i'll ship things and hopefully not break the world [23:05:21] (03CR) 10EBernhardson: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500070 (https://phabricator.wikimedia.org/T218954) (owner: 10EBernhardson) [23:06:24] (03Merged) 10jenkins-bot: Disable wbcs dispatching query builder on commons (1/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500070 (https://phabricator.wikimedia.org/T218954) (owner: 10EBernhardson) [23:08:57] 10Operations, 10Traffic, 10Patch-For-Review: tagged_interface sometimes exceeds IFNAMSIZ - https://phabricator.wikimedia.org/T209707 (10faidon) I think this is addressed by systemd's [[ https://github.com/systemd/systemd/commit/9009d3b5c3b6d191be69215736be77583e0f23f9 | 9009d3b5c3b6d191be69215736be77583e0f23... [23:10:07] (03CR) 10EBernhardson: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500777 (https://phabricator.wikimedia.org/T218954) (owner: 10EBernhardson) [23:10:25] !log ebernhardson@deploy1001 Synchronized wmf-config/: T218954: Disable wbcs dispatching query builder on commons (1/3) (duration: 00m 52s) [23:10:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:31] T218954: Default to article search on commons + wikibase (aka SDC) - https://phabricator.wikimedia.org/T218954 [23:10:47] (03CR) 10jenkins-bot: Disable wbcs dispatching query builder on commons (1/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500070 (https://phabricator.wikimedia.org/T218954) (owner: 10EBernhardson) [23:11:31] (03Merged) 10jenkins-bot: Disable wbcs dispatching query builder on commons (2/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500777 (https://phabricator.wikimedia.org/T218954) (owner: 10EBernhardson) [23:11:33] (03CR) 10jenkins-bot: Disable wbcs dispatching query builder on commons (2/3) [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/500777 (https://phabricator.wikimedia.org/T218954) (owner: 10EBernhardson) [23:12:27] SMalyshev: pulled 2/3, which does the actual switch, to mwdebug1002. [23:14:25] not sure what to do with this on scap pull from mwdebug1002: 23:11:43 Opcache invalidation failed. Consider performing it manually. [23:14:33] it needs a link to what that means :P [23:14:46] maybe it's only for php7... [23:15:10] 10Operations, 10Toolforge, 10Traffic, 10HTTPS, 10Patch-For-Review: Migrate tools.wmflabs.org to https only (and set HSTS) - https://phabricator.wikimedia.org/T102367 (10bd808) >>! In T102367#5056919, @Vgutierrez wrote: > Currently tools.wmflabs.org is violating [[ https://tools.ietf.org/html/rfc6797#sect... [23:15:34] yea, scap code seems to suggest it's only php7 [23:16:47] (03Abandoned) 10Kosta Harlan: GrowthExperiments: Enable Homepage logging on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502338 (owner: 10Kosta Harlan) [23:17:21] SMalyshev: no obvious breakages, syncing unless you see something? [23:21:51] let me see (unless you already merged :) [23:22:33] SMalyshev: merged yes, deployed no :) It's only on mwdebug1002 [23:22:36] ebernhardson: I get a failure trying to do search on commons... not sure if related [23:22:40] SMalyshev: err, mwmaint1002 [23:23:02] mwmaint? [23:23:04] doh, i bet that's why ... 
silly me should be pulling to mwdebug1002 [23:23:14] 10Operations, 10Parsoid, 10RESTBase, 10VisualEditor, and 5 others: Consider stashing data-parsoid for VE - https://phabricator.wikimedia.org/T215956 (10mobrovac) [23:23:14] the extension does not support it [23:23:17] (i be tthats why opcache invalidation failed) [23:23:42] mwdebug1002 works [23:23:48] sec pulling patch there [23:23:52] ok [23:24:10] SMalyshev: ok it's on mwdebug1002 [23:31:36] can't find anything wrong, going to ship [23:32:30] yeah looks good to me too [23:33:13] 3/3 is the main one though [23:33:33] !log ebernhardson@deploy1001 Synchronized wmf-config/Wikibase.php: T218954: Disable wbcs dispatching query builder on commons (2/3) (duration: 00m 52s) [23:33:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:33:37] T218954: Default to article search on commons + wikibase (aka SDC) - https://phabricator.wikimedia.org/T218954 [23:37:05] nothing seems to be broken still [23:37:07] good :) [23:37:14] (03CR) 10EBernhardson: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500778 (https://phabricator.wikimedia.org/T218954) (owner: 10EBernhardson) [23:38:12] (03Merged) 10jenkins-bot: Disable wbcs dispatching query builder on commons (3/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500778 (https://phabricator.wikimedia.org/T218954) (owner: 10EBernhardson) [23:41:40] !log ebernhardson@deploy1001 Synchronized wmf-config: T218954: Disable wbcs dispatching query builder on commons (3/3) (duration: 00m 51s) [23:41:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:41:43] T218954: Default to article search on commons + wikibase (aka SDC) - https://phabricator.wikimedia.org/T218954 [23:43:58] (03CR) 10jenkins-bot: Disable wbcs dispatching query builder on commons (3/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500778 (https://phabricator.wikimedia.org/T218954) (owner: 10EBernhardson) [23:45:11] !log 
ebernhardson@deploy1001 Synchronized wmf-config: T218954: Disable wbcs dispatching query builder on commons (3/3) (duration: 00m 52s) [23:45:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:48:08] SMalyshev: hmm, some brief errors from ArticlePlaceholder , ItemNotabilityFilter (which attaches to SpecialSearch): https://test.wikidata.org/w/index.php?search=test&title=Special%3ASearch&fulltext=1&ns0=1&ns120=1 [23:48:15] SMalyshev: any idea if thats related? [23:48:30] (well, by brief i mean it came up one, but rerunning the test page brought it up again) [23:48:33] ebernhardson: I think those happened before [23:49:02] from what I remember, they've been there for a while - check logs back. so I don't think it's related [23:49:04] hmm, yea looking back further in logstash finding a few [23:49:09] so not new :) [23:49:43] ok, so next up $wmgNewWikibaseCirrusSearch=true for commonswiki [23:49:45] commons search still alive [23:49:52] (03PS3) 10EBernhardson: Enable WBCS search on commons too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501436 (https://phabricator.wikimedia.org/T218954) (owner: 10Smalyshev) [23:49:58] (03CR) 10EBernhardson: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501436 (https://phabricator.wikimedia.org/T218954) (owner: 10Smalyshev) [23:51:12] (03Merged) 10jenkins-bot: Enable WBCS search on commons too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501436 (https://phabricator.wikimedia.org/T218954) (owner: 10Smalyshev) [23:53:12] SMalyshev: live on mwdebug1002. Looks to still work [23:54:56] (03CR) 10jenkins-bot: Enable WBCS search on commons too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501436 (https://phabricator.wikimedia.org/T218954) (owner: 10Smalyshev) [23:55:12] checking [23:55:24] nothng seems broken [23:56:10] but inlabel: does not work [23:56:21] should this be the case? [23:56:55] SMalyshev: hmm, commons shouldn't have any labels yet afaik? 
[23:57:05] lemme check elastic [23:57:08] yeah but it doesn't even build the query [23:57:11] !log ebernhardson@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T218954: Enable WBCS search on commons too (duration: 00m 50s) [23:57:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:57:15] T218954: Default to article search on commons + wikibase (aka SDC) - https://phabricator.wikimedia.org/T218954 [23:57:29] oh wait no now it does [23:57:35] yea i'm seeing labels_all.plain query [23:57:46] so ok then [23:58:07] btw the request from mdholloway was basically a haslabel: query, like +haslabel:en -haslabel:de [23:58:48] SMalyshev: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/501435/ too? [23:59:15] if i remember right this should be a noop equivilant