[00:51:05] PROBLEM - MariaDB Slave Lag: s3 on db2098 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 990.52 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [02:14:03] RECOVERY - MariaDB Slave Lag: s3 on db2098 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [02:43:37] PROBLEM - termbox codfw on termbox.svc.codfw.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [02:45:13] RECOVERY - termbox codfw on termbox.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [03:29:22] (03PS1) 10CRusnov: ganeti: Add ability to get ganeti cluster for given instance [software/spicerack] - 10https://gerrit.wikimedia.org/r/533984 (https://phabricator.wikimedia.org/T231068) [04:12:33] (03PS1) 10Dzahn: peopleweb: include envoy for TLS termination [puppet] - 10https://gerrit.wikimedia.org/r/533985 (https://phabricator.wikimedia.org/T210411) [04:19:55] 10Operations, 10Traffic, 10serviceops, 10Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (10Dzahn) [04:21:02] (03PS1) 10Dzahn: add peopleweb.discovery.wmnet [dns] - 10https://gerrit.wikimedia.org/r/533986 (https://phabricator.wikimedia.org/T210411) [04:22:06] (03CR) 10Dzahn: [C: 03+2] add peopleweb.discovery.wmnet [dns] - 10https://gerrit.wikimedia.org/r/533986 (https://phabricator.wikimedia.org/T210411) (owner: 10Dzahn) [04:28:48] !log Switching cp2002 from nginx to ats-tls - T231433 [04:28:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:28:50] T231433: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 [04:29:47] (03CR) 10Vgutierrez: [C: 03+2] hiera: Move nginx from port 443 to 4443 
on cp2002 [puppet] - 10https://gerrit.wikimedia.org/r/532984 (https://phabricator.wikimedia.org/T231433) (owner: 10Vgutierrez) [04:29:56] (03PS2) 10Vgutierrez: hiera: Move nginx from port 443 to 4443 on cp2002 [puppet] - 10https://gerrit.wikimedia.org/r/532984 (https://phabricator.wikimedia.org/T231433) [04:36:01] !log upgrading ATS to 8.0.5-1wm4 on cp2002 - T231433 [04:36:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:36:03] T231433: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 [04:37:29] (03CR) 10Vgutierrez: [C: 03+2] hiera: Move ats-tls from port 8443 to port 443 on cp2002 [puppet] - 10https://gerrit.wikimedia.org/r/532985 (https://phabricator.wikimedia.org/T231433) (owner: 10Vgutierrez) [04:37:40] (03PS2) 10Vgutierrez: hiera: Move ats-tls from port 8443 to port 443 on cp2002 [puppet] - 10https://gerrit.wikimedia.org/r/532985 (https://phabricator.wikimedia.org/T231433) [04:38:13] PROBLEM - HTTPS Unified ECDSA on cp2002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [04:39:07] PROBLEM - HTTPS Unified RSA on cp2002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [04:44:19] (03PS1) 10CRusnov: netbox: Transparently support read-only operations for virtual machines [software/spicerack] - 10https://gerrit.wikimedia.org/r/533987 (https://phabricator.wikimedia.org/T231068) [04:44:37] RECOVERY - HTTPS Unified ECDSA on cp2002 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345568 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2019-11-22 07:59:59 +0000 (expires in 80 days) https://wikitech.wikimedia.org/wiki/HTTPS [04:45:31] RECOVERY - HTTPS Unified RSA on cp2002 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345513 seconds left:Certificate 
*.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2019-11-22 07:59:59 +0000 (expires in 80 days) https://wikitech.wikimedia.org/wiki/HTTPS [04:47:42] 10Operations, 10ops-codfw, 10DC-Ops: mw2231 is down and unable to reboot - https://phabricator.wikimedia.org/T231192 (10Joe) I second what @MoritzMuehlenhoff suggested. The system is not scheduled for replacement for another 2 years, so if we can salvage it somehow, that'd be great. [04:49:38] !log repooling cp2002 - T231433 [04:49:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:49:41] T231433: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 [04:50:06] !Drop filejournal table on s3 - T51195 [04:50:06] T51195: Drop filejournal table from WMF - https://phabricator.wikimedia.org/T51195 [04:50:10] !log Drop filejournal table on s3 - T51195 [04:50:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:57:56] (03PS1) 10Vgutierrez: hiera: Remove trailing line on cp2002 yaml file [puppet] - 10https://gerrit.wikimedia.org/r/533988 [04:58:46] (03PS1) 10Marostegui: mariadb: Promote db2118 to s7 codfw master [puppet] - 10https://gerrit.wikimedia.org/r/533989 (https://phabricator.wikimedia.org/T230106) [04:59:57] (03CR) 10Vgutierrez: [C: 03+2] hiera: Remove trailing line on cp2002 yaml file [puppet] - 10https://gerrit.wikimedia.org/r/533988 (owner: 10Vgutierrez) [05:01:43] (03PS1) 10Marostegui: db-codfw.php: Promote db2118 to s7 codfw master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533990 (https://phabricator.wikimedia.org/T230106) [05:02:18] !log Promote db2118 to s7 codfw master (db2047 -> db2118) T230106 [05:02:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:02:20] T230106: Switchover codfw primary database masters to new hosts - https://phabricator.wikimedia.org/T230106 [05:05:29] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db2118 
to s7 codfw master [puppet] - 10https://gerrit.wikimedia.org/r/533989 (https://phabricator.wikimedia.org/T230106) (owner: 10Marostegui) [05:05:40] (03PS2) 10Marostegui: mariadb: Promote db2118 to s7 codfw master [puppet] - 10https://gerrit.wikimedia.org/r/533989 (https://phabricator.wikimedia.org/T230106) [05:12:26] (03CR) 10Marostegui: [C: 03+2] db-codfw.php: Promote db2118 to s7 codfw master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533990 (https://phabricator.wikimedia.org/T230106) (owner: 10Marostegui) [05:12:46] (03CR) 10jenkins-bot: db-codfw.php: Promote db2118 to s7 codfw master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533990 (https://phabricator.wikimedia.org/T230106) (owner: 10Marostegui) [05:14:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db2118 to s7 codfw master (db2047 -> db2118) T230106', diff saved to https://phabricator.wikimedia.org/P9026 and previous config saved to /var/cache/conftool/dbconfig/20190903-051450-marostegui.json [05:14:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:14:53] T230106: Switchover codfw primary database masters to new hosts - https://phabricator.wikimedia.org/T230106 [05:16:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2047 old master from s7 T230106', diff saved to https://phabricator.wikimedia.org/P9027 and previous config saved to /var/cache/conftool/dbconfig/20190903-051619-marostegui.json [05:16:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:17:09] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Promote db2118 to s7 codfw master (db2047 -> db2118) T230106 (duration: 00m 54s) [05:17:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:19:49] 10Operations, 10DBA: Drop puppet database from m1 - https://phabricator.wikimedia.org/T231539 (10Marostegui) a:03Marostegui [05:20:50] (03PS1) 10Dzahn: add fake SSL key for peopleweb [labs/private] - 
10https://gerrit.wikimedia.org/r/533991 [05:22:01] 10Operations, 10DBA: Drop puppet database from m1 - https://phabricator.wikimedia.org/T231539 (10Marostegui) I have left a backup of this DB at: ` cumin1001:/home/marostegui/T231539 ` [05:22:16] !log Rename tables on the puppet database on m1 master - T231539 [05:22:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:22:18] T231539: Drop puppet database from m1 - https://phabricator.wikimedia.org/T231539 [05:24:42] 10Operations, 10DBA: Drop puppet database from m1 - https://phabricator.wikimedia.org/T231539 (10Marostegui) I have renamed the tables on the `puppet` DB, I will leave them for a few hours before dropping the database: ` # mysql.py -hdb1063 puppet -e "show tables" -BN TO_DROP_auth_group TO_DROP_auth_group_perm... [05:27:34] (03CR) 10Dzahn: [V: 03+2 C: 03+2] add fake SSL key for peopleweb [labs/private] - 10https://gerrit.wikimedia.org/r/533991 (owner: 10Dzahn) [05:35:48] (03PS2) 10Dzahn: peopleweb: add TLS termination with envoy [puppet] - 10https://gerrit.wikimedia.org/r/533985 (https://phabricator.wikimedia.org/T210411) [05:37:34] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 59.56 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [05:40:48] (03PS3) 10Dzahn: peopleweb: add TLS termination with envoy [puppet] - 10https://gerrit.wikimedia.org/r/533985 (https://phabricator.wikimedia.org/T210411) [05:44:11] (03PS1) 10Marostegui: db-codfw.php: Reorganize s7 codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533993 (https://phabricator.wikimedia.org/T230106) [05:45:32] (03CR) 10Marostegui: [C: 03+2] db-codfw.php: Reorganize s7 codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533993 (https://phabricator.wikimedia.org/T230106) (owner: 10Marostegui) [05:46:30] (03Merged) 10jenkins-bot: 
db-codfw.php: Reorganize s7 codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533993 (https://phabricator.wikimedia.org/T230106) (owner: 10Marostegui) [05:46:46] (03CR) 10jenkins-bot: db-codfw.php: Reorganize s7 codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533993 (https://phabricator.wikimedia.org/T230106) (owner: 10Marostegui) [05:47:49] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Reorganize s7 codfw T230106 (duration: 00m 54s) [05:47:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:47:51] T230106: Switchover codfw primary database masters to new hosts - https://phabricator.wikimedia.org/T230106 [05:48:26] (03CR) 10Dzahn: [C: 03+2] peopleweb: add TLS termination with envoy [puppet] - 10https://gerrit.wikimedia.org/r/533985 (https://phabricator.wikimedia.org/T210411) (owner: 10Dzahn) [05:48:36] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 74.14 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [05:52:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Reorganize s7 codfw T230106', diff saved to https://phabricator.wikimedia.org/P9028 and previous config saved to /var/cache/conftool/dbconfig/20190903-055234-marostegui.json [05:52:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:53:18] !log people.wikimedia.org - switching to TLS termination with envoy [05:53:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:54:51] 10Operations, 10Traffic: Tune ATS SSL session cache - https://phabricator.wikimedia.org/T231849 (10Vgutierrez) [05:55:45] 10Operations, 10Traffic: Tune ATS SSL session cache - https://phabricator.wikimedia.org/T231849 (10Vgutierrez) p:05Triage→03Normal [05:57:08] (03PS4) 10Giuseppe Lavagetto: wdqs: convert to profile::lvs::realserver 
[puppet] - 10https://gerrit.wikimedia.org/r/532666 [05:57:47] (03PS1) 10Vgutierrez: hiera: Increase ATS SSL session cache capacity to 4M sessions [puppet] - 10https://gerrit.wikimedia.org/r/533994 (https://phabricator.wikimedia.org/T231849) [05:59:03] (03PS1) 10Dzahn: ATS: switch people.wikimedia.org to https backend [puppet] - 10https://gerrit.wikimedia.org/r/533995 (https://phabricator.wikimedia.org/T210411) [06:00:12] (03CR) 10Dzahn: [C: 03+2] ATS: switch people.wikimedia.org to https backend [puppet] - 10https://gerrit.wikimedia.org/r/533995 (https://phabricator.wikimedia.org/T210411) (owner: 10Dzahn) [06:00:22] (03PS2) 10Dzahn: ATS: switch people.wikimedia.org to https backend [puppet] - 10https://gerrit.wikimedia.org/r/533995 (https://phabricator.wikimedia.org/T210411) [06:02:15] <_joe_> grrrr [06:02:28] <_joe_> mutante: please let me merge [06:02:41] (03PS5) 10Giuseppe Lavagetto: wdqs: convert to profile::lvs::realserver [puppet] - 10https://gerrit.wikimedia.org/r/532666 [06:02:46] <_joe_> this is so fcking slow [06:02:59] _joe_: oh, sure. i will take a break but this one is already too late i'm afraid [06:03:03] <_joe_> ff-only is the choice of people that don't merge much. [06:03:10] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] wdqs: convert to profile::lvs::realserver [puppet] - 10https://gerrit.wikimedia.org/r/532666 (owner: 10Giuseppe Lavagetto) [06:03:49] _joe_: want me to type 'multiple' ? [06:04:42] !log Change min_replicas to 4 on s7 for eqiad and codfw T231019 [06:04:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:04:45] T231019: set min_replicas on database sections in dbctl - https://phabricator.wikimedia.org/T231019 [06:05:54] <_joe_> mutante: please do, I was getting errors from puppet merge :P [06:06:14] _joe_: merged! [06:06:22] well.. 
in the process of merging [06:06:34] <_joe_> eheh it's a long process nowadays [06:06:44] <_joe_> I never looked into what's making it so damn slow [06:06:58] i just know that labs/private is now part of it [06:07:05] done! [06:10:43] !log running puppet on cp-text_eqiad to switch people.wm.org to https backend [06:10:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:20:45] (03PS1) 10Dzahn: puppet/site: fail if no role has been assigned to a node [puppet] - 10https://gerrit.wikimedia.org/r/534005 [06:21:49] (03CR) 10jerkins-bot: [V: 04-1] puppet/site: fail if no role has been assigned to a node [puppet] - 10https://gerrit.wikimedia.org/r/534005 (owner: 10Dzahn) [06:22:38] 10Operations, 10Traffic, 10serviceops, 10Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (10Dzahn) [06:23:12] 10Operations, 10Traffic, 10serviceops, 10Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (10Dzahn) [06:23:51] (03PS2) 10Dzahn: puppet/site: fail if no role has been assigned to a node [puppet] - 10https://gerrit.wikimedia.org/r/534005 [06:24:32] (03CR) 10jerkins-bot: [V: 04-1] puppet/site: fail if no role has been assigned to a node [puppet] - 10https://gerrit.wikimedia.org/r/534005 (owner: 10Dzahn) [06:28:05] (03PS3) 10Dzahn: puppet/site: fail if no role has been assigned to a node [puppet] - 10https://gerrit.wikimedia.org/r/534005 [06:30:04] 10Operations, 10DBA: Predictive failures on disk S.M.A.R.T. 
status - https://phabricator.wikimedia.org/T208323 (10Marostegui) [06:31:19] 10Operations, 10DBA: Decommission db2047.codfw.wmnet - https://phabricator.wikimedia.org/T231852 (10Marostegui) [06:31:34] 10Operations, 10DBA: Decommission db2047.codfw.wmnet - https://phabricator.wikimedia.org/T231852 (10Marostegui) p:05Triage→03Normal [06:31:54] 10Operations, 10DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (10Marostegui) [06:34:46] (03PS1) 10Marostegui: db2047: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/534010 (https://phabricator.wikimedia.org/T231852) [06:35:01] (03CR) 10Vgutierrez: [C: 03+2] Release 0.21 [software/acme-chief] - 10https://gerrit.wikimedia.org/r/533856 (https://phabricator.wikimedia.org/T219765) (owner: 10Vgutierrez) [06:35:34] (03CR) 10Marostegui: [C: 03+2] db2047: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/534010 (https://phabricator.wikimedia.org/T231852) (owner: 10Marostegui) [06:36:14] 10Operations, 10DBA, 10Patch-For-Review: Decommission db2047.codfw.wmnet - https://phabricator.wikimedia.org/T231852 (10Marostegui) [06:39:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1133 with weight 0 T229657', diff saved to https://phabricator.wikimedia.org/P9029 and previous config saved to /var/cache/conftool/dbconfig/20190903-063932-marostegui.json [06:39:35] marostegui@cumin1001: Failed to log message to wiki. Somebody should check the error logs. 
[06:39:36] T229657: Switchover m5 primary master: db1073 to db1133: Tuesday 3rd Sept at 13:00 UTC - https://phabricator.wikimedia.org/T229657 [06:40:26] 10Operations, 10DBA, 10wikitech.wikimedia.org, 10Patch-For-Review, 10cloud-services-team (Kanban): Switchover m5 primary master: db1073 to db1133: Tuesday 3rd Sept at 13:00 UTC - https://phabricator.wikimedia.org/T229657 (10Marostegui) [06:41:23] (03PS5) 10Marostegui: wmnet: Promote db1133 to m5 master [dns] - 10https://gerrit.wikimedia.org/r/529333 (https://phabricator.wikimedia.org/T229657) [06:53:57] (03CR) 10jenkins-bot: Release 0.21 [software/acme-chief] - 10https://gerrit.wikimedia.org/r/533856 (https://phabricator.wikimedia.org/T219765) (owner: 10Vgutierrez) [06:59:50] (03CR) 10Dzahn: "gentle ping. we said it needs discussion back in June. that's been a couple months." [puppet] - 10https://gerrit.wikimedia.org/r/510753 (https://phabricator.wikimedia.org/T223463) (owner: 10Rush) [07:00:43] (03PS3) 10Dzahn: Remove wikimedia.ee [dns] - 10https://gerrit.wikimedia.org/r/459835 (https://phabricator.wikimedia.org/T204056) (owner: 10Reedy) [07:01:17] (03CR) 10Dzahn: [C: 04-1] "not stalled anymore now - but also rebased to nothing because it was already done by somebody else now" [dns] - 10https://gerrit.wikimedia.org/r/459835 (https://phabricator.wikimedia.org/T204056) (owner: 10Reedy) [07:02:18] (03Abandoned) 10Dzahn: Remove wikimedia.ee [dns] - 10https://gerrit.wikimedia.org/r/459835 (https://phabricator.wikimedia.org/T204056) (owner: 10Reedy) [07:10:24] 10Operations, 10MediaWiki-Maintenance-scripts, 10serviceops: Stop forcing RUNNER=php for foreachwiki/foreachwikiindblist - https://phabricator.wikimedia.org/T230110 (10Dzahn) Kind of, yea. But also we should revert setting it to php7.2 now to avoid hardcoding the version. 
[07:13:17] (03PS1) 10Dzahn: switch RUNNER in foreachwikiindblist back to just 'php' [puppet] - 10https://gerrit.wikimedia.org/r/534012 (https://phabricator.wikimedia.org/T230110) [07:15:14] (03CR) 10Nuria: [C: 03+1] analytics::refinery::job::data_purge.pp Add skip-trash to timers [puppet] - 10https://gerrit.wikimedia.org/r/533955 (https://phabricator.wikimedia.org/T229436) (owner: 10Mforns) [07:16:47] !log Change min_replicas to 6 on s1 for eqiad and codfw T231019 [07:18:56] (03CR) 10Muehlenhoff: [C: 04-1] "This will need some sync up with DC as TTBOMK they currently install the servers without the initial role attached. But there's also a few" [puppet] - 10https://gerrit.wikimedia.org/r/534005 (owner: 10Dzahn) [07:20:21] (03PS1) 10Marostegui: db-codfw.php: Re-organize s1 codfw candidate masters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534014 (https://phabricator.wikimedia.org/T230106) [07:20:56] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for mbsantos - https://phabricator.wikimedia.org/T227695 (10Nuria) Closing as turnilo is indeed sufficient to gather the info requested [07:20:58] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for mbsantos - https://phabricator.wikimedia.org/T227695 (10Nuria) 05Open→03Resolved [07:22:08] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [07:22:22] PROBLEM - HTTP availability for Varnish at eqsin on icinga1001 is CRITICAL: job=varnish-text site=eqsin https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [07:23:00] PROBLEM - HTTP 
availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [07:23:00] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [07:23:04] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [07:23:16] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [07:23:30] RECOVERY - HTTP availability for Varnish at eqsin on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [07:23:32] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [07:24:06] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [07:24:08] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [07:24:10] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [07:24:25] (03CR) 10Marostegui: [C: 03+2] db-codfw.php: Re-organize s1 codfw candidate masters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534014 (https://phabricator.wikimedia.org/T230106) (owner: 10Marostegui) [07:24:40] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [07:25:19] (03Merged) 10jenkins-bot: db-codfw.php: Re-organize s1 codfw candidate masters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534014 (https://phabricator.wikimedia.org/T230106) (owner: 10Marostegui) [07:26:23] (03CR) 10jenkins-bot: db-codfw.php: Re-organize s1 codfw candidate masters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534014 (https://phabricator.wikimedia.org/T230106) (owner: 10Marostegui) [07:26:29] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Reorganize s1 codfw future master/candidate T230106 (duration: 00m 49s) [07:28:44] 10Operations, 10Analytics, 10Analytics-Wikistats, 10Traffic, 10Performance-Team (Radar): Piwik JS isn't cached - https://phabricator.wikimedia.org/T230772 (10Nuria) @ema So I understand: caching pass needs to be removed (will do so) and since caching response includes an eTag. Is removing caching: pass s... [07:29:09] (03PS1) 10Muehlenhoff: Remove roentgenium/tureis [puppet] - 10https://gerrit.wikimedia.org/r/534017 (https://phabricator.wikimedia.org/T224559) [07:29:28] (03PS1) 10Marostegui: mariadb: Reorganize future master/candidate on s1 codfw [puppet] - 10https://gerrit.wikimedia.org/r/534018 (https://phabricator.wikimedia.org/T230106) [07:31:07] (03PS1) 10Muehlenhoff: Remove DNS entries for roentgenium/tureis [dns] - 10https://gerrit.wikimedia.org/r/534019 (https://phabricator.wikimedia.org/T224559) [07:31:37] (03CR) 10Dzahn: "i see, thanks Moritz. 
So that sounds like the alternative idea to add base::firewall in the default stanza is also not going to work becau" [puppet] - 10https://gerrit.wikimedia.org/r/534005 (owner: 10Dzahn) [07:33:38] (03CR) 10Muehlenhoff: [C: 04-1] "Adding that fail seems like a fine approach, we only need to make sure that proper roles for setting up a server with and without base::fi" [puppet] - 10https://gerrit.wikimedia.org/r/534005 (owner: 10Dzahn) [07:34:28] !log joal@deploy1001 Started deploy [analytics/refinery@4810dfa]: Regular weekly analytics deploy train [07:35:05] (03CR) 10Marostegui: [C: 03+2] mariadb: Reorganize future master/candidate on s1 codfw [puppet] - 10https://gerrit.wikimedia.org/r/534018 (https://phabricator.wikimedia.org/T230106) (owner: 10Marostegui) [07:38:36] !log Upgrade and reboot db2103 and db2112 to pick up binlog format change - T230106 [07:44:30] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 53.81 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [07:46:54] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler={proxy:fcgi://127.0.0.1:9000,proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&va [07:46:54] server&var-method=GET [07:47:22] did we just lose lots of app servers on codfw? 
[07:47:52] PROBLEM - High average GET latency for mw requests on appserver in codfw on icinga1001 is CRITICAL: cluster=appserver code=200 handler={proxy:fcgi://127.0.0.1:9000,proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluste [07:47:52] ethod=GET [07:47:55] yeah, that is what icinga shows on the [07:47:58] dashboard [07:48:13] is it network? [07:48:30] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [07:49:22] I am checking mw2274 and it apparently is ok [07:49:30] RECOVERY - High average GET latency for mw requests on appserver in codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [07:49:48] they all recovered indeed [07:49:55] i dont see that on icinga ? 
[07:50:02] yea [07:50:06] they just recovered [07:50:09] it recovered, but pybal was seeing multiple failures [07:50:55] many db errors too, so that would point to an app server or network issue [07:52:28] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 93.27 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [07:54:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1133 from wikitech', diff saved to https://phabricator.wikimedia.org/P9030 and previous config saved to /var/cache/conftool/dbconfig/20190903-075451-marostegui.json [07:55:18] nothing on the maint-announce calendar or inbox [08:00:13] (03PS2) 10Ema: ATS: log Cookie in labs too [puppet] - 10https://gerrit.wikimedia.org/r/533938 (https://phabricator.wikimedia.org/T227432) [08:01:54] (03CR) 10Ema: [C: 03+2] ATS: log Cookie in labs too [puppet] - 10https://gerrit.wikimedia.org/r/533938 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [08:02:15] !log joal@deploy1001 deploy aborted: Regular weekly analytics deploy train (duration: 27m 47s) [08:02:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:51] !log joal@deploy1001 Started deploy [analytics/refinery@4810dfa]: Regular weekly analytics deploy train - Second try [08:03:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:19] !log joal@deploy1001 Finished deploy [analytics/refinery@4810dfa]: Regular weekly analytics deploy train - Second try (duration: 00m 27s) [08:04:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:08] i dont see anything obvious for either appservers nor networking in codfw except this https://grafana.wikimedia.org/d/000000608/datacenter-overview?panelId=7&fullscreen&orgId=1&var-datasource=codfw%20prometheus%2Fops&var-cluster=All [08:07:33]
(03CR) 10Ema: [C: 03+1] hiera: Increase ATS SSL session cache capacity to 4M sessions [puppet] - 10https://gerrit.wikimedia.org/r/533994 (https://phabricator.wikimedia.org/T231849) (owner: 10Vgutierrez) [08:09:20] (03CR) 10Vgutierrez: [C: 03+2] hiera: Increase ATS SSL session cache capacity to 4M sessions [puppet] - 10https://gerrit.wikimedia.org/r/533994 (https://phabricator.wikimedia.org/T231849) (owner: 10Vgutierrez) [08:09:36] (03PS2) 10Vgutierrez: hiera: Increase ATS SSL session cache capacity to 4M sessions [puppet] - 10https://gerrit.wikimedia.org/r/533994 (https://phabricator.wikimedia.org/T231849) [08:09:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1133 with weight 0 T229657', diff saved to https://phabricator.wikimedia.org/P9031 and previous config saved to /var/cache/conftool/dbconfig/20190903-080958-marostegui.json [08:10:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:02] T229657: Switchover m5 primary master: db1073 to db1133: Tuesday 3rd Sept at 13:00 UTC - https://phabricator.wikimedia.org/T229657 [08:14:40] PROBLEM - Check systemd state on labweb1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:14:50] PROBLEM - Check systemd state on labweb1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:16:14] RECOVERY - Check systemd state on labweb1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:16:24] RECOVERY - Check systemd state on labweb1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:16:46] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:16:47] Some issues loading pages [08:16:52] Hmm [08:18:12] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:18:34] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [08:18:42] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:19:00] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 
https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [08:19:28] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:19:30] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [08:19:36] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:19:48] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:19:56] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:20:10] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [08:20:39] (03PS1) 10Ema: cache: allow caching piwik [puppet] - 10https://gerrit.wikimedia.org/r/534034 (https://phabricator.wikimedia.org/T230772) [08:21:04] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:21:06] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [08:21:12] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:21:41] !log purging maps / info.json from cache - T231842 [08:21:52] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:22:10] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [08:22:40] (03CR) 10Nikerabbit: [C: 04-1] Move ContentTranslation out of Beta in jvwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533172 (https://phabricator.wikimedia.org/T231207) (owner: 10KartikMistry) [08:23:08] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:23:20] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [08:24:32] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:24:43] gehel: Failed to log message to wiki. Somebody should check the error logs. [08:24:47] (03PS2) 10Giuseppe Lavagetto: restbase: convert to use profile::lvs::realserver [puppet] - 10https://gerrit.wikimedia.org/r/532667 [08:24:54] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [08:25:02] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:25:20] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [08:25:50] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:26:08] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:26:18] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:26:38] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:26:50] !log Add REPLICATION grant to wikiuser and wikiadmin on db1073 with replication enabled - T229657 [08:26:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:52] T229657: Switchover m5 primary master: db1073 to db1133: Tuesday 3rd Sept at 13:00 UTC - https://phabricator.wikimedia.org/T229657 [08:26:54] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [08:27:24] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:28:41] 10Operations, 10Analytics, 10Analytics-Wikistats, 10Traffic, and 2 others: Piwik JS isn't cached - https://phabricator.wikimedia.org/T230772 (10ema) >>! In T230772#5460265, @Nuria wrote: > @ema So I understand: caching pass needs to be removed Yes, and the equivalent change needs to be done for ATS too. I... 
[08:30:44] (03CR) 10Nuria: [C: 03+1] cache: allow caching piwik [puppet] - 10https://gerrit.wikimedia.org/r/534034 (https://phabricator.wikimedia.org/T230772) (owner: 10Ema) [08:31:11] 10Operations, 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats, and 3 others: Piwik JS isn't cached - https://phabricator.wikimedia.org/T230772 (10Nuria) [08:33:42] (03PS3) 10Giuseppe Lavagetto: restbase: convert to use profile::lvs::realserver [puppet] - 10https://gerrit.wikimedia.org/r/532667 [08:35:12] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 50.92 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [08:35:46] (03CR) 10Ema: [C: 03+2] ATS: cache responses to cookies [puppet] - 10https://gerrit.wikimedia.org/r/533530 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [08:35:54] (03PS2) 10Ema: ATS: cache responses to cookies [puppet] - 10https://gerrit.wikimedia.org/r/533530 (https://phabricator.wikimedia.org/T227432) [08:36:13] (03CR) 10Muehlenhoff: [C: 03+1] Add extra key for tstarling [puppet] - 10https://gerrit.wikimedia.org/r/533125 (owner: 10Tim Starling) [08:40:28] (03CR) 10Muehlenhoff: [C: 04-1] "These were already omitted in the initial version of this patch (" [puppet] - 10https://gerrit.wikimedia.org/r/532348 (owner: 10Vgutierrez) [08:41:06] (03PS2) 10Dzahn: remove wikiba.se microsite puppetization [puppet] - 10https://gerrit.wikimedia.org/r/532972 (https://phabricator.wikimedia.org/T99531) [08:42:06] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:42:23] (03PS3) 10KartikMistry: Move ContentTranslation out of Beta in jvwiki [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/533172 (https://phabricator.wikimedia.org/T231207) [08:42:26] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:43:14] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:43:42] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:44:02] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:44:50] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:47:11] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/18143/restbase1025.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/532667 (owner: 10Giuseppe Lavagetto) [08:47:24] (03PS4) 10Giuseppe Lavagetto: restbase: convert to use profile::lvs::realserver [puppet] - 10https://gerrit.wikimedia.org/r/532667 [08:47:43] 10Operations, 10serviceops: Update component/php72 to 7.2.22 - https://phabricator.wikimedia.org/T230024 (10MoritzMuehlenhoff) p:05Triage→03Normal [08:49:06] !log cp1075: pool ats-be with caching enabled T228629 [08:49:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:09] T228629: ATS Backends: Test live cache_text traffic - https://phabricator.wikimedia.org/T228629 [08:49:15] !log ema@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp1075.eqiad.wmnet,service=ats-be [08:49:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:28] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 54.85 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [08:53:14] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:53:26] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 
https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [08:54:22] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:54:48] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:54:50] !log cp1089: varnish-backend-restart due to mbox lag and fetch failures [08:54:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:02] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [08:55:08] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:55:24] (03PS1) 10Nuria: Adding caching headers for piwik javascript [puppet] - 10https://gerrit.wikimedia.org/r/534114 (https://phabricator.wikimedia.org/T230772) [08:55:26] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [08:55:56] 
RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:56:44] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:57:02] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [08:57:24] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 109.5 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [08:58:06] (03CR) 10Gilles: Adding caching headers for piwik javascript (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/534114 (https://phabricator.wikimedia.org/T230772) (owner: 10Nuria) [09:01:02] (03PS2) 10Nuria: Adding caching headers for piwik javascript [puppet] - 10https://gerrit.wikimedia.org/r/534114 (https://phabricator.wikimedia.org/T230772) [09:01:42] 10Operations, 10MobileFrontend, 10Traffic, 10Mobile: https://en.wikipedia.org/wiki/Heteromyidae shows the mobile version on desktop - https://phabricator.wikimedia.org/T231620 (10ema) p:05Triage→03Normal [09:02:24] 10Operations, 10Traffic: Unexpectedly received mobile version of an article while logged out - https://phabricator.wikimedia.org/T231504 (10ema) This should now be fixed. 
Please let me know if that's not the case! [09:02:33] 10Operations, 10MobileFrontend, 10Traffic, 10Mobile: https://en.wikipedia.org/wiki/Heteromyidae shows the mobile version on desktop - https://phabricator.wikimedia.org/T231620 (10ema) This should now be fixed. Please let me know if that's not the case! [09:02:38] 10Operations, 10ops-eqiad, 10DC-Ops: b3-eqiad pdu refresh (Tuesday 9/17 @11am UTC) - https://phabricator.wikimedia.org/T227539 (10akosiaris) [09:03:15] (03PS2) 10Ema: cache: allow caching piwik [puppet] - 10https://gerrit.wikimedia.org/r/534034 (https://phabricator.wikimedia.org/T230772) [09:03:27] !log reset kartotherian password -T231842 [09:03:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:34] anyone getting 503? [09:04:38] PROBLEM - HTTP availability for Varnish at eqsin on icinga1001 is CRITICAL: job=varnish-text site=eqsin https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [09:04:42] Shanmugamp7: yup [09:04:56] !log cp1085: varnish-backend-restart, mbox lag and fetch failures [09:04:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:44] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [09:05:54] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [09:06:03] (03CR) 10Ema: [C: 03+2] cache: allow caching piwik [puppet] - 
10https://gerrit.wikimedia.org/r/534034 (https://phabricator.wikimedia.org/T230772) (owner: 10Ema) [09:06:08] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [09:06:14] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [09:06:32] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [09:06:54] PROBLEM - HTTP availability for Varnish at codfw on icinga1001 is CRITICAL: job=varnish-text site=codfw https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [09:07:02] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [09:07:04] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [09:07:10] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [09:07:22] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [09:07:32] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [09:07:44] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [09:07:50] RECOVERY - HTTP availability for Varnish at eqsin on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [09:07:50] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [09:08:18] Shanmugamp7: things should look better now [09:08:32] RECOVERY - HTTP availability for Varnish at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [09:08:40] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [09:08:40] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [09:08:44] ema: ok, thanks [09:08:46] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [09:08:50] Shanmugamp7: thank you! [09:09:44] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [09:11:39] 10Operations, 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats, and 3 others: Piwik JS isn't cached - https://phabricator.wikimedia.org/T230772 (10Nuria) [09:11:57] (03PS2) 10Giuseppe Lavagetto: scb: convert to profile::lvs::realserver [puppet] - 10https://gerrit.wikimedia.org/r/532668 [09:16:03] 10Operations, 10Traffic: ATS-tls isn't enforcing the same list of curves as nginx during TLS handshake - https://phabricator.wikimedia.org/T231859 (10Vgutierrez) [09:16:17] !log Deploy refactor of Zuul pipelines which might mean that some repos/branches would miss jobs or have extra unwanted jobs. In such case please fill in a task against #continuous-integration-config [09:16:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:55] 10Operations, 10Traffic: ATS-tls isn't enforcing the same list of curves as nginx during TLS handshake - https://phabricator.wikimedia.org/T231859 (10Vgutierrez) p:05Triage→03Normal [09:21:30] (03PS3) 10Giuseppe Lavagetto: scb: convert to profile::lvs::realserver [puppet] - 10https://gerrit.wikimedia.org/r/532668 [09:26:48] (03PS1) 10Vgutierrez: Release 8.0.5-1wm5 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/534118 (https://phabricator.wikimedia.org/T231859) [09:27:34] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [cookbooks] - 10https://gerrit.wikimedia.org/r/530096 (https://phabricator.wikimedia.org/T225297) (owner: 10Elukey) [09:27:44] (03PS1) 10Dzahn: aptrepo: attempt to fix ListShellHook for envoy [puppet] - 10https://gerrit.wikimedia.org/r/534119 [09:27:53] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/18145/scb1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/532668 (owner: 10Giuseppe 
Lavagetto) [09:28:47] (03CR) 10Muehlenhoff: [C: 03+1] "These are just test hosts anyway." [puppet] - 10https://gerrit.wikimedia.org/r/531235 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [09:30:02] (03PS2) 10Muehlenhoff: Switch Stas to volunteer account [puppet] - 10https://gerrit.wikimedia.org/r/533859 [09:30:30] (03CR) 10Muehlenhoff: [C: 03+1] aptrepo: attempt to fix ListShellHook for envoy [puppet] - 10https://gerrit.wikimedia.org/r/534119 (owner: 10Dzahn) [09:30:37] (03CR) 10Gilles: [C: 03+1] Adding caching headers for piwik javascript [puppet] - 10https://gerrit.wikimedia.org/r/534114 (https://phabricator.wikimedia.org/T230772) (owner: 10Nuria) [09:30:58] (03CR) 10Dzahn: [C: 03+2] aptrepo: attempt to fix ListShellHook for envoy [puppet] - 10https://gerrit.wikimedia.org/r/534119 (owner: 10Dzahn) [09:31:13] (03PS2) 10Dzahn: aptrepo: attempt to fix ListShellHook for envoy [puppet] - 10https://gerrit.wikimedia.org/r/534119 [09:32:23] (03CR) 10Muehlenhoff: [C: 03+2] Switch Stas to volunteer account [puppet] - 10https://gerrit.wikimedia.org/r/533859 (owner: 10Muehlenhoff) [09:33:56] (03PS1) 10Vgutierrez: ATS: Configure a list of curves to be offered during the TLS handshake [puppet] - 10https://gerrit.wikimedia.org/r/534123 (https://phabricator.wikimedia.org/T231859) [09:34:17] (03PS3) 10Dzahn: aptrepo: attempt to fix ListShellHook for envoy [puppet] - 10https://gerrit.wikimedia.org/r/534119 [09:37:57] (03CR) 10Vgutierrez: "pcc is happy: https://puppet-compiler.wmflabs.org/compiler1001/18146/" [puppet] - 10https://gerrit.wikimedia.org/r/534123 (https://phabricator.wikimedia.org/T231859) (owner: 10Vgutierrez) [09:46:12] !log install1002 - import GPG key for getenvoy repo, importing envoy for jessie with reprepro update [09:46:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:30] (03PS2) 10Giuseppe Lavagetto: ores: convert to profile::lvs::realserver [puppet] - 10https://gerrit.wikimedia.org/r/532669 [09:46:51] !log 
moved uid=smalyshev from cn=wmf to cn=nda [09:46:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:16] 10Operations, 10Cassandra, 10Core Platform Team Workboards (Clinic Duty Team): Revisit default settings for c-foreach-restart - https://phabricator.wikimedia.org/T198787 (10mobrovac) a:05mobrovac→03None [09:48:29] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/18147/ores1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/532669 (owner: 10Giuseppe Lavagetto) [09:49:00] (03PS1) 10Dzahn: tlsproxy/envoy: fix package name for envoy on jessie [puppet] - 10https://gerrit.wikimedia.org/r/534125 [09:51:28] (03CR) 10jerkins-bot: [V: 04-1] tlsproxy/envoy: fix package name for envoy on jessie [puppet] - 10https://gerrit.wikimedia.org/r/534125 (owner: 10Dzahn) [09:53:21] (03CR) 10Ema: [C: 03+1] ATS: Configure a list of curves to be offered during the TLS handshake [puppet] - 10https://gerrit.wikimedia.org/r/534123 (https://phabricator.wikimedia.org/T231859) (owner: 10Vgutierrez) [09:53:52] (03CR) 10Ema: [C: 03+1] Release 8.0.5-1wm5 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/534118 (https://phabricator.wikimedia.org/T231859) (owner: 10Vgutierrez) [09:54:50] (03PS1) 10Alexandros Kosiaris: calico: Enabled felix prometheus endpoint [puppet] - 10https://gerrit.wikimedia.org/r/534126 [09:59:08] <_joe_> !log removing old lvs-related scripts from ores* [09:59:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:30] (03PS2) 10Dzahn: tlsproxy/envoy: fix package name for envoy on jessie [puppet] - 10https://gerrit.wikimedia.org/r/534125 [10:00:24] (03PS2) 10Giuseppe Lavagetto: proton: convert to profile::lvs::realserver [puppet] - 10https://gerrit.wikimedia.org/r/532670 [10:02:26] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/18149/proton1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/532670 
(owner: 10Giuseppe Lavagetto) [10:03:43] (03PS1) 10Dzahn: requesttracker: include envoy for TLS termination [puppet] - 10https://gerrit.wikimedia.org/r/534128 [10:03:54] (03CR) 10Dzahn: [C: 03+2] tlsproxy/envoy: fix package name for envoy on jessie [puppet] - 10https://gerrit.wikimedia.org/r/534125 (owner: 10Dzahn) [10:04:25] (03PS3) 10Dzahn: tlsproxy/envoy: fix package name for envoy on jessie [puppet] - 10https://gerrit.wikimedia.org/r/534125 [10:07:18] <_joe_> mutante: the puppetization on jessie is completely untested, don't hate me if nothing works :P [10:07:39] (03PS1) 10Dzahn: add discovery name for RT [dns] - 10https://gerrit.wikimedia.org/r/534129 [10:07:42] (03PS2) 10Giuseppe Lavagetto: openldap,labweb: convert to profile::lvs::realserver [puppet] - 10https://gerrit.wikimedia.org/r/532671 [10:07:48] _joe_: ok :) i am using the RT server to test it.. even though i want to replace that with stretch anyways [10:08:01] (03CR) 10jerkins-bot: [V: 04-1] add discovery name for RT [dns] - 10https://gerrit.wikimedia.org/r/534129 (owner: 10Dzahn) [10:08:14] <_joe_> mutante: before you add that discovery name, we need to change a few things on the puppet side though [10:08:43] <_joe_> oh just a cname heh [10:08:57] yea, i am using the CNAME method like for planet and people etc [10:09:07] copied that from the first example from e.ma [10:09:26] i would be happy to do that part later though [10:09:56] actually i want to delete ununpentium anyways. this was just to confirm the install works [10:10:57] real goals: replace ununpentium with rt1001 (stretch) and then replace rt with rt-static and stop running the Perl code [10:13:45] duh. that's a public IP as well.. right. 
[10:14:11] glad that dns-lint is smart nowadays [10:15:19] (03PS2) 10Dzahn: requesttracker: include envoy for TLS termination [puppet] - 10https://gerrit.wikimedia.org/r/534128 [10:15:48] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [10:15:58] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [10:16:10] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [10:16:14] PROBLEM - HTTP availability for Varnish at eqsin on icinga1001 is CRITICAL: job=varnish-text site=eqsin https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [10:16:16] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [10:16:43] (03CR) 10Dzahn: [C: 04-2] "ununpentium is in wikimedia.org" [dns] - 10https://gerrit.wikimedia.org/r/534129 (owner: 10Dzahn) [10:16:56] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is 
CRITICAL: 53.03 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [10:17:02] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [10:17:04] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [10:17:10] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [10:17:17] !log cp1083: varnish-backend-restart -- mbox lag, fetch failures [10:17:26] looks like before on https://grafana.wikimedia.org/d/000000479/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [10:17:34] ack 1083 [10:17:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:51] mutante: yup, 1083 this time [10:18:12] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [10:18:54] (03CR) 10Dzahn: [C: 03+2] requesttracker: include envoy for TLS termination [puppet] - 10https://gerrit.wikimedia.org/r/534128 (owner: 10Dzahn) 
[10:19:26] RECOVERY - HTTP availability for Varnish at eqsin on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [10:20:34] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [10:20:44] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [10:20:58] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [10:21:04] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [10:21:22] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [10:21:48] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [10:21:50] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [10:21:56] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [10:24:52] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 56.13 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [10:31:50] (03PS3) 10Giuseppe Lavagetto: openldap,labweb: convert to profile::lvs::realserver [puppet] - 10https://gerrit.wikimedia.org/r/532671 [10:36:18] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/18152/" [puppet] - 10https://gerrit.wikimedia.org/r/532671 (owner: 10Giuseppe Lavagetto) [10:40:44] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 72.81 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [10:44:38] (03PS1) 10Dzahn: add certificate for rt.discovery [puppet] - 10https://gerrit.wikimedia.org/r/534132 [10:44:53] (03PS1) 10Dzahn: add fake SSL key for rt.discovery [labs/private] - 10https://gerrit.wikimedia.org/r/534133 [10:47:37] (03CR) 10Dzahn: [V: 03+2 C: 03+2] add fake SSL key for rt.discovery [labs/private] - 10https://gerrit.wikimedia.org/r/534133 (owner: 10Dzahn) [10:48:26] (03CR) 10Dzahn: [C: 03+2] add certificate for rt.discovery [puppet] - 10https://gerrit.wikimedia.org/r/534132 (owner: 10Dzahn) [10:48:37] (03PS2) 10Dzahn: add certificate for rt.discovery [puppet] - 10https://gerrit.wikimedia.org/r/534132 [10:50:31] 10Operations, 10Traffic: Cannot download STL files due to network error - https://phabricator.wikimedia.org/T231422 (10ema) The issue happens due to varnish-frontend giving up the fetch from ATS because of lack of free space: ` -- ObjStatus 200 -- ObjReason OK -- ObjHeader Date: Tue, 03 
Sep... [10:53:13] 10Operations, 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats, and 3 others: Piwik JS isn't cached - https://phabricator.wikimedia.org/T230772 (10Nuria) I think what is left here is to restart apache for settings to take place [10:53:48] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) is CRITICAL: Test Get preview mobile HTML for test page returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [10:55:03] 10Operations, 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats, and 3 others: Piwik JS isn't cached - https://phabricator.wikimedia.org/T230772 (10Nuria) Restarted with : sudo apache2ctl restart [10:58:40] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [11:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190903T1100). [11:00:05] Zoranzoki21, Amir1, and raynor: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[11:00:13] o/ [11:01:11] o/ [11:02:00] I can SWAT I guess [11:02:42] (03CR) 10Ladsgroup: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533538 (https://phabricator.wikimedia.org/T231654) (owner: 10Zoranzoki21) [11:03:07] bah, the user is not around :/ [11:03:20] (03CR) 10Ladsgroup: "The user is not around" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533538 (https://phabricator.wikimedia.org/T231654) (owner: 10Zoranzoki21) [11:03:40] (03CR) 10Ladsgroup: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533882 (https://phabricator.wikimedia.org/T225055) (owner: 10Ladsgroup) [11:05:16] Amir1 - mine is almost a noop. I only need to verify that js config has changed [11:05:20] (03Merged) 10jenkins-bot: Enable WRITE_BOTH for items term store for wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533882 (https://phabricator.wikimedia.org/T225055) (owner: 10Ladsgroup) [11:05:55] raynor: I will merge yours quickly [11:06:24] kk, thx [11:06:53] (03PS2) 10Ladsgroup: Bump MobileWebUIActionsTracking sampling rate to 1 percent [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533930 (https://phabricator.wikimedia.org/T220016) (owner: 10Pmiazga) [11:07:04] (03CR) 10Ladsgroup: [C: 03+2] Bump MobileWebUIActionsTracking sampling rate to 1 percent [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533930 (https://phabricator.wikimedia.org/T220016) (owner: 10Pmiazga) [11:07:06] (03CR) 10jenkins-bot: Enable WRITE_BOTH for items term store for wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533882 (https://phabricator.wikimedia.org/T225055) (owner: 10Ladsgroup) [11:07:14] raynor: is it testable on mwdebug1001? 
[11:07:17] mwdebug1002 [11:07:27] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:533882|Enable WRITE_BOTH for items term store for wikidatawiki (T225055)]] (duration: 00m 55s) [11:07:35] Amir1, I just need to check mw.confg, that's all [11:07:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:40] T225055: Switch `tmpItemTermsMigrationStages` to MIGRATION_WRITE_BOTH - https://phabricator.wikimedia.org/T225055 [11:07:40] marostegui: This went live [11:08:06] once it get's deployed I'll see changes in grafana [11:08:12] raynor: cool [11:08:48] we went with super safe events sampling rate of 0.01%, and as you might expect, graph shows ~1 event from time to time [11:09:12] LOL [11:09:28] (03Merged) 10jenkins-bot: Bump MobileWebUIActionsTracking sampling rate to 1 percent [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533930 (https://phabricator.wikimedia.org/T220016) (owner: 10Pmiazga) [11:09:50] (03CR) 10jenkins-bot: Bump MobileWebUIActionsTracking sampling rate to 1 percent [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533930 (https://phabricator.wikimedia.org/T220016) (owner: 10Pmiazga) [11:10:15] raynor: going live [11:10:48] awesome, thx [11:10:59] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:533930|Bump MobileWebUIActionsTracking sampling rate to 1 percent (T220016)]] (duration: 00m 53s) [11:11:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:02] T220016: Create, and deploy working MobileWebUIActionsTracking schema - https://phabricator.wikimedia.org/T220016 [11:11:25] raynor: ^ [11:11:40] thx, checking [11:13:24] lovely cache ;) [11:14:36] PROBLEM - Check systemd state on ununpentium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:15:52] Hi, sorry for lating! 
I am in bus because I go to school [11:16:03] I saw to my patch for bswiki is voted with +2 [11:16:10] How it's going? [11:16:25] Zoranzoki21: I had to stop it because you weren't around. I can continue now [11:16:55] I don't have access to X-Wikimedia-Debug because I am on phone [11:17:00] raynor: mediawiki basically is a caching service with some functionalities around it [11:17:09] :) [11:17:09] Urbanecm usually does it when I am not around [11:17:22] Zoranzoki21: Is there a way I can test it? [11:17:54] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) is CRITICAL: Test retrieve a random article title returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [11:18:15] I think no.. [11:18:40] I changed name of wgMetaNamespaceTalk and after merge of patch you should run namespaceDupes as i Know [11:19:08] raynor: Amir1: are you aware of the above failure on mobileapps? [11:19:32] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [11:19:45] Zoranzoki21: If it's a namespace on bswki, I should be able to check it. [11:19:48] ah, it got fixed [11:20:09] jynus: no, I think it's an intermittent issue (btw. we should move away from scb :D) [11:20:11] Amir1: I think it is true.. 
It is Project_talk namespace as I know [11:20:11] jynus, Amir1: yup, I saw that, it's not something my patch could cause it [11:20:19] kubeternetes ftw [11:20:28] ok, np now [11:22:09] (03PS2) 10Ladsgroup: Fix wgMetaNamespaceTalk for bswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533538 (https://phabricator.wikimedia.org/T231654) (owner: 10Zoranzoki21) [11:22:25] (03CR) 10Ladsgroup: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533538 (https://phabricator.wikimedia.org/T231654) (owner: 10Zoranzoki21) [11:23:32] (03Merged) 10jenkins-bot: Fix wgMetaNamespaceTalk for bswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533538 (https://phabricator.wikimedia.org/T231654) (owner: 10Zoranzoki21) [11:23:47] (03CR) 10jenkins-bot: Fix wgMetaNamespaceTalk for bswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533538 (https://phabricator.wikimedia.org/T231654) (owner: 10Zoranzoki21) [11:24:05] Amir1, FYI: the wgWMEMobileWebUIActionsTracking config is still set to 0.0001 ;/ [11:24:18] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [11:24:44] raynor: oh oh oh no, I forgot to rebase [11:24:56] I'm sorry [11:25:01] deploying now [11:25:01] no worries [11:25:25] I thought it's just cache thing, ResourceLoader output is cached for ~5 mins if I remember it right [11:25:44] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:533930|Bump MobileWebUIActionsTracking sampling rate to 1 percent (T220016)]] (duration: 00m 52s) [11:25:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:46] T220016: Create, and deploy working MobileWebUIActionsTracking schema - 
https://phabricator.wikimedia.org/T220016 [11:25:54] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [11:26:22] raynor: sorry. Can you try again [11:26:44] ok, now it works, thx Amir1 [11:26:56] and don't worry, it's ok [11:27:27] Yes, each people can do something wrong noone is non-wrong [11:27:45] Amir1: How is going with my patch? [11:28:00] Zoranzoki21: being deployed [11:28:08] Amir1: Ok [11:28:33] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:533538|Fix wgMetaNamespaceTalk for bswiki (T231654)]] (duration: 00m 54s) [11:28:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:36] T231654: Update talk namespace for Bosnian in InitialiseSettings.php - https://phabricator.wikimedia.org/T231654 [11:29:13] !log ladsgroup@mwmaint1002:~$ mwscript namespaceDupes.php bswiki --fix (T231654) [11:29:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:24] !log EU SWAT is done [11:29:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:38] Done? :) [11:30:42] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [11:31:31] Amir1: Can I go? 
[11:31:42] Zoranzoki21: yup [11:31:45] Tnx [11:32:18] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [11:35:48] !log ladsgroup@mwmaint1002:~$ mwscript extensions/Wikibase/repo/maintenance/rebuildItemTerms.php --wiki=wikidatawiki --to-id 1000 --sleep 2 (T225056) [11:35:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:51] T225056: Run Item Terms Rebuild script - https://phabricator.wikimedia.org/T225056 [11:39:00] (03PS2) 10Giuseppe Lavagetto: role::lvs::realserver: remove from puppet [puppet] - 10https://gerrit.wikimedia.org/r/532673 [11:42:09] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "according to cumin, only one server that is currently down still has the resources removed here." [puppet] - 10https://gerrit.wikimedia.org/r/532673 (owner: 10Giuseppe Lavagetto) [11:47:07] jouncebot: next [11:47:07] In 0 hour(s) and 12 minute(s): Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190903T1200) [11:47:08] jouncebot: now [11:47:09] For the next 0 hour(s) and 12 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190903T1100) [11:48:02] !log Downtime m5 hosts T229657 [11:48:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:05] T229657: Switchover m5 primary master: db1073 to db1133: Tuesday 3rd Sept at 13:00 UTC - https://phabricator.wikimedia.org/T229657 [11:49:23] 10Operations, 10DBA, 10wikitech.wikimedia.org, 10Patch-For-Review, 10cloud-services-team (Kanban): Switchover m5 primary master: db1073 to db1133: Tuesday 3rd Sept at 13:00 UTC - https://phabricator.wikimedia.org/T229657 (10Marostegui) [11:55:31] !log Change topology on m5 and make everything replicate from db1133 - T229657 [11:55:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:35] T229657: 
Switchover m5 primary master: db1073 to db1133: Tuesday 3rd Sept at 13:00 UTC - https://phabricator.wikimedia.org/T229657 [11:58:55] 10Operations, 10DBA, 10wikitech.wikimedia.org, 10Patch-For-Review, 10cloud-services-team (Kanban): Switchover m5 primary master: db1073 to db1133: Tuesday 3rd Sept at 13:00 UTC - https://phabricator.wikimedia.org/T229657 (10Marostegui) [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190903T1200) [12:00:23] (03PS1) 10Marostegui: db-eqiad.php: Promote db1133 as wikitech master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534144 (https://phabricator.wikimedia.org/T229657) [12:02:04] (03CR) 10Marostegui: mariadb: Promote db1133 to m5 master [puppet] - 10https://gerrit.wikimedia.org/r/529331 (https://phabricator.wikimedia.org/T229657) (owner: 10Marostegui) [12:02:11] (03PS7) 10Marostegui: mariadb: Promote db1133 to m5 master [puppet] - 10https://gerrit.wikimedia.org/r/529331 (https://phabricator.wikimedia.org/T229657) [12:02:17] (03PS2) 10Hashar: Remove role::ci::slave::webperformance [puppet] - 10https://gerrit.wikimedia.org/r/531420 (https://phabricator.wikimedia.org/T225416) [12:02:21] !log Disable puppet on db1073 and db1133 - T229657 [12:02:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:24] T229657: Switchover m5 primary master: db1073 to db1133: Tuesday 3rd Sept at 13:00 UTC - https://phabricator.wikimedia.org/T229657 [12:03:57] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db1133 to m5 master [puppet] - 10https://gerrit.wikimedia.org/r/529331 (https://phabricator.wikimedia.org/T229657) (owner: 10Marostegui) [12:06:43] (03CR) 10Marostegui: wmnet: Promote db1133 to m5 master [dns] - 10https://gerrit.wikimedia.org/r/529333 (https://phabricator.wikimedia.org/T229657) (owner: 10Marostegui) [12:06:50] (03CR) 10Hashar: "Some easy cleanup :]" [puppet] - 10https://gerrit.wikimedia.org/r/531420 
(https://phabricator.wikimedia.org/T225416) (owner: 10Hashar) [12:07:33] 10Operations, 10DBA, 10wikitech.wikimedia.org, 10Patch-For-Review, 10cloud-services-team (Kanban): Switchover m5 primary master: db1073 to db1133: Tuesday 3rd Sept at 13:00 UTC - https://phabricator.wikimedia.org/T229657 (10Marostegui) [12:07:58] 10Operations, 10DBA, 10wikitech.wikimedia.org, 10Patch-For-Review, 10cloud-services-team (Kanban): Switchover m5 primary master: db1073 to db1133: Tuesday 3rd Sept at 13:00 UTC - https://phabricator.wikimedia.org/T229657 (10Marostegui) @JHedden all the PRE steps are done. [12:11:47] jouncebot: now [12:11:48] For the next 0 hour(s) and 48 minute(s): Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190903T1200) [12:12:15] !sal [12:12:16] https://wikitech.wikimedia.org/wiki/Server_Admin_Log https://tools.wmflabs.org/sal/production See it and you will know all you need. [12:12:27] oh swat is done. thank you :] [12:12:41] I am going to promote 1.34.0-wmf.20 to rest of wikis since there are no more blocker [12:15:30] (03PS1) 10Hashar: all wikis to 1.34.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534147 [12:15:32] (03CR) 10Hashar: [C: 03+2] all wikis to 1.34.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534147 (owner: 10Hashar) [12:18:05] (03Merged) 10jenkins-bot: all wikis to 1.34.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534147 (owner: 10Hashar) [12:19:48] !log hashar@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.34.0-wmf.20 [12:19:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:50] (03CR) 10jenkins-bot: all wikis to 1.34.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534147 (owner: 10Hashar) [12:20:18] PROBLEM - PHP opcache health on mwdebug2001 is CRITICAL: CRITICAL: opcache free space is below 50 MB 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:20:34] PROBLEM - PHP opcache health on mwdebug2002 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:21:54] :-\ [12:22:04] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:fcgi://127.0.0.1:9000 method=GET https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver [12:22:31] and the high latency is I guess just the bytecode cache being primed [12:23:40] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [12:26:26] hashar: not sure why only codfw though [12:26:34] just noise imho [12:26:50] I would guess the icinga check has hit a mw server that hadn't been hit previously [12:26:54] and ends up timing out [12:27:08] == noise imho :] [12:27:22] 10Operations, 10Maps, 10Product-Infrastructure-Team-Backlog: Lake Huron missing due to apparent OSM vandalism - https://phabricator.wikimedia.org/T231691 (10Haros) >>! In T231691#5457015, @Pikne wrote: > @Gehel, there's also a lake in Norway waiting for monthly automatic update to reappear in smaller zoom le... 
[12:27:28] cutting branches [12:27:38] the metric above isn't checked directly by icinga, it is extracted from logs of all appservers, but indeed that could be a minority of appservers influencing the average [12:27:57] checked by icinga on a single host that is [12:29:26] !log Cutting wmf/1.34.0-wmf.21 # T220746 [12:29:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:29] T220746: 1.34.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T220746 [12:33:26] PROBLEM - SSH access on cobalt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Gerrit [12:34:48] cobalt that is gerrit :-\ [12:34:57] apparently due to the branch cut [12:35:02] hashar: gerrit seems to be down for me indeed [12:35:03] that train is going to take a while ;-\\ [12:35:10] ah just slow [12:35:12] it is back [12:35:35] I just refreshed and it is back for me too [12:36:28] RECOVERY - SSH access on cobalt is OK: SSH OK - GerritCodeReview_2.15.14-16-g855b179b5f (SSHD-CORE-1.6.0) (protocol 2.0) https://wikitech.wikimedia.org/wiki/Gerrit [12:37:25] we need moar T226240 [12:37:26] T226240: Create mirror of Gerrit repositories for consumption by various tools - https://phabricator.wikimedia.org/T226240 [12:38:04] I think the new gerrit readonly replica is being used by several things already [12:38:18] also Soon there will be a newer, stronger machine as the primary [12:38:22] no idea what might have happened [12:38:27] yeah, that is why I said we need it moar :-D [12:38:41] last time I checked the bulk of the traffic got moved there [12:38:50] this time, I don't know what happened :-\ [12:39:00] !log upgrading mwdebug2001 to PHP 7.2.22 [12:39:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:03] and yes, I know it is not simple, and that partition tolerance is an issue [12:39:36] 
https://grafana.wikimedia.org/d/000000377/host-overview?refresh=5m&orgId=1&var-server=cobalt&var-datasource=eqiad%20prometheus%2Fops&var-cluster=misc&from=now-30m&to=now [12:40:03] ~1.5 times more CPU usage and load is generally increased [12:40:08] I see nothing in the logs however [12:40:36] I am cutting the wmf branches on repos so that takes a bit of cycles [12:40:55] but that alone does not explain the load spike :-\ [12:41:02] RECOVERY - PHP opcache health on mwdebug2001 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:41:48] yeah, deployment operations should only be trivial operations AFAIK from cobalt point of view [12:43:21] looks like gc times are on their way to the sky, from https://grafana.wikimedia.org/d/Bw2mQ3iWz/gerrit-javamelody?orgId=1&from=1567510996784&to=1567514596784&panelId=14&fullscreen&var-Application=&var-Window=30m [12:43:52] https://grafana.wikimedia.org/d/Bw2mQ3iWz/gerrit-javamelody?orgId=1&panelId=14&fullscreen&from=1567493022605&to=1567514622605&var-Application=&var-Window=30m [12:44:03] there's an increase in network traffic as well. It does seem periodic, but somehow between 12:25 and 12:32 the machine was receiving 6-7MB/s of traffic (which it normally doesn't [12:44:06] yeah, that. This happens during branch cut recently [12:44:24] interesting [12:44:25] gc thrashes and lots of GC pause is indistinguishable from the service being down :( [12:45:32] evidently a lot of memory gets allocated for adding a ref; although that seems like a recent development. I've noticed it for the past handful of train branch cuts. [12:46:39] (03CR) 10Jhedden: [C: 03+1] wmnet: Promote db1133 to m5 master [dns] - 10https://gerrit.wikimedia.org/r/529333 (https://phabricator.wikimedia.org/T229657) (owner: 10Marostegui) [12:46:45] any ideas? aside from hw upgrade? Would physically separate large/critical repos help? [12:47:26] or it is more of a configuration issue than load? 
[12:47:53] (03PS3) 10Nuria: Adding caching headers for piwik javascript [puppet] - 10https://gerrit.wikimedia.org/r/534114 (https://phabricator.wikimedia.org/T230772) [12:48:56] I think it may be due more to configuration since this is a recent development. My theory recently was that we had too many old refs laying around and I did a bunch cleanup, but that didn't solve the issue. [12:49:35] jeh: morning! ready to start in 10 minutes? [12:50:13] marostegui: yep, will shut down the OpenStack scheduler just before we start [12:50:25] jeh: excellent, I will confirm with you before starting [12:50:45] maybe sending a reminder on cloud? [12:50:51] (IRC) [12:51:05] jynus: good idea, I'll do that [12:51:25] to avoid mass-reports [12:52:03] as it may affect temporarolly striker and other tools [12:52:44] !log uploaded PHP 7.2.22 to component/php72 T230024 [12:52:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:47] T230024: Update component/php72 to 7.2.22 - https://phabricator.wikimedia.org/T230024 [12:54:52] (03CR) 10Marostegui: [C: 03+2] wmnet: Promote db1133 to m5 master [dns] - 10https://gerrit.wikimedia.org/r/529333 (https://phabricator.wikimedia.org/T229657) (owner: 10Marostegui) [12:55:18] 10Operations, 10Maps, 10Product-Infrastructure-Team-Backlog: Lake Huron missing due to apparent OSM vandalism - https://phabricator.wikimedia.org/T231691 (10Gehel) >>! In T231691#5460894, @Haros wrote: > We really need a way to trigger a refresh without creating a new issue. I can do so if that is necessary,... [12:56:19] (03CR) 10Krinkle: "I'm confused - trigger_error cannot trigger the fatal error page, it should emit a syslog warning only. Try an uncaught Exception?" [puppet] - 10https://gerrit.wikimedia.org/r/511078 (https://phabricator.wikimedia.org/T113114) (owner: 10Ladsgroup) [12:56:20] is the m5 master failover at the same time as the train? 
[12:56:31] Lucas_WMDE: should not interfere [12:56:36] ok [12:56:46] Lucas_WMDE: only affects wikitech [12:57:33] cdanis: ping [12:57:38] o/ [12:57:43] I've been watching :) [12:57:43] o/ [12:57:49] thanks for being around! [12:58:46] The main test I want to check is if wikitech becomes indeed read-only after running dbctl [12:59:00] sure [13:00:04] hashar: How many deployers does it take to do MediaWiki train - European version deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190903T1300). [13:00:04] marostegui, jeh, and jynus: Dear deployers, time to do the m5 database master failover deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190903T1300). [13:00:07] jeh: let me know when you are ready and I can start [13:00:44] marostegui: all set, you can begin [13:00:47] ok [13:00:48] starting [13:00:49] !log Failover m5 from db1073 to db1133 - T229657 [13:00:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:52] T229657: Switchover m5 primary master: db1073 to db1133: Tuesday 3rd Sept at 13:00 UTC - https://phabricator.wikimedia.org/T229657 [13:01:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set wikitech as read-only for maintenance T229657', diff saved to https://phabricator.wikimedia.org/P9033 and previous config saved to /var/cache/conftool/dbconfig/20190903-130113-marostegui.json [13:01:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:26] I can still edit [13:01:29] cdanis: dbctl failed with a warning [13:01:39] still can edit [13:01:50] yep, wikitech is not read-only [13:02:02] what did dbctl output? 
[13:02:24] https://phabricator.wikimedia.org/P9034 [13:02:30] PROBLEM - DPKG on mwdebug2002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [13:02:50] omg json schema [13:03:00] cdanis: this is what I ran: dbctl --scope eqiad section wikitech ro "Maintenance on wikitech T229657 " && dbctl config commit -m "Set wikitech as read-only for maintenance T229657" [13:03:22] yep all looks good [13:03:30] marostegui: labswiki not wikitech [13:03:36] RECOVERY - PHP opcache health on mwdebug2002 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:03:41] the section is called wikitech, Reedy [13:03:46] the wiki is called labswiki [13:03:53] Reedy: but the section is called wikitech [13:04:06] RECOVERY - DPKG on mwdebug2002 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [13:04:25] cdanis: dbctl config diff still shows uncommitted stuff (as it failed) [13:05:01] marostegui: indeed, the json schema as written doesn't allow wikitech to be present in readOnlyBySection 🤦 [13:05:05] I am pushing a patch now [13:05:12] cdanis: great - thanks! :) [13:05:22] we have time, our maintenance window was 30 minutes just in case!
[13:05:24] (03PS1) 10CDanis: dbctl: schema: allow wikitech readonly [puppet] - 10https://gerrit.wikimedia.org/r/534150 [13:05:59] we could also proceed without it if we were in a hurry, read only is detected also by mysql ro [13:06:13] but it would be less "smooth" [13:06:23] yeah, let's try to merge cdanis patch and run the commit [13:06:26] +1 [13:06:27] (03CR) 10CDanis: [C: 03+2] dbctl: schema: allow wikitech readonly [puppet] - 10https://gerrit.wikimedia.org/r/534150 (owner: 10CDanis) [13:06:34] (03CR) 10CDanis: [V: 03+2 C: 03+2] dbctl: schema: allow wikitech readonly [puppet] - 10https://gerrit.wikimedia.org/r/534150 (owner: 10CDanis) [13:06:48] I see [13:07:02] cdanis: that is a mistake we have made many times: ignoring wikitech [13:07:07] indeed [13:07:18] and that's why we have a task to move it to s5 or something \o/ :) [13:07:19] as it is a relatively new cluster (it didn't use to be part of the main installation) [13:07:26] cdanis: let me know when I can try the commit again [13:07:40] meh, gerrit is very slow right now even on fetch operations [13:07:46] yeah [13:07:50] (03PS1) 10Odder: Add high-density logos for Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534151 (https://phabricator.wikimedia.org/T230120) [13:08:18] ok marostegui give the commit a try again [13:08:21] ok [13:08:26] trying to go read-only again [13:08:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set wikitech as read-only for maintenance T229657', diff saved to https://phabricator.wikimedia.org/P9035 and previous config saved to /var/cache/conftool/dbconfig/20190903-130839-marostegui.json [13:08:42] it went thru fine now, let's check [13:08:42] marostegui@cumin1001: Failed to log message to wiki. Somebody should check the error logs.
[13:08:43] T229657: Switchover m5 primary master: db1073 to db1133: Tuesday 3rd Sept at 13:00 UTC - https://phabricator.wikimedia.org/T229657 [13:08:53] read only works for me [13:08:57] Warning: The database has been locked for maintenance, so you will not be able to save your edits right now. Yo [13:09:01] confirm^ [13:09:02] ok, proceeding [13:09:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db1133 to wikitech master T229657', diff saved to https://phabricator.wikimedia.org/P9036 and previous config saved to /var/cache/conftool/dbconfig/20190903-130937-marostegui.json [13:09:41] marostegui@cumin1001: Failed to log message to wiki. Somebody should check the error logs. [13:09:53] !log upgrading remaining mwdebug servers to PHP 7.2.22 T230024 [13:09:54] heh @ stashbot [13:09:56] moritzm: Failed to log message to wiki. Somebody should check the error logs. [13:09:57] T230024: Update component/php72 to 7.2.22 - https://phabricator.wikimedia.org/T230024 [13:10:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set wikitech back to RW after maintenance T229657', diff saved to https://phabricator.wikimedia.org/P9037 and previous config saved to /var/cache/conftool/dbconfig/20190903-131000-marostegui.json [13:10:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:06] wikitech should be writable again [13:10:12] jeh: changing DNS now [13:10:14] stashbot seems to confirm :) [13:10:15] See https://wikitech.wikimedia.org/wiki/Tool:Stashbot for help.
[13:10:17] ah, normal that log fails if wikitech in read only :-) [13:10:31] I can edit fine on wikitech [13:10:33] loop dependency [13:10:42] yeah, I mean at the time [13:11:02] marostegui: I confirm I can too [13:11:06] jeh: DNS change went thru, TTL is 1M [13:11:20] marostegui: OK, I'll keep my eye on the clients [13:11:23] going to reload haproxy on dbproxy1005 (which is not used) [13:11:23] 4 mediawiki errors [13:11:44] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler={proxy:fcgi://127.0.0.1:9000,proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluste [13:11:44] ethod=GET [13:11:47] !log Reload haproxy on dbproxy1005 T229657 [13:11:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:54] al regarding topology, nothing not expected [13:12:08] no ongoing mw errors [13:12:25] jynus: can you check tendril and zarcillo for me? db1133 should be the new master [13:12:31] doing [13:12:35] was next on my list [13:12:36] thanks [13:12:39] dbproxy1005 reloaded [13:12:49] tendril looks good [13:13:12] 10Operations, 10DBA, 10wikitech.wikimedia.org, 10Patch-For-Review, 10cloud-services-team (Kanban): Switchover m5 primary master: db1073 to db1133: Tuesday 3rd Sept at 13:00 UTC - https://phabricator.wikimedia.org/T229657 (10Marostegui) [13:13:12] jeh: how are things looking from your end? [13:13:28] and so does zarillo, marostegui [13:13:34] !log Re-enable puppet on db1073 and db1133 T229657 [13:13:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:37] jynus: great - thanks! 
[13:13:53] also no load issues- although none was expected [13:14:02] (03PS1) 10Odder: Add high-density logos for Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534152 (https://phabricator.wikimedia.org/T230120) [13:14:14] oops, wikibugs was kicked? [13:14:19] marostegui: DNS is changed, services look good as they're coming back up [13:14:25] great! [13:14:36] 10Operations, 10DBA, 10wikitech.wikimedia.org, 10Patch-For-Review, 10cloud-services-team (Kanban): Switchover m5 primary master: db1073 to db1133: Tuesday 3rd Sept at 13:00 UTC - https://phabricator.wikimedia.org/T229657 (10Marostegui) [13:14:39] activity on old master? [13:14:39] Lucas_WMDE: it does that occasionally, it will reconnect [13:14:45] (that was a question) [13:14:53] jynus:nope [13:14:57] cool [13:14:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1073 from wikitech T229657', diff saved to https://phabricator.wikimedia.org/P9038 and previous config saved to /var/cache/conftool/dbconfig/20190903-131456-marostegui.json [13:15:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:00] T229657: Switchover m5 primary master: db1073 to db1133: Tuesday 3rd Sept at 13:00 UTC - https://phabricator.wikimedia.org/T229657 [13:15:06] 10Operations, 10Traffic: Unexpectedly received mobile version of an article while logged out - https://phabricator.wikimedia.org/T231504 (10Mholloway) 05Open→03Resolved Sounds like this can be resolved, then. I can no longer reproduce the issue, but I'll reopen if I see any further cases. Thanks! [13:15:15] 10Operations, 10DBA, 10wikitech.wikimedia.org, 10Patch-For-Review, 10cloud-services-team (Kanban): Switchover m5 primary master: db1073 to db1133: Tuesday 3rd Sept at 13:00 UTC - https://phabricator.wikimedia.org/T229657 (10Marostegui) [13:15:32] Lucas_WMDE: can you think of any relation between wikibugs and wikitech? 
[13:15:53] no idea [13:15:57] (03CR) 10jerkins-bot: [V: 04-1] Add high-density logos for Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534152 (https://phabricator.wikimedia.org/T230120) (owner: 10Odder) [13:16:02] !log Gerrit has some random times out from time to time (no reason) [13:16:03] I didn’t know it would reconnect by itself, in that case probably no problem [13:16:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:12] the excess flood on wikibugs is usually just it posts too many updates too quickly [13:16:12] maybe some tools could have temporary issues? [13:17:00] wikibugs is the one that posts phab updates? [13:17:40] yeah [13:18:05] any user reports of issues on irc? I see nothing on phab, although it normally takes some time [13:18:35] nothing strange on logs [13:18:37] jeh: everything looking good? [13:19:04] there's a lot to check, but so far so good [13:19:19] yeah, take your time [13:19:21] :-D [13:19:52] the only worrying issue is some exceptions related to SDC/wikibase [13:20:09] but those happened for a long time before the switch [13:20:18] jeh: good, let us know if we can help [13:20:40] I will merge the MW config patch once the train is finished [13:20:43] No rush for that one [13:20:44] https://tools.wmflabs.org/admin/tools seems to work [13:20:50] Apart from that everything looks good from this end [13:20:54] but I don't have admin privileges there [13:21:40] 10Operations, 10DBA, 10wikitech.wikimedia.org, 10Patch-For-Review, 10cloud-services-team (Kanban): Switchover m5 primary master: db1073 to db1133: Tuesday 3rd Sept at 13:00 UTC - https://phabricator.wikimedia.org/T229657 (10Marostegui) This was done successfully. wikitech read only start: 13:08:40 wikit...
[13:21:53] (03PS1) 10CDanis: dbctl: schema: allow wikitech in readOnlyBySection [software/conftool] - 10https://gerrit.wikimedia.org/r/534153 [13:21:57] not sure I understand the difference between https://tools.wmflabs.org/admin/tools and https://toolsadmin.wikimedia.org but both seem to work [13:22:05] cdanis: thanks for the quick patch :) [13:22:13] Glad we had you around! [13:22:26] thanks cdanis [13:22:31] also marostegui, great job as usual [13:22:38] yeah no, sorry for the trouble [13:22:56] I think _joe_ wrote that schema originally so clearly it’s his fault ;) [13:22:56] no trouble at all :) [13:23:42] <_joe_> I don't think we had wikitech in the readonlybysection back in the day :P [13:23:47] cdanis: we've had issues with wikitech for years, because it is a wiki but it is on a misc section rather than on s1-s8, that's why we have https://phabricator.wikimedia.org/T167973 [13:23:51] <_joe_> it was treated completely separately [13:24:25] Hi everyone, any ideas why https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/534152/ failed? 
[13:24:47] <_joe_> odder: I thinkthis is the wrong channel to ask questions about CI [13:25:01] <_joe_> #wikimedia-releng is probably a better place [13:25:11] !log 1.34.0-wmf.21 cut [13:25:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:12] FWIW with a recent change I made to dbctl (not released yet) it would have become obvious that it would have failed validation before config commit [13:25:14] jeh: I am going to go to a meeting, but please once you are fully happy from your side, comment on the task and I will take are of the pending thing on the task (merge MW) after my meeting, as it is not urgent and the train is running [13:25:15] Never knew this existed until just now :-P [13:25:16] ccccccljuuntjnbundvnkeefhtrcfhcgfchribklhnjj [13:25:27] ffs [13:25:28] marostegui: will do, thanks [13:25:29] Reedy: get your cat off the keyboard [13:25:31] <_joe_> Reedy: I want a yubykey too [13:25:51] _joe_: OIT will send you one, and then I’ll help you put your ssh key on it [13:26:04] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:fcgi://127.0.0.1:9000 method=GET https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method= [13:26:08] <_joe_> marostegui: is the migration done? [13:26:09] !log Gerrit should be fine again, apparently was due to the wmf branch cut taking too much resources (sic) - T231872 filled to investigate [13:26:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:11] T231872: Gerrit GC thrashing during branch cut - https://phabricator.wikimedia.org/T231872 [13:26:13] Or just buy one and reimburse it... Because they don't get any sort of discount [13:26:22] _joe_: which migration? the failover? 
[13:26:42] <_joe_> yes [13:26:47] _joe_: yep [13:26:50] Btw there’s something going on with the app servers, elevated latency and lots more mcrouter traffic than usual [13:26:53] I can take over when you go to meeting [13:26:58] yeah, cdanis saw that [13:27:06] <_joe_> cdanis: same pattern we saw yesterday [13:27:19] <_joe_> it grew twice since this morning [13:27:20] but this correlates to the db change [13:27:40] <_joe_> what db change? [13:27:53] wikitech db master change [13:28:03] <_joe_> well how can that impact the application servers? [13:28:14] <_joe_> that do not connect to that database at all? [13:28:23] ? [13:28:35] dbctl operation is done [13:28:46] aka confctl [13:28:50] <_joe_> so? [13:29:06] I am just correlating timestamps [13:29:28] if X happens at the same time as Y, maybe (not sure) they could be related [13:29:41] <_joe_> ok, this definitely can't be [13:30:09] <_joe_> so this is only happening on appservers AIUI [13:30:31] <_joe_> we saw this pattern yesterday as well [13:30:47] <_joe_> looks like some objects in APC are suddenly invalid or something [13:31:23] there was also a slight HTTP traffic increase about 10 minutes before the elevated latency [13:32:05] <_joe_> the only real way to track this down further would be to have latency data aggregated by endpoint and wiki [13:32:10] <_joe_> things we don't do [13:32:42] <_joe_> I think we should open a ticket requiring further investigation / instrumentation [13:32:50] <_joe_> this happened yesterday afternoon as well [13:33:20] <_joe_> it would be interesting to look at profiling data before / after, to see what parts of the code are hotter [13:33:36] <_joe_> we have the flamegraphs for that, I'll look later [13:35:38] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:fcgi://127.0.0.1:9000 method=GET https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method= [13:36:33] (03PS1) 10Hashar: Group0 to 1.34.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534154 (https://phabricator.wikimedia.org/T220746) [13:37:12] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [13:37:53] (03PS4) 10Ottomata: Adding caching headers for piwik javascript [puppet] - 10https://gerrit.wikimedia.org/r/534114 (https://phabricator.wikimedia.org/T230772) (owner: 10Nuria) [13:38:22] !log hashar@deploy1001 Started scap: testwiki to 1.34.0-wmf.21 and rebuild l10n cache - T220746 [13:38:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:25] T220746: 1.34.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T220746 [13:38:38] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Overall pretty good chart, thanks for running the benchmark. I 've left some minor comments around." 
(035 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/526679 (https://phabricator.wikimedia.org/T229287) (owner: 10MSantos) [13:38:59] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Adding caching headers for piwik javascript [puppet] - 10https://gerrit.wikimedia.org/r/534114 (https://phabricator.wikimedia.org/T230772) (owner: 10Nuria) [13:41:43] 10Operations, 10DBA, 10wikitech.wikimedia.org, 10Patch-For-Review, 10cloud-services-team (Kanban): Switchover m5 primary master: db1073 to db1133: Tuesday 3rd Sept at 13:00 UTC - https://phabricator.wikimedia.org/T229657 (10JHedden) Cloud VPS OpenStack has been fully switched over and all services are ba... [13:42:00] RECOVERY - PHP opcache health on mwdebug1002 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:42:30] (03PS2) 10CDanis: dbctl: schema: allow wikitech in readOnlyBySection [software/conftool] - 10https://gerrit.wikimedia.org/r/534153 [13:44:03] !log joal@deploy1001 Started deploy [analytics/refinery@8b17711]: Fixes for regualr analytics deploy [13:44:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:11] 10Operations, 10Commons, 10Traffic: Downloading the original SVG of a file on Commons serves a truncated stream - https://phabricator.wikimedia.org/T231753 (10ema) p:05Triage→03Normal [13:47:47] 10Operations, 10Traffic: Cannot download STL files due to network error - https://phabricator.wikimedia.org/T231422 (10ema) [13:47:51] 10Operations, 10Commons, 10Traffic: Downloading the original SVG of a file on Commons serves a truncated stream - https://phabricator.wikimedia.org/T231753 (10ema) [13:51:43] 10Operations: Conffile handling for PHP 7.2 packages - https://phabricator.wikimedia.org/T231881 (10MoritzMuehlenhoff) [13:52:06] 10Operations: Conffile handling for PHP 7.2 packages - https://phabricator.wikimedia.org/T231881 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [13:54:53] (03PS1) 10Ema: VCL: only 
cache responses with explicit CL on upload [puppet] - 10https://gerrit.wikimedia.org/r/534156 (https://phabricator.wikimedia.org/T231422) [13:55:55] (03CR) 10Gilles: [C: 03+1] VCL: only cache responses with explicit CL on upload [puppet] - 10https://gerrit.wikimedia.org/r/534156 (https://phabricator.wikimedia.org/T231422) (owner: 10Ema) [13:56:02] PROBLEM - PHP opcache health on mwdebug1001 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:56:57] (03CR) 10Ema: [C: 03+2] VCL: only cache responses with explicit CL on upload [puppet] - 10https://gerrit.wikimedia.org/r/534156 (https://phabricator.wikimedia.org/T231422) (owner: 10Ema) [13:57:16] 10Operations, 10DBA, 10wikitech.wikimedia.org, 10Patch-For-Review, 10cloud-services-team (Kanban): Switchover m5 primary master: db1073 to db1133: Tuesday 3rd Sept at 13:00 UTC - https://phabricator.wikimedia.org/T229657 (10Marostegui) [13:57:25] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Promote db1133 as wikitech master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534144 (https://phabricator.wikimedia.org/T229657) (owner: 10Marostegui) [13:58:16] hashar: ^ I have merged that (I thought the train finished), I won't deploy anyways (it is a noop change) [13:58:28] (03Merged) 10jenkins-bot: db-eqiad.php: Promote db1133 as wikitech master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534144 (https://phabricator.wikimedia.org/T229657) (owner: 10Marostegui) [13:58:30] marostegui: dont worry :) [13:58:39] marostegui: I am done with the mediawiki-config changes for now [13:58:45] (03CR) 10jenkins-bot: db-eqiad.php: Promote db1133 as wikitech master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534144 (https://phabricator.wikimedia.org/T229657) (owner: 10Marostegui) [13:59:00] the whole thing is on hold pending for some canary and I don't even know which one hehe [13:59:20] hashar: ah, ok, what should I do with 
wikiversions.json file on /srv/mediawiki-staging ? [13:59:56] 10Operations, 10Maps, 10Product-Infrastructure-Team-Backlog: Lake Huron missing due to apparent OSM vandalism - https://phabricator.wikimedia.org/T231691 (10MusikAnimal) 05Open→03Resolved a:03MusikAnimal >>! In T231691#5456906, @Gehel wrote: > As far as I can tell, the issue is now resolved. > > @Musi... [14:00:32] marostegui: I guess you can stash it rebase, reapply [14:00:40] git stash; git remote update; git rebase [14:00:42] git stash apply [14:00:48] or just dish it out [14:00:53] and I will resync wikiversion.json later on [14:01:10] (03PS2) 10MSantos: maps: cleanup unused template [puppet] - 10https://gerrit.wikimedia.org/r/533974 (owner: 10Gehel) [14:01:16] (03CR) 10MSantos: [C: 03+1] maps: cleanup unused template [puppet] - 10https://gerrit.wikimedia.org/r/533974 (owner: 10Gehel) [14:01:39] hashar: ok, I will get rid of it then :) [14:03:36] hashar: I merged my change, and I got your wikiversions back :) [14:07:36] 10Operations, 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats, and 3 others: Piwik JS isn't cached - https://phabricator.wikimedia.org/T230772 (10Nuria) I can see the cache-control: max-age=604800 , I think @ema needs to change something on his end so varnish /ATS settings apply? [14:11:08] 10Operations, 10Maps, 10Product-Infrastructure-Team-Backlog: Lake Huron missing due to apparent OSM vandalism - https://phabricator.wikimedia.org/T231691 (10MSantos) >>! In T231691#5460989, @Gehel wrote: >>>! In T231691#5460894, @Haros wrote: >> We really need a way to trigger a refresh without creating a ne... 
[14:12:00] RECOVERY - PHP opcache health on mwdebug1001 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:12:55] marostegui: thx :) [14:14:34] 10Operations: Conffile handling for PHP 7.2 packages - https://phabricator.wikimedia.org/T231881 (10MoritzMuehlenhoff) As a workaround the update can be deployed with the following Cumin command (still need to get to the bottom of what causes conffile prompts here): ` sudo cumin mwdebug1001* 'export DEBIAN_FRO... [14:21:50] !log upgrading app server canaries to PHP 7.2.22 T230024 [14:21:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:52] T230024: Update component/php72 to 7.2.22 - https://phabricator.wikimedia.org/T230024 [14:22:00] PROBLEM - High CPU load on API appserver on mw1341 is CRITICAL: CRITICAL - load average: 79.83, 37.30, 25.85 https://wikitech.wikimedia.org/wiki/Application_servers [14:22:18] PROBLEM - High CPU load on API appserver on mw1281 is CRITICAL: CRITICAL - load average: 63.43, 29.23, 20.96 https://wikitech.wikimedia.org/wiki/Application_servers [14:22:26] PROBLEM - High CPU load on API appserver on mw1313 is CRITICAL: CRITICAL - load average: 89.51, 40.94, 27.24 https://wikitech.wikimedia.org/wiki/Application_servers [14:22:44] PROBLEM - High CPU load on API appserver on mw1317 is CRITICAL: CRITICAL - load average: 76.54, 41.03, 27.41 https://wikitech.wikimedia.org/wiki/Application_servers [14:23:34] RECOVERY - High CPU load on API appserver on mw1341 is OK: OK - load average: 39.91, 37.40, 27.15 https://wikitech.wikimedia.org/wiki/Application_servers [14:23:48] PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 70.48, 30.04, 19.93 https://wikitech.wikimedia.org/wiki/Application_servers [14:23:58] PROBLEM - High CPU load on API appserver on mw1222 is CRITICAL: CRITICAL - load average: 50.68, 24.78, 17.68 https://wikitech.wikimedia.org/wiki/Application_servers 
[14:24:00] RECOVERY - High CPU load on API appserver on mw1313 is OK: OK - load average: 34.66, 36.24, 27.01 https://wikitech.wikimedia.org/wiki/Application_servers [14:24:08] PROBLEM - High CPU load on API appserver on mw1230 is CRITICAL: CRITICAL - load average: 49.15, 23.84, 15.92 https://wikitech.wikimedia.org/wiki/Application_servers [14:24:12] PROBLEM - High CPU load on API appserver on mw1285 is CRITICAL: CRITICAL - load average: 102.66, 41.87, 25.80 https://wikitech.wikimedia.org/wiki/Application_servers [14:24:14] moritzm: ^ you? is that just cache invalidations? [14:24:20] RECOVERY - High CPU load on API appserver on mw1317 is OK: OK - load average: 31.41, 35.30, 26.69 https://wikitech.wikimedia.org/wiki/Application_servers [14:24:24] I haven't done anything yet [14:24:34] (and will only upgrade mw1261 initially anyway) [14:24:37] ah [14:24:40] PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 49.85, 27.31, 17.83 https://wikitech.wikimedia.org/wiki/Application_servers [14:25:30] RECOVERY - High CPU load on API appserver on mw1281 is OK: OK - load average: 23.23, 31.55, 24.05 https://wikitech.wikimedia.org/wiki/Application_servers [14:25:34] RECOVERY - High CPU load on API appserver on mw1222 is OK: OK - load average: 25.10, 23.30, 17.87 https://wikitech.wikimedia.org/wiki/Application_servers [14:25:42] RECOVERY - High CPU load on API appserver on mw1230 is OK: OK - load average: 18.52, 20.54, 15.53 https://wikitech.wikimedia.org/wiki/Application_servers [14:25:50] are these php7 API servers? 
I saw https://phabricator.wikimedia.org/T231011 listed in the weekly document [14:26:14] RECOVERY - High CPU load on API appserver on mw1226 is OK: OK - load average: 20.65, 23.87, 17.55 https://wikitech.wikimedia.org/wiki/Application_servers [14:26:36] PROBLEM - High CPU load on API appserver on mw1290 is CRITICAL: CRITICAL - load average: 61.84, 28.12, 20.60 https://wikitech.wikimedia.org/wiki/Application_servers [14:26:58] PROBLEM - High CPU load on API appserver on mw1286 is CRITICAL: CRITICAL - load average: 60.17, 29.37, 22.24 https://wikitech.wikimedia.org/wiki/Application_servers [14:27:02] PROBLEM - High CPU load on API appserver on mw1284 is CRITICAL: CRITICAL - load average: 105.93, 42.44, 25.36 https://wikitech.wikimedia.org/wiki/Application_servers [14:27:26] PROBLEM - High CPU load on API appserver on mw1283 is CRITICAL: CRITICAL - load average: 91.14, 41.83, 26.03 https://wikitech.wikimedia.org/wiki/Application_servers [14:27:54] latency is still elevated as is mcrouter traffic [14:28:16] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:fcgi://127.0.0.1:9000 method=GET https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method= [14:28:31] !log hashar@deploy1001 Finished scap: testwiki to 1.34.0-wmf.21 and rebuild l10n cache - T220746 (duration: 50m 09s) [14:28:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:34] RECOVERY - High CPU load on API appserver on mw1286 is OK: OK - load average: 28.28, 27.71, 22.40 https://wikitech.wikimedia.org/wiki/Application_servers [14:28:34] T220746: 1.34.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T220746 [14:28:58] RECOVERY - High CPU load on API appserver on mw1285 is OK: 
OK - load average: 15.92, 28.07, 24.53 https://wikitech.wikimedia.org/wiki/Application_servers [14:29:06] oh, I didn't realize hashar was rebuilding l10n cache [14:29:12] that often causes appserver high CPU [14:31:28] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [14:31:48] RECOVERY - High CPU load on API appserver on mw1233 is OK: OK - load average: 12.21, 21.93, 21.97 https://wikitech.wikimedia.org/wiki/Application_servers [14:31:48] RECOVERY - High CPU load on API appserver on mw1284 is OK: OK - load average: 16.12, 28.54, 24.43 https://wikitech.wikimedia.org/wiki/Application_servers [14:32:14] RECOVERY - High CPU load on API appserver on mw1283 is OK: OK - load average: 15.66, 27.38, 24.29 https://wikitech.wikimedia.org/wiki/Application_servers [14:32:58] RECOVERY - High CPU load on API appserver on mw1290 is OK: OK - load average: 16.52, 28.06, 25.14 https://wikitech.wikimedia.org/wiki/Application_servers [14:33:04] so [14:33:12] cdanis: yeah and it is done [14:33:20] rest is the usual bytecode cache being rebuild [14:33:33] well, there's still something else mysterious going on with the appserver fleet [14:34:12] https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=11&fullscreen&orgId=1&from=now-2d&to=now [14:34:12] jouncebot: now [14:34:12] No deployments scheduled for the next 1 hour(s) and 25 minute(s) [14:34:55] cdanis: maybe that is related to 1.34.0-wmf.20 which I have deployed on rest of wikis around that time [14:35:08] FYI I'll be bold and merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/531142 no point in the high cpu alerts anymore IMHO [14:35:20] hmm n that was before 
[14:36:27] (03PS2) 10Filippo Giunchedi: mediawiki: remove per-host high CPU alerts [puppet] - 10https://gerrit.wikimedia.org/r/531142 (https://phabricator.wikimedia.org/T230396) [14:37:51] (03CR) 10Hashar: [C: 03+2] Group0 to 1.34.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534154 (https://phabricator.wikimedia.org/T220746) (owner: 10Hashar) [14:38:17] (03CR) 10Filippo Giunchedi: [C: 03+2] mediawiki: remove per-host high CPU alerts [puppet] - 10https://gerrit.wikimedia.org/r/531142 (https://phabricator.wikimedia.org/T230396) (owner: 10Filippo Giunchedi) [14:39:12] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Promote db1133 as wikitech master T229657 (duration: 00m 54s) [14:39:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:15] T229657: Switchover m5 primary master: db1073 to db1133: Tuesday 3rd Sept at 13:00 UTC - https://phabricator.wikimedia.org/T229657 [14:40:20] (03Merged) 10jenkins-bot: Group0 to 1.34.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534154 (https://phabricator.wikimedia.org/T220746) (owner: 10Hashar) [14:40:38] (03CR) 10jenkins-bot: Group0 to 1.34.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534154 (https://phabricator.wikimedia.org/T220746) (owner: 10Hashar) [14:41:34] eeek [14:41:39] marostegui: ah I am pushing group0 [14:41:42] but that should be fast [14:42:08] wait for canaries [14:42:40] hashar: I'm fully done [14:44:20] marostegui: congratulations [14:45:05] 10Operations, 10DBA, 10wikitech.wikimedia.org, 10cloud-services-team (Kanban): Switchover m5 primary master: db1073 to db1133: Tuesday 3rd Sept at 13:00 UTC - https://phabricator.wikimedia.org/T229657 (10Marostegui) [14:45:40] even sync wikiversions takes age :-\ [14:45:43] !log hashar@deploy1001 rebuilt and synchronized wikiversions files: group0 to 1.34.0-wmf.21 - T220746 [14:45:44] 10Operations, 10DBA: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 
(10Marostegui) [14:45:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:46] T220746: 1.34.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T220746 [14:45:47] 10Operations, 10DBA, 10wikitech.wikimedia.org, 10cloud-services-team (Kanban): Switchover m5 primary master: db1073 to db1133: Tuesday 3rd Sept at 13:00 UTC - https://phabricator.wikimedia.org/T229657 (10Marostegui) 05Open→03Resolved This is all done - db1073 will be decommissioned in a few days (most... [14:46:42] PROBLEM - graphoid endpoints health on scb2006 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [14:47:18] PROBLEM - graphoid endpoints health on scb1004 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [14:47:28] bajh [14:47:34] :\ [14:47:56] PROBLEM - Graphoid LVS codfw on graphoid.svc.codfw.wmnet is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Graphoid [14:48:18] anyone familiar with Graphoid? 
[14:48:38] PROBLEM - graphoid endpoints health on scb2004 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [14:48:48] seems that as part of rolling 1.34.0-wmf.21 that broke Graphoid somehow :-\ [14:49:18] 10Operations, 10DBA: Decommission db1073.eqiad.wmnet - https://phabricator.wikimedia.org/T231892 (10Marostegui) [14:49:36] PROBLEM - graphoid endpoints health on scb2003 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [14:49:48] I don't even know where that health check is defined :\ [14:50:05] 10Operations, 10DBA: Decommission db1073.eqiad.wmnet - https://phabricator.wikimedia.org/T231892 (10Marostegui) [14:50:10] 10Operations, 10DBA, 10wikitech.wikimedia.org, 10cloud-services-team (Kanban): Switchover m5 primary master: db1073 to db1133: Tuesday 3rd Sept at 13:00 UTC - https://phabricator.wikimedia.org/T229657 (10Marostegui) [14:50:20] 10Operations, 10DBA: Decommission db1073.eqiad.wmnet - https://phabricator.wikimedia.org/T231892 (10Marostegui) [14:50:23] 10Operations, 10DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (10Marostegui) [14:50:57] 10Operations, 10DBA: Decommission db1073.eqiad.wmnet - https://phabricator.wikimedia.org/T231892 (10Marostegui) p:05Triage→03Normal This host was just removed from being a master [T229657] let's give it a few more days before actually start its decommissioning process. [14:51:03] hashar: if you can find mobrovac he might know how to debug. 
Second best bet would be akosiaris I think [14:51:27] thanks bd808 :) [14:53:21] Load failed with response code 40 [14:53:22] 403 [14:53:24] err [14:53:44] this seems to be a mw-side issue [14:54:04] graphoid gets back the "invalidhash" error from mw api [14:54:38] PROBLEM - graphoid endpoints health on scb1001 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [14:54:40] PROBLEM - graphoid endpoints health on scb2002 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [14:54:42] for action=graph&title=Extension:Graph/Demo&hash=2e25518b199b22ab9043f7ce9a0cd1370b27d77a [14:54:46] hashar: ^ [14:55:06] so I guess time for me to rollback group0 ;] [14:56:50] (03PS3) 10Marostegui: wmnet: Update s8-master record [dns] - 10https://gerrit.wikimedia.org/r/531455 (https://phabricator.wikimedia.org/T230762) [14:56:56] RECOVERY - graphoid endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [14:57:01] (03PS4) 10Marostegui: mariadb: Promote db1109 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/531189 (https://phabricator.wikimedia.org/T230762) [14:57:02] RECOVERY - graphoid endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [14:57:07] (03PS1) 10Hashar: Revert "Group0 to 1.34.0-wmf.21" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534171 (https://phabricator.wikimedia.org/T220746) [14:57:19] !log hashar@deploy1001 rebuilt and synchronized wikiversions files: Rollback group0 to 1.34.0-wmf.21 - T220746 [14:57:19] 
(03CR) 10Hashar: [C: 03+2] "Already rollbacked" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534171 (https://phabricator.wikimedia.org/T220746) (owner: 10Hashar) [14:57:20] RECOVERY - graphoid endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [14:57:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:22] T220746: 1.34.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T220746 [14:57:26] RECOVERY - graphoid endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [14:57:30] RECOVERY - graphoid endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [14:57:40] IIRC there was some graph extension related patch merged [14:57:48] RECOVERY - Graphoid LVS codfw on graphoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Graphoid [14:57:50] lemme verify that [14:58:12] RECOVERY - graphoid endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [14:58:14] filling a bug [14:58:15] (03Merged) 10jenkins-bot: Revert "Group0 to 1.34.0-wmf.21" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534171 (https://phabricator.wikimedia.org/T220746) (owner: 10Hashar) [14:58:26] https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/Kartographer/+/531159/ [14:58:33] (03CR) 10jenkins-bot: Revert "Group0 to 1.34.0-wmf.21" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534171 (https://phabricator.wikimedia.org/T220746) (owner: 10Hashar) [14:58:43] that would probably break graphoid I guess ? [14:58:55] RoanKattouw: ^ ? [14:59:19] It should not affect Graphoid, it was in the Graph extension [14:59:31] action=graph is the graph extension [14:59:33] Although, huh, invalidhash? 
Interesting [14:59:47] Maybe it's yet another bit of fallout from my Graph extension change :( [14:59:53] there are no mw logs for that has id that i could find [15:00:01] mobrovac: akosiaris I have filled it as https://phabricator.wikimedia.org/T231894 [15:00:04] s/has/hash/ [15:00:09] havent filled a lot of details though ;-\ [15:00:13] hashar: ok thanks [15:00:42] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, although I think this is the kind of cook book which warrants a similar sanity check like Cumin? In Cumin is prints the host m" [cookbooks] - 10https://gerrit.wikimedia.org/r/531897 (https://phabricator.wikimedia.org/T231066) (owner: 10Volans) [15:01:20] mobrovac: akosiaris and I have rollbacked so at least we are safe "tm" [15:01:24] Hmm perhaps it's caused by https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Graph/+/493628 which is a much larger change than mine, and is new in wmf.21 [15:02:09] (03PS10) 10Ori.livneh: Configure forensic logging of Apache requests; enable on beta [puppet] - 10https://gerrit.wikimedia.org/r/511751 [15:02:17] I am going to promote "testwiki" to 1.34.0-wmf.21 though no idea whether graphoid is enabled there [15:02:29] ah yeah it is [15:03:51] if do promote it, the checks won't start failing for graphoid because it only uses mw.org for that [15:03:59] Wait, I'm looking for the patch that fixed the object-instead-of-array issue but I can't find it. Maybe that only affected Kartographer, not Graph? 
[15:04:04] but since we know .21 causes problems, i'm not sure you should promote it [15:04:06] (03PS1) 10Hashar: Promote testwiki to 1.34.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534173 (https://phabricator.wikimedia.org/T231894) [15:04:31] (03CR) 10Hashar: [C: 03+2] Promote testwiki to 1.34.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534173 (https://phabricator.wikimedia.org/T231894) (owner: 10Hashar) [15:04:49] Hmm actually [15:04:58] I am just promoting testwiki [15:05:03] I wonder if Graphoid is really broken or if it's an assumption in the check that needs to be updated [15:05:10] no idea whether that would help to debug the graphoid thing though [15:05:27] (03Merged) 10jenkins-bot: Promote testwiki to 1.34.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534173 (https://phabricator.wikimedia.org/T231894) (owner: 10Hashar) [15:05:41] Where can I find the HTTP request that the monitoring code makes? The wikitech link in the icinga-wm message is dead [15:06:06] check_wmf_services in puppet [15:06:25] singular [15:06:25] bah [15:06:41] (03CR) 10jenkins-bot: Promote testwiki to 1.34.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534173 (https://phabricator.wikimedia.org/T231894) (owner: 10Hashar) [15:06:43] check_wmf_service!http://graphoid.svc.codfw.wmnet:19000!15 [15:07:14] /usr/bin/service-checker-swagger -t $ARG2$ $HOSTNAME$ $ARG1$ [15:07:14] !log hashar@deploy1001 rebuilt and synchronized wikiversions files: testwiki 1.34.0-wmf.21 for T231894 - T220746 [15:07:16] bah :-\ [15:07:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:18] T231894: 1.34.0-wmf.21 cause Graphoid service check to fail due to 403 from mediawiki.org - https://phabricator.wikimedia.org/T231894 [15:07:18] T220746: 1.34.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T220746 [15:07:42] RoanKattouw: curl http://graphoid.svc.codfw.wmnet:19000/?spec | jq . 
[15:07:48] it's the x-amples stanza [15:07:53] Thanks [15:07:56] stanzas more like it [15:08:33] RoanKattouw: https://github.com/wikimedia/mediawiki-services-graphoid/blob/master/spec.yaml#L67 [15:08:36] more or less I see "request": { "params": { "format": "png", "title": "Extension:Graph/Demo", "revid": "0", "id": "2e25518b199b22ab9043f7ce9a0cd1370b27d77a"} [15:09:29] I will let you guys find out the magic. I am getting a breka and be back later for dinner [15:09:48] but I guess anyone from #wikimedia-releng should be able to promote to 1.34.0-wmf.21 again if need be [15:09:50] I have a suspicion, first going to try to confirm it locally [15:10:30] so the monitoring script tries to query /mediawiki.org/v1/png/Extension:Graph/Demo/0/2e25518b199b22ab9043f7ce9a0cd1370b27d77a [15:10:40] what that internally means for graphoid's requests, I am not sure [15:11:32] PROBLEM - Nginx local proxy to apache on mw1287 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:12:23] 10Operations, 10Traffic: Cannot download STL files due to network error - https://phabricator.wikimedia.org/T231422 (10ema) @Gilles: the issue should now be fixed, can you confirm? [15:12:24] RoanKattouw: note btw that graphoid is under a code stewardship requests per https://phabricator.wikimedia.org/T211881. 
Getting changes to it, might very well prove challenging [15:12:37] I don't think we're going to need to change Graphoid itself [15:12:42] I think I screwed something up in the Graph extension [15:13:01] ok, that makes it way easier then [15:13:06] RECOVERY - Nginx local proxy to apache on mw1287 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 593 bytes in 1.994 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:15:45] Yup I found it, patch incoming [15:16:20] Same mistake as I made in Kartographer, except in Graph it somehow didn't cause PHP fatals, instead it just returned "invalidhash" for everything [15:18:27] 10Operations, 10Traffic: Cannot download STL files due to network error - https://phabricator.wikimedia.org/T231422 (10Gilles) 05Open→03Resolved Fix confirmed [15:19:18] Patch in Gerrit and +2ed: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Graph/+/534174 [15:20:05] I'll cherry-pick and deploy once it makes it through Jenkins, unless someone beats me to it [15:21:07] anyone to assist for a big rename? [15:21:10] 150k [15:21:37] * Vito eyes RoanKattouw [15:22:08] I've never done one before, is there documentation about what I should do? [15:23:01] RoanKattouw: if some rename gets stuck there should be a script to restart it [15:23:48] Aha, fixStuckGlobalRename.php sounds relevant [15:24:05] (03CR) 10Alexandros Kosiaris: [C: 04-1] First version of the wikifeeds chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/526679 (https://phabricator.wikimedia.org/T229287) (owner: 10MSantos) [15:24:57] RoanKattouw: I think also nukeTehWikis.php should be of interest [15:25:01] haha [15:25:17] Vito: Could you link me to the log entry for the stuck rename please? 
[15:25:33] currently I didn't perform the rename yet [15:25:59] but we prefer to have some sysadmin around while doing big renames [15:26:29] Oh I see [15:27:09] I'm about to take a shower then go to the office, but I should be back on line in about an hour [15:27:49] (Sorry, I know it looks like I'm working, but that's only because I got up at 6:45am for an early meeting and then got a train blocker dropped in my lap) [15:28:21] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic: (OoW) lvs2006 Embedded Flash/SD-CARD iLO errors - https://phabricator.wikimedia.org/T192082 (10Papaul) p:05Normal→03Low [15:28:46] 10Operations, 10ops-codfw: (OoW) lvs2002 repeated usb connect/disconnect message - https://phabricator.wikimedia.org/T148017 (10Papaul) p:05Normal→03Low [15:29:13] 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: (OoW) lvs2006 crashed into (what it seems) an unrecoverable state - https://phabricator.wikimedia.org/T209337 (10Papaul) p:05Lowest→03Low [15:30:41] no pb RoanKattouw, I'll take a look at rename status until completed [15:30:56] OK cool, and if it gets stuck feel free to ping me [15:31:20] I just might not respond immediately [15:32:43] !log unban elastic1027 from production-search-eqiad [15:32:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:39] (03PS1) 10Ladsgroup: Set item terms migration stage for Wikidata on WRITE_BOTH up to Q2m [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534183 (https://phabricator.wikimedia.org/T225055) [15:38:22] (03PS1) 10Ema: envoyproxy: allow overriding tls_port [puppet] - 10https://gerrit.wikimedia.org/r/534184 [15:39:32] (03CR) 10jerkins-bot: [V: 04-1] envoyproxy: allow overriding tls_port [puppet] - 10https://gerrit.wikimedia.org/r/534184 (owner: 10Ema) [15:39:34] (03PS1) 10Reedy: Re-apply wgFlaggedRevsOverride = false on ukwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534185 (https://phabricator.wikimedia.org/T227260) [15:40:01] (03PS2) 10Reedy: Re-apply 
wgFlaggedRevsOverride = false on ukwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534185 (https://phabricator.wikimedia.org/T227260) [15:40:07] jouncebot: now [15:40:07] No deployments scheduled for the next 0 hour(s) and 19 minute(s) [15:40:12] jouncebot: next [15:40:12] In 0 hour(s) and 19 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190903T1600) [15:41:52] (03PS2) 10Ema: envoyproxy: allow overriding tls_port [puppet] - 10https://gerrit.wikimedia.org/r/534184 [15:43:07] (03CR) 10Reedy: [C: 03+2] Re-apply wgFlaggedRevsOverride = false on ukwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534185 (https://phabricator.wikimedia.org/T227260) (owner: 10Reedy) [15:43:52] (03CR) 10jerkins-bot: [V: 04-1] envoyproxy: allow overriding tls_port [puppet] - 10https://gerrit.wikimedia.org/r/534184 (owner: 10Ema) [15:49:15] (03PS3) 10Ema: envoyproxy: allow overriding TLS port [puppet] - 10https://gerrit.wikimedia.org/r/534184 [15:50:44] 10Operations, 10Elasticsearch, 10Wikimedia-Logstash, 10observability, and 2 others: Migrate Elasticsearch from deprecated Gelf logstash input to rsyslog Kafka logging pipeline - https://phabricator.wikimedia.org/T225125 (10Mathew.onipe) rsyslog Json requires the `@cee` token which must be provided accordin... 
[15:51:19] (03CR) 10jerkins-bot: [V: 04-1] envoyproxy: allow overriding TLS port [puppet] - 10https://gerrit.wikimedia.org/r/534184 (owner: 10Ema) [15:54:07] (03Merged) 10jenkins-bot: Re-apply wgFlaggedRevsOverride = false on ukwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534185 (https://phabricator.wikimedia.org/T227260) (owner: 10Reedy) [15:54:24] (03CR) 10jenkins-bot: Re-apply wgFlaggedRevsOverride = false on ukwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534185 (https://phabricator.wikimedia.org/T227260) (owner: 10Reedy) [15:55:40] !log reedy@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T227260 (duration: 00m 54s) [15:55:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:44] T227260: Last reviewed revision is shown to logged-out users in ukwiki instead of last revision - https://phabricator.wikimedia.org/T227260 [15:56:47] RoanKattouw: everything worked fine, ty [15:57:52] 10Operations, 10Release Pipeline, 10Maps (Kartotherian), 10Patch-For-Review: Create blubberfile for deploying kartotherian into docker environment. - https://phabricator.wikimedia.org/T223275 (10Jdforrester-WMF) I believe that this is now Done? [16:00:04] godog and _joe_: My dear minions, it's time we take the moon! Just kidding. Time for Puppet SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190903T1600). [16:00:04] No GERRIT patches in the queue for this window AFAICS. 
[16:01:01] !log joal@deploy1001 Finished deploy [analytics/refinery@8b17711]: Fixes for regualr analytics deploy (duration: 136m 59s) [16:01:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:32] (03PS4) 10Ema: envoyproxy: allow overriding TLS port [puppet] - 10https://gerrit.wikimedia.org/r/534184 [16:01:42] 10Operations, 10Epic, 10Maps (Kartotherian), 10Patch-For-Review: Move Kartotherian and Tilerator to Kubernetes - https://phabricator.wikimedia.org/T216826 (10MSantos) [16:01:45] 10Operations, 10Release Pipeline, 10Maps (Kartotherian), 10Patch-For-Review: Create blubberfile for deploying kartotherian into docker environment. - https://phabricator.wikimedia.org/T223275 (10MSantos) 05Open→03Resolved [16:03:41] (03CR) 10jerkins-bot: [V: 04-1] envoyproxy: allow overriding TLS port [puppet] - 10https://gerrit.wikimedia.org/r/534184 (owner: 10Ema) [16:12:16] (03PS1) 10Fdans: role::common::aqs: update druid mediawiki's datasource [puppet] - 10https://gerrit.wikimedia.org/r/534191 [16:15:03] CI is completely broken in the wmf.21 branch, we'll need releng to fix that before my train blocker fix can be deployed [16:16:31] Or just force-merge it? 
[16:19:31] (03CR) 10Ayounsi: [C: 03+2] CLI: suppress ncclient noisy logger [software/homer] - 10https://gerrit.wikimedia.org/r/533570 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [16:21:45] (03CR) 10Phamhi: "Puppet successful compile result can be found here: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/18131/cons" [puppet] - 10https://gerrit.wikimedia.org/r/533606 (owner: 10Phamhi) [16:22:17] (03CR) 10Ayounsi: [C: 03+2] config: inject role and site to the configuration [software/homer] - 10https://gerrit.wikimedia.org/r/533568 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [16:23:11] Sure, I guess [16:23:19] (03PS5) 10Ema: envoyproxy: allow overriding TLS port [puppet] - 10https://gerrit.wikimedia.org/r/534184 [16:25:04] RoanKattouw: It's also "only" UBN because it blocks the train, it's just group0 graphs that might be broken. [16:25:18] I know [16:25:33] But it breaking graphs on mw.org caused icinga to declare all of Graphoid to be broken [16:27:11] (03CR) 10Ema: "pcc lgtm https://puppet-compiler.wmflabs.org/compiler1002/18155/miscweb1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/534184 (owner: 10Ema) [16:27:41] * James_F nods. [16:29:41] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "Verify it doesn't impact what we have installed already with the compiler, but it seems GTG to me." 
[puppet] - 10https://gerrit.wikimedia.org/r/534184 (owner: 10Ema) [16:34:59] ERROR:zuul.Repo:Unable to initialize repo for https://gerrit.wikimedia.org/r/npm-test [16:35:05] !log catrope@deploy1001 Synchronized php-1.34.0-wmf.21/extensions/Graph/includes/ApiGraph.php: T231894 (duration: 00m 55s) [16:35:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:09] T231894: 1.34.0-wmf.21 cause Graphoid service check to fail due to 403 from mediawiki.org - https://phabricator.wikimedia.org/T231894 [16:35:29] Could not find the registration file for the extension … [16:35:34] This --^^ should fix the train blocker and make it safe to reenable wmf.21 on group0 without breaking the Graphoid monitoring check [16:35:35] RoanKattouw: those are the errors on wmf.21 in Jenkins [16:35:37] strange indeed [16:35:47] Krinkle: Yes, extension loading is completely broken in wmf.21 CI right now [16:35:56] Which is why I force-merged the patch I just deployed [16:36:04] Also why is it cloning a think called "npm-test.git" that repo doesn't exist [16:36:07] buggy quibble release? [16:36:38] (-releng) [16:40:08] (03CR) 10Ayounsi: transports: add JunOS transport (032 comments) [software/homer] - 10https://gerrit.wikimedia.org/r/533558 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [16:55:07] (03PS1) 10Odder: Add high-density logos for the Incubator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534197 (https://phabricator.wikimedia.org/T230122) [16:57:40] (03PS2) 10Odder: Add high-density logos for Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534152 (https://phabricator.wikimedia.org/T230120) [16:59:17] (03CR) 10Ayounsi: [C: 03+1] config: enforce positional vs. 
keyword args [software/homer] - 10https://gerrit.wikimedia.org/r/533623 (owner: 10Volans) [16:59:56] (03PS1) 10Odder: Add high-density logos for the Incubator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534198 (https://phabricator.wikimedia.org/T230122) [17:00:04] cscott, arlolra, subbu, halfak, and accraze: My dear minions, it's time we take the moon! Just kidding. Time for Services – Graphoid / Parsoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190903T1700). [17:02:27] (03CR) 10jerkins-bot: [V: 04-1] Add high-density logos for Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534152 (https://phabricator.wikimedia.org/T230120) (owner: 10Odder) [17:02:39] (03CR) 10jerkins-bot: [V: 04-1] Add high-density logos for the Incubator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534198 (https://phabricator.wikimedia.org/T230122) (owner: 10Odder) [17:14:51] Hi opsens. Do you know what's wmf-nda, ie. who should be there? [17:18:10] no parsoid deploy today [17:22:42] 10Operations, 10Discovery-Search, 10Elasticsearch, 10Patch-For-Review: Reindex commonswiki as shards have grown beyond critical threshold - https://phabricator.wikimedia.org/T231446 (10Gehel) [17:24:06] 10Operations, 10netops: Check router ACLs for early install SSH access from puppet masters/cumin hosts - https://phabricator.wikimedia.org/T231811 (10ayounsi) Hosts in the `cloud-hosts1-b-eqiad` vlan are behind the `labs-in4` firewall filter (applied on traffic going out of that vlan), which also includes the... [17:30:39] Urbanecm: do you mean the Phabricator group? [17:30:57] Nemo_bis: yes [17:31:56] Urbanecm: well, isn't it described on the project itself? 
https://phabricator.wikimedia.org/project/profile/61/ [17:32:19] mostly https://wikitech.wikimedia.org/wiki/Ops_Onboarding + https://wikitech.wikimedia.org/wiki/Volunteer_NDA cover it [17:32:23] James_F: I guess I can try to promote 1.34.0-wmf.21 again now that Graph is solved ;) [17:32:29] jouncebot: now [17:32:29] For the next 0 hour(s) and 27 minute(s): Services – Graphoid / Parsoid / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190903T1700) [17:35:28] (03PS1) 10Hashar: Group0 to 1.34.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534203 (https://phabricator.wikimedia.org/T220746) [17:35:53] (03CR) 10Hashar: [C: 03+2] "Take 2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534203 (https://phabricator.wikimedia.org/T220746) (owner: 10Hashar) [17:35:59] (03PS15) 10Mathew.onipe: lvs: allow access to wdqs lvs on port 8888 [puppet] - 10https://gerrit.wikimedia.org/r/529053 (https://phabricator.wikimedia.org/T176875) [17:36:01] (03PS2) 10Mathew.onipe: elasticsearch: add syslog logging option [puppet] - 10https://gerrit.wikimedia.org/r/533928 (https://phabricator.wikimedia.org/T225125) [17:37:30] (03Merged) 10jenkins-bot: Group0 to 1.34.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534203 (https://phabricator.wikimedia.org/T220746) (owner: 10Hashar) [17:38:07] (03CR) 10jenkins-bot: Group0 to 1.34.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534203 (https://phabricator.wikimedia.org/T220746) (owner: 10Hashar) [17:38:18] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Kanban, 10netops: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (10Cmjohnson) @Ottomata Do you still need the 2nd port now that you're not doing the cloud thing? If so which... [17:38:44] (03CR) 10Hashar: [C: 03+2] "That is 1.34.0-wmf.21 !" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/534203 (https://phabricator.wikimedia.org/T220746) (owner: 10Hashar) [17:40:53] !log hashar@deploy1001 rebuilt and synchronized wikiversions files: group0 to 1.34.0-wmf.21 [17:40:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:56] !log Pulled I9b64a2bb770 into wmf.21 production on the deploy server; no need to deploy to app-servers, CI-only fix. [17:45:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:45] (03PS2) 10Herron: prometheus: aggregate systemd failed metrics [puppet] - 10https://gerrit.wikimedia.org/r/533282 (https://phabricator.wikimedia.org/T230570) [18:03:34] (03CR) 10Herron: prometheus: aggregate systemd failed metrics (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/533282 (https://phabricator.wikimedia.org/T230570) (owner: 10Herron) [18:06:10] (03PS1) 10Herron: prometheus: deploy prometheus-ipsec-exporter to all sites [puppet] - 10https://gerrit.wikimedia.org/r/534210 (https://phabricator.wikimedia.org/T230236) [18:18:17] (03CR) 10Eevans: [C: 03+1] sessionstore: Bump limits and requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/533922 (https://phabricator.wikimedia.org/T229697) (owner: 10Alexandros Kosiaris) [18:20:24] (03PS3) 10Herron: eventgate-main: add new kafka-main brokers to broker list [deployment-charts] - 10https://gerrit.wikimedia.org/r/529428 (https://phabricator.wikimedia.org/T225005) [18:21:42] (03CR) 10Herron: "based on the wikitech docs the chart version will need to be bumped, added that to ps3" [deployment-charts] - 10https://gerrit.wikimedia.org/r/529428 (https://phabricator.wikimedia.org/T225005) (owner: 10Herron) [18:23:31] (03PS7) 10MSantos: First version of the wikifeeds chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/526679 (https://phabricator.wikimedia.org/T229287) [18:25:02] (03PS1) 10Subramanya Sastry: Enable loading Parsoid/PHP as an extension on the beta 
cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534215 (https://phabricator.wikimedia.org/T231569) [18:25:43] (03CR) 10MSantos: "> Patch Set 6:" (036 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/526679 (https://phabricator.wikimedia.org/T229287) (owner: 10MSantos) [18:26:12] (03CR) 10Subramanya Sastry: "Maybe wait till the Parsoid server on the beta cluster is converted to an appserver. But this should be safe nevertheless." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534215 (https://phabricator.wikimedia.org/T231569) (owner: 10Subramanya Sastry) [18:27:18] (03CR) 10Subramanya Sastry: "We should figure out if we need any other custom config file similar to the rt test settings file. I already flagged https://phabricator.w" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534215 (https://phabricator.wikimedia.org/T231569) (owner: 10Subramanya Sastry) [18:30:48] Nemo_bis: but it doesn't say who should be in [18:30:50] jouncebot: now [18:30:50] No deployments scheduled for the next 4 hour(s) and 29 minute(s) [18:30:56] jouncebot: next [18:30:56] In 4 hour(s) and 29 minute(s): Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190903T2300) [18:35:51] !log Livetesting on mwdebug1002 [18:35:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:50] * Urbanecm going to sync the livetest, works, will upload soon [18:41:40] !log urbanecm@deploy1001 Synchronized wmf-config/: Emergency fix: GE not loading configuration properly: newbie facing feature (duration: 00m 57s) [18:41:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:05] (03PS1) 10Urbanecm: [bugfix] Growth experiments not loading conf properly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534217 [18:43:31] (03CR) 10Urbanecm: [C: 03+2] "already deployed, emergency" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534217 (owner: 10Urbanecm) 
[18:43:33] (03PS2) 10Catrope: Enable and configure ORES damaging and goodfaith on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528537 (https://phabricator.wikimedia.org/T225562) (owner: 10Sbisson) [18:44:49] (03CR) 10jerkins-bot: [V: 04-1] [bugfix] Growth experiments not loading conf properly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534217 (owner: 10Urbanecm) [18:44:59] RoanKattouw: ^^ [18:45:07] (03CR) 10jerkins-bot: [V: 04-1] [bugfix] Growth experiments not loading conf properly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534217 (owner: 10Urbanecm) [18:45:08] that V-1'ed patch is emergency-deployed fix [18:45:10] Urbanecm: it's ops, plus staff who requests [possibly documented on a private page, I don't remember], plus volunteers who request it [18:45:17] going to fix jenkins [18:45:21] Nemo_bis: thanks [18:45:59] I think there is no unifying policy in order to let space for some ad hoc decisions, but I might misremember [18:46:21] Urbanecm: What's broken about it? Is it extending the array instead of overwriting it completely? [18:46:25] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Kanban, 10netops: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (10Ottomata) We don't! Just Analytics VLAN for now please. [18:46:34] RoanKattouw: yes [18:46:42] it showed the site help link [18:46:43] as the first one [18:46:51] not sure who added it into extension.json for the first place [18:46:52] Right. The long-term solution then probably involves changing the merge_strategy of the setting in extension.json [18:46:55] yes [18:46:59] but this is emergency solution [18:47:01] user facing [18:47:13] I'll fix the patch [18:47:22] What link? [18:47:54] Urbanecm: Also you probably don't need $wgExtensionFunctions here, you should be able to override the $wg vars directly. That's how almost all (maybe actually all?) 
of the existing $wmg vars work [18:48:08] https://gerrit.wikimedia.org/r/534217 [18:48:09] my fix [18:48:12] it has -1 from jenkins [18:48:13] lint [18:48:25] (03PS2) 10Urbanecm: [bugfix] Growth experiments not loading conf properly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534217 [18:48:27] RoanKattouw: probably... [18:48:32] not sure, lazy to test now :-) [18:48:39] fixed, because it's urgent IMO [18:48:46] Yeah no problem, it should be temporary anyway [18:48:48] yeah [18:49:11] Please file a task for a long-term solution too, ideally with an explanation of what broke and where [18:49:38] will do [18:49:49] i was quite sure it's extreg anyway, so this was easy to test [18:50:27] (03CR) 10Catrope: [C: 03+2] [bugfix] Growth experiments not loading conf properly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534217 (owner: 10Urbanecm) [18:50:37] RoanKattouw: thanks for review [18:50:45] I'll fill a task soon [18:50:51] and try the merging strategy... [18:51:54] (03Merged) 10jenkins-bot: [bugfix] Growth experiments not loading conf properly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534217 (owner: 10Urbanecm) [18:52:13] (03CR) 10jenkins-bot: [bugfix] Growth experiments not loading conf properly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534217 (owner: 10Urbanecm) [19:07:50] (03CR) 10Ottomata: [C: 03+2] role::common::aqs: update druid mediawiki's datasource [puppet] - 10https://gerrit.wikimedia.org/r/534191 (owner: 10Fdans) [19:09:17] (03CR) 10Ottomata: [C: 03+2] Switch high-traffic jobs to eventgate. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529124 (https://phabricator.wikimedia.org/T228705) (owner: 10Ppchelko) [19:09:25] (03PS2) 10Ottomata: Switch high-traffic jobs to eventgate. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529124 (https://phabricator.wikimedia.org/T228705) (owner: 10Ppchelko) [19:12:10] (03CR) 10jenkins-bot: Switch high-traffic jobs to eventgate. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/529124 (https://phabricator.wikimedia.org/T228705) (owner: 10Ppchelko) [19:12:42] !log switching jobqueue events to eventgate-main - T228705 [19:12:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:45] T228705: Migrate JobQueue to eventgate - https://phabricator.wikimedia.org/T228705 [19:14:00] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Switch high-traffic jobs to eventgate. Take 2 - T228705 (duration: 00m 56s) [19:14:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:26] !log fdans@deploy1001 Started restart [analytics/aqs/deploy@fc1d232]: (no justification provided) [19:18:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:00] !log fdans@deploy1001 Started restart [analytics/aqs/deploy@fc1d232]: (no justification provided) [19:20:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:27] (03CR) 10Jhedden: [C: 03+1] admin: convert maintain_kubeusers to systemd timer type [puppet] - 10https://gerrit.wikimedia.org/r/533606 (owner: 10Phamhi) [19:23:30] 10Operations, 10Packaging, 10serviceops, 10CPT Initiatives (Session Management Service (CDP2)): Need help to create and deploy Debian-packaged Python 3 app - https://phabricator.wikimedia.org/T229980 (10WDoranWMF) [19:42:10] !log rollback OSPF metric change on eqiad-codfw Zayo link (1320->320) [19:42:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:51] (03PS1) 10Ppchelko: Switch all non-low-traffic jobs to eventgate. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534225 (https://phabricator.wikimedia.org/T228705) [19:55:52] (03CR) 10jerkins-bot: [V: 04-1] Switch all non-low-traffic jobs to eventgate. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/534225 (https://phabricator.wikimedia.org/T228705) (owner: 10Ppchelko) [20:12:43] (03CR) 10Cwhite: [C: 03+1] prometheus: deploy prometheus-ipsec-exporter to all sites [puppet] - 10https://gerrit.wikimedia.org/r/534210 (https://phabricator.wikimedia.org/T230236) (owner: 10Herron) [20:22:29] (03PS1) 10CDanis: dbctl: indicate failed commit in announcement [software/conftool] - 10https://gerrit.wikimedia.org/r/534230 (https://phabricator.wikimedia.org/T231871) [20:44:44] 10Operations, 10ops-esams, 10netops: replace msw1-esams - https://phabricator.wikimedia.org/T185151 (10ayounsi) [20:44:46] 10Operations, 10ops-esams: Repurpose csw2-oe14/15 and lab-ex4200 as msw - https://phabricator.wikimedia.org/T215991 (10ayounsi) [21:05:46] (03PS2) 10Ppchelko: Switch all non-low-traffic jobs to eventgate. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534225 (https://phabricator.wikimedia.org/T228705) [21:06:41] (03CR) 10jerkins-bot: [V: 04-1] Switch all non-low-traffic jobs to eventgate. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534225 (https://phabricator.wikimedia.org/T228705) (owner: 10Ppchelko) [21:10:14] (03PS1) 10Ppchelko: Remove EventBusRCFeedEngine eventServiceName. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534236 (https://phabricator.wikimedia.org/T229863) [21:12:35] (03PS3) 10Ppchelko: Switch all non-low-traffic jobs to eventgate. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/534225 (https://phabricator.wikimedia.org/T228705) [21:17:48] (03PS13) 10Jforrester: Set wgNoticeProjects for wikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/471663 (https://phabricator.wikimedia.org/T208694) (owner: 10MacFan4000) [21:32:08] (03Abandoned) 10Subramanya Sastry: Make scandium a read-only appserver + enable exception logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529173 (https://phabricator.wikimedia.org/T228069) (owner: 10Subramanya Sastry) [21:34:13] (03PS1) 10Catrope: Revert "[bugfix] Growth experiments not loading conf properly" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534240 (https://phabricator.wikimedia.org/T231935) [21:34:53] (03CR) 10Catrope: [C: 04-2] "Wait for https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/534233 to be deployed first" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534240 (https://phabricator.wikimedia.org/T231935) (owner: 10Catrope) [21:55:22] RoanKattouw: you deploying "Set correct merge strategy for help panel links"? [21:55:26] https://gerrit.wikimedia.org/r/534242 [21:58:07] Yes, I listed it for the SWAT in about an hour [21:59:08] yeah, evening one. Thanks, just wasn't sure how permanent your -2 is [21:59:11] thanks! [22:00:30] Oh it was just until the extension patch got merged, which I expected to take longer [22:01:04] yeah, got it now :-) [22:01:06] thanks [22:01:29] (03CR) 10Urbanecm: [C: 03+1] "Formally: LGTM, change shouldn't break anything now." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/534240 (https://phabricator.wikimedia.org/T231935) (owner: 10Catrope) [22:01:42] voted +1 then, seems good and not breaking, once your -2 is resolved [22:08:44] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:fcgi://127.0.0.1:9000 method=GET https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method= [22:11:54] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [22:12:11] (03PS50) 10CRusnov: profile::netbox: Reorganize for splitting front and back-end. [puppet] - 10https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291) [22:14:15] (03CR) 10jerkins-bot: [V: 04-1] profile::netbox: Reorganize for splitting front and back-end. [puppet] - 10https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291) (owner: 10CRusnov) [22:16:12] (03PS51) 10CRusnov: profile::netbox: Reorganize for splitting front and back-end. [puppet] - 10https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291) [22:28:08] (03CR) 10Awight: [C: 03+1] "> Uploaded patch set 13." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/471663 (https://phabricator.wikimedia.org/T208694) (owner: 10MacFan4000) [22:33:11] 10Operations, 10DBA: Migrate MySQLs to use ROW-based replication - https://phabricator.wikimedia.org/T109179 (10TK-999) >>! 
In T109179#3952629, @jcrespo wrote: > This is important, but not a goal for this quarter- we are still blocked on mediawiki extension maintainers to be compatible with it; however, all da... [22:40:36] (03CR) 10CRusnov: "Looks ilke I elminated the changes to unprepared postgres servers, and the compiler output looks good." (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291) (owner: 10CRusnov) [22:43:38] (03CR) 10Jforrester: [C: 03+2] Set wgNoticeProjects for wikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/471663 (https://phabricator.wikimedia.org/T208694) (owner: 10MacFan4000) [22:46:58] (03PS14) 10Jforrester: Set wgNoticeProjects for wikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/471663 (https://phabricator.wikimedia.org/T208694) (owner: 10MacFan4000) [22:47:03] (03CR) 10Jforrester: [C: 03+2] "…" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/471663 (https://phabricator.wikimedia.org/T208694) (owner: 10MacFan4000) [22:50:41] (03Merged) 10jenkins-bot: Set wgNoticeProjects for wikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/471663 (https://phabricator.wikimedia.org/T208694) (owner: 10MacFan4000) [22:50:57] (03CR) 10jenkins-bot: Set wgNoticeProjects for wikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/471663 (https://phabricator.wikimedia.org/T208694) (owner: 10MacFan4000) [22:54:02] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: T208694 Set CentralNotice's wgNoticeProjects for wikimedia (duration: 00m 59s) [22:54:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:54:13] T208694: wgNoticeProjects should not default to wikimedia projects - https://phabricator.wikimedia.org/T208694 [22:54:29] James_F: thanks!!! ^ [22:54:43] AndyRussG: Seems fine, as expected. [22:54:57] Hopefully after the related patch lands in production it'll be the same. :-) [22:55:16] James_F: yeah! 
Thanks for doing this, now we can deploy that and other accumulated stuf!! :) [22:55:27] * James_F nods. [22:55:42] (I surrender the conch.) [23:00:05] MaxSem, RoanKattouw, Niharika, and Urbanecm: (Dis)respected human, time to deploy Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190903T2300). Please do the needful. [23:00:05] davidwbarratt and RoanKattouw: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:13] here! [23:00:16] I can swat. [23:00:43] * Niharika flexes fingers [23:03:21] RoanKattouw: Around? [23:03:39] Yes, coming [23:04:08] (03PS3) 10Niharika29: Enable and configure ORES damaging and goodfaith on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528537 (https://phabricator.wikimedia.org/T225562) (owner: 10Sbisson) [23:04:29] (03CR) 10Niharika29: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528537 (https://phabricator.wikimedia.org/T225562) (owner: 10Sbisson) [23:05:05] (03CR) 10Andrew Bogott: [C: 03+1] "Looks right to me! Is this in use already anywhere, such that we have to update things before/after merging?" [puppet] - 10https://gerrit.wikimedia.org/r/533758 (https://phabricator.wikimedia.org/T171188) (owner: 10Alex Monk) [23:09:04] OK, here now, sorry [23:09:16] Niharika: That ORES patch needs a table creation and a maintenance script as well [23:10:06] !log production-search-eqiad all indices index.merge.policy.deletes_pct_allowed=20 [23:10:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:14] RoanKattouw: No problem. Zuul is kinda sluggish today; will take some time. [23:10:41] I'm surprised a config patch is taking this long, don't those usually get merged in under a minute? 
[23:11:17] (03Merged) 10jenkins-bot: Enable and configure ORES damaging and goodfaith on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528537 (https://phabricator.wikimedia.org/T225562) (owner: 10Sbisson) [23:11:35] (03CR) 10jenkins-bot: Enable and configure ORES damaging and goodfaith on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528537 (https://phabricator.wikimedia.org/T225562) (owner: 10Sbisson) [23:11:37] And merged! [23:13:20] RoanKattouw: Which maintenance script and table creation? [23:13:40] Niharika: See the two "mwscript" bullet points at https://www.mediawiki.org/wiki/ORES/RCFilters#Deploying_ORES+RCFilters_to_a_new_wiki [23:16:09] RoanKattouw: Table done. For the populatedatabase script, it is normal for the script to throw up a ton of `ScoreFetcher errored for 55947634: No model available for [goodfaith] [23:16:09] ScoreFetcher errored for 55947635: No model available for [goodfaith]`? That one finished too. [23:17:04] And that change is on mwdebug1002 now if you can test it. [23:18:34] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [23:20:09] aware of ^, should work itself out in a sec [23:23:52] Niharika: Yes that is normal sadly [23:24:00] Sorry, I got distracted, I blame Greg [23:24:04] Will test now [23:25:39] davidwbarratt: Tchanders: The change is live on mwdebug1002. [23:26:35] Niharika: Re-ran the population script and it worked (weird). It's working now, good to sync [23:27:03] Okay cool. 
[23:28:34] !log niharika29@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable and configure ORES damaging and goodfaith on zhwiki T225562 (duration: 00m 58s) [23:28:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:28:37] T225562: Deploy ORES filters for zhwiki - https://phabricator.wikimedia.org/T225562 [23:29:20] thanks! [23:31:12] RoanKattouw: Safe to take your -2 off of https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/534240/ now, I reckon. [23:31:28] Whoops yes sorry [23:31:29] Done [23:31:52] (03PS2) 10Niharika29: Revert "[bugfix] Growth experiments not loading conf properly" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534240 (https://phabricator.wikimedia.org/T231935) (owner: 10Catrope) [23:31:54] Although the wmf.20 / wmf.21 patches need to be deployed first [23:32:16] the "Fix merge_strategy for GrowthExperiments help links" patches [23:32:44] RoanKattouw: It won't let me merge this one until those are merged? Or can I merge it but sync it after those are out? [23:32:58] Merge before but sync after would work [23:33:19] Cool. It's taking long to merge today so I'm trying to save time. :) [23:33:47] (03CR) 10Niharika29: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534240 (https://phabricator.wikimedia.org/T231935) (owner: 10Catrope) [23:35:58] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [23:37:47] Niharika: Looks good, thanks [23:38:11] Alright, let's sync it. 
[23:41:09] !log niharika29@deploy1001 Synchronized php-1.34.0-wmf.20/includes/block: Allow CompositeBlock::appliesToRight to return null when unsure T229417, T231145 (duration: 00m 55s) [23:41:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:42:30] !log niharika29@deploy1001 Synchronized php-1.34.0-wmf.20/tests/phpunit/: Allow CompositeBlock::appliesToRight to return null when unsure T229417, T231145 (duration: 00m 57s) [23:42:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:42:33] davidwbarratt: Tchanders: Done! :) [23:43:46] YAY! [23:44:17] '\o/' [23:44:36] (03Merged) 10jenkins-bot: Revert "[bugfix] Growth experiments not loading conf properly" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534240 (https://phabricator.wikimedia.org/T231935) (owner: 10Catrope) [23:46:52] (03CR) 10jenkins-bot: Revert "[bugfix] Growth experiments not loading conf properly" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534240 (https://phabricator.wikimedia.org/T231935) (owner: 10Catrope) [23:47:14] RoanKattouw: Can you test https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/GrowthExperiments/+/534243/ or https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/GrowthExperiments/+/534242/? [23:48:26] Niharika: I can, but I'll need to test them together with the config patch. Is that also on mwdebug? [23:48:49] RoanKattouw: One moment. [23:49:27] RoanKattouw: Now they are. [23:49:31] OK, testing [23:50:19] Looks like it works, good to deploy [23:50:28] Niharika: Please mind the sync order when you deploy these [23:50:46] RoanKattouw: To confirm, `extension.json` patches first? [23:50:50] 1) wmf.20 + wmf.21 patches; 2a) InitialiseSettings.php change in the config patch; 2b) CommonSettings.php [23:50:57] Got it. 
[23:52:49] !log niharika29@deploy1001 Synchronized php-1.34.0-wmf.20/extensions/GrowthExperiments/: Set correct merge strategy for help panel links T231935 (duration: 00m 56s) [23:52:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:52:56] T231935: GrowthExperiments not loading extension.json properly - https://phabricator.wikimedia.org/T231935 [23:53:30] (03CR) 10CRusnov: [C: 03+2] profile::netbox: Reorganize for splitting front and back-end. [puppet] - 10https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291) (owner: 10CRusnov) [23:53:58] !log niharika29@deploy1001 Synchronized php-1.34.0-wmf.21/extensions/GrowthExperiments/: Set correct merge strategy for help panel links T231935 (duration: 00m 55s) [23:54:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:56:04] !log niharika29@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Revert - [bugfix]Growth experiments not loading conf properly T231935 (duration: 00m 55s) [23:56:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:56:20] (03PS2) 10Tim Starling: Add extra key for tstarling [puppet] - 10https://gerrit.wikimedia.org/r/533125 [23:57:00] (03CR) 10Tim Starling: [C: 03+2] Add extra key for tstarling [puppet] - 10https://gerrit.wikimedia.org/r/533125 (owner: 10Tim Starling) [23:57:27] !log niharika29@deploy1001 Synchronized wmf-config/CommonSettings.php: Revert - [bugfix]Growth experiments not loading conf properly T231935 (duration: 00m 55s) [23:57:34] RoanKattouw: All done. [23:57:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:57:38] THanks! [23:58:56] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Kanban, 10netops: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. 
- https://phabricator.wikimedia.org/T225128 (10Cmjohnson) @ottomata All the servers are moved and all of them but cloudvirtan1003 are connected to the swi...