[00:01:56] (CR) Bstorm: "I didn't refactor all this junk on this change. I'm trying to make this a functional change. I can refactor this stuff on another patch." [puppet] - https://gerrit.wikimedia.org/r/532487 (owner: Bstorm)
[00:07:35] (PS3) Bstorm: tools-prometheus: add an allowance for ssh monitoring [puppet] - https://gerrit.wikimedia.org/r/532487
[01:31:49] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:fcgi://127.0.0.1:9000 method=GET https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=
[01:56:55] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[02:11:09] PROBLEM - MariaDB Slave Lag: s3 on db2098 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 941.04 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[02:32:10] Operations, Analytics, Analytics-Kanban, SRE-Access-Requests: Access to HUE for cchen - https://phabricator.wikimedia.org/T231111 (Mathew.onipe) Open→Resolved I'm guessing everyone is happy so I'm going to close this.
[02:35:39] Operations, SRE-Access-Requests: Requesting access to analytics-privatedata-users for mbsantos - https://phabricator.wikimedia.org/T227695 (Mathew.onipe) @MSantos what's the latest on this? Do you want to follow up on Nuria?
[02:37:03] (PS1) CRusnov: Add Netbox instance addresses [dns] - https://gerrit.wikimedia.org/r/532502 (https://phabricator.wikimedia.org/T223291)
[02:37:24] (CR) jerkins-bot: [V: -1] Add Netbox instance addresses [dns] - https://gerrit.wikimedia.org/r/532502 (https://phabricator.wikimedia.org/T223291) (owner: CRusnov)
[02:41:57] (PS2) CRusnov: Add Netbox instance addresses [dns] - https://gerrit.wikimedia.org/r/532502 (https://phabricator.wikimedia.org/T223291)
[02:42:21] (CR) jerkins-bot: [V: -1] Add Netbox instance addresses [dns] - https://gerrit.wikimedia.org/r/532502 (https://phabricator.wikimedia.org/T223291) (owner: CRusnov)
[02:42:39] what's up?
[02:43:42] Operations, CirrusSearch, Discovery-Search (Current work), Patch-For-Review: labweb100[12]: Search backend error during get of .[array] after 0: unknown: No enabled connection - https://phabricator.wikimedia.org/T230994 (Mathew.onipe) a:dcausse
[02:45:00] erf misup
[02:51:32] Operations, Discovery-Search, Elasticsearch: Change partitioning scheme for elasticsearch from RAID to JBOD - https://phabricator.wikimedia.org/T231010 (Mathew.onipe) p:Triage→Normal
[02:59:01] !log rebooting cp5001
[02:59:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:03:19] PROBLEM - Host cp5001 is DOWN: PING CRITICAL - Packet loss = 100%
[03:03:41] RECOVERY - Host cp5001 is UP: PING OK - Packet loss = 0%, RTA = 225.28 ms
[03:14:43] PROBLEM - Widespread puppet agent failures on icinga1001 is CRITICAL: site=eqsin https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[03:15:57] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[03:17:25] RECOVERY -
Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[03:19:25] RECOVERY - Widespread puppet agent failures on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[03:30:05] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[03:33:07] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[03:39:33] (PS3) CRusnov: Add Netbox instance addresses [dns] - https://gerrit.wikimedia.org/r/532502 (https://phabricator.wikimedia.org/T223291)
[03:40:40] Operations, Traffic: TLS handshake issues with ATS 8.0.5-1wm2 - https://phabricator.wikimedia.org/T231262 (Vgutierrez)
[03:40:54] Operations, Traffic: TLS handshake issues with ATS 8.0.5-1wm2 - https://phabricator.wikimedia.org/T231262 (Vgutierrez) p:Triage→Normal
[03:53:44] !log repooling cp5001 - T231262
[03:53:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:53:50] T231262: TLS handshake issues with ATS 8.0.5-1wm2 - https://phabricator.wikimedia.org/T231262
[03:59:13] !log depooling cp5001 - T231262
[03:59:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:59:18] T231262: TLS handshake issues with ATS 8.0.5-1wm2 - https://phabricator.wikimedia.org/T231262
[04:01:14] Operations, Traffic: TLS handshake issues with ATS 8.0.5-1wm2 - https://phabricator.wikimedia.org/T231262 (Vgutierrez) Further testing shows that the issue is apparently not related to OCSP stapling: ` vgutierrez@cp5001:~$ openssl s_client -connect 127.0.0.1:443 < /dev/null
CONNECTED(00000003) write:errn...
[04:02:29] (CR) CRusnov: "Rejiggered it to do generic processing of fields, although now Vulture complains about the processing methods not being called since they " (3 comments) [software/cumin] - https://gerrit.wikimedia.org/r/514840 (https://phabricator.wikimedia.org/T205900) (owner: CRusnov)
[04:02:31] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) is CRITICAL: Test Get page content HTML for test page returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[04:04:05] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[04:05:37] RECOVERY - MariaDB Slave Lag: s3 on db2098 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[04:13:46] (PS11) CRusnov: backends: add Netbox backend [software/cumin] - https://gerrit.wikimedia.org/r/514840 (https://phabricator.wikimedia.org/T205900)
[04:19:53] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/summary/{title} (Get summary for test page) is CRITICAL: Test Get summary for test page returned the unexpected status 504 (exp
[04:19:53] s://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[04:20:14] (CR) jerkins-bot: [V: -1] backends: add Netbox backend [software/cumin] - https://gerrit.wikimedia.org/r/514840 (https://phabricator.wikimedia.org/T205900) (owner: CRusnov)
[04:21:27] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy
https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[04:26:59] (PS1) Vgutierrez: ATS: Allow logging specific debug tags to diags.log [puppet] - https://gerrit.wikimedia.org/r/532508 (https://phabricator.wikimedia.org/T231262)
[04:28:43] (PS2) Vgutierrez: ATS: Allow logging specific debug tags to diags.log [puppet] - https://gerrit.wikimedia.org/r/532508 (https://phabricator.wikimedia.org/T231262)
[04:29:17] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test retrieve the most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/summary/{title} (Get summary for test page) is CRITICAL: Test Get summary for tes
[04:29:17] he unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[04:29:30] (PS3) Vgutierrez: ATS: Allow logging specific debug tags to diags.log [puppet] - https://gerrit.wikimedia.org/r/532508 (https://phabricator.wikimedia.org/T231262)
[04:32:27] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[04:35:43] (PS4) Vgutierrez: ATS: Allow logging specific debug tags to diags.log [puppet] - https://gerrit.wikimedia.org/r/532508 (https://phabricator.wikimedia.org/T231262)
[04:43:35] PROBLEM - Check systemd state on netmon2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:43:59] PROBLEM - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:46:22] Operations, netbox: Setup Swift Storage for Netbox image (was: netbox won't allow me to upload photos of the rack) - https://phabricator.wikimedia.org/T209182 (crusnov) I have confirmed content-type is set correctly, however Swift sets a content-disposition to attachment which causes browser to download....
[04:52:57] (PS5) Vgutierrez: ATS: Allow logging specific debug tags to diags.log [puppet] - https://gerrit.wikimedia.org/r/532508 (https://phabricator.wikimedia.org/T231262)
[04:55:26] (PS1) CRusnov: profile::netbox: Fix swift proxy content-disposition [puppet] - https://gerrit.wikimedia.org/r/532509
[05:08:07] (PS1) Marostegui: Revert "db-codfw.php: Depool pc2009." [mediawiki-config] - https://gerrit.wikimedia.org/r/532511
[05:08:19] (PS2) Marostegui: Revert "db-codfw.php: Depool pc2009." [mediawiki-config] - https://gerrit.wikimedia.org/r/532511
[05:08:21] (CR) Meshvogel: "> The only thing" [puppet] - https://gerrit.wikimedia.org/r/498773 (https://phabricator.wikimedia.org/T123978) (owner: Meshvogel)
[05:08:54] (PS6) Vgutierrez: ATS: Allow logging specific debug tags to diags.log [puppet] - https://gerrit.wikimedia.org/r/532508 (https://phabricator.wikimedia.org/T231262)
[05:10:35] (CR) Marostegui: [C: +2] Revert "db-codfw.php: Depool pc2009." [mediawiki-config] - https://gerrit.wikimedia.org/r/532511 (owner: Marostegui)
[05:11:35] (Merged) jenkins-bot: Revert "db-codfw.php: Depool pc2009." [mediawiki-config] - https://gerrit.wikimedia.org/r/532511 (owner: Marostegui)
[05:11:52] (CR) jenkins-bot: Revert "db-codfw.php: Depool pc2009."
[mediawiki-config] - https://gerrit.wikimedia.org/r/532511 (owner: Marostegui)
[05:12:37] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Repool pc2009 after optimize T210725 (duration: 00m 47s)
[05:12:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:12:43] T210725: Replace parsercache keys to something more meaningful on db-XXXX.php - https://phabricator.wikimedia.org/T210725
[05:16:31] (PS1) Vgutierrez: hiera: Enable ssl.error and ssl-diag logging on cp5001 [puppet] - https://gerrit.wikimedia.org/r/532513 (https://phabricator.wikimedia.org/T231262)
[05:17:33] (PS1) Marostegui: db-eqiad.php: Depool pc1009 [mediawiki-config] - https://gerrit.wikimedia.org/r/532514 (https://phabricator.wikimedia.org/T210725)
[05:18:40] (CR) Marostegui: [C: +2] db-eqiad.php: Depool pc1009 [mediawiki-config] - https://gerrit.wikimedia.org/r/532514 (https://phabricator.wikimedia.org/T210725) (owner: Marostegui)
[05:19:33] (Merged) jenkins-bot: db-eqiad.php: Depool pc1009 [mediawiki-config] - https://gerrit.wikimedia.org/r/532514 (https://phabricator.wikimedia.org/T210725) (owner: Marostegui)
[05:19:48] (CR) jenkins-bot: db-eqiad.php: Depool pc1009 [mediawiki-config] - https://gerrit.wikimedia.org/r/532514 (https://phabricator.wikimedia.org/T210725) (owner: Marostegui)
[05:22:15] Operations, cloud-services-team, netops: Review switches ACL to connect from tools-bastion to dbproxy1019 - https://phabricator.wikimedia.org/T230980 (Marostegui) Resolved→Open Re-opening and it doesn't look like it can connect: ` marostegui@tools-sgebastion-07:~$ telnet dbproxy1019.eqiad.wmn...
[05:24:42] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Depool pc1009 for optimize T210725 (duration: 00m 45s)
[05:24:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:24:47] T210725: Replace parsercache keys to something more meaningful on db-XXXX.php - https://phabricator.wikimedia.org/T210725
[05:28:07] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool pc1009 for optimize T210725 (duration: 00m 45s)
[05:28:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:28:27] !log Optimize pc1009 - T210725
[05:28:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:31:17] (PS7) Vgutierrez: ATS: Allow logging specific debug tags to diags.log for localhost requests [puppet] - https://gerrit.wikimedia.org/r/532508 (https://phabricator.wikimedia.org/T231262)
[05:31:19] (PS2) Vgutierrez: hiera: Enable ssl debugging for ATS on cp5001 [puppet] - https://gerrit.wikimedia.org/r/532513 (https://phabricator.wikimedia.org/T231262)
[05:36:37] Operations, cloud-services-team, netops: Review switches ACL to connect from tools-bastion to dbproxy1019 - https://phabricator.wikimedia.org/T230980 (ayounsi) `lang=diff [edit firewall family inet filter cloud-in4 term labsdb from destination-address] 10.64.37.14/31 { ... } + 10.64.3...
[05:36:49] !log update cloud acls on cr1/2-eqiad - T230980
[05:36:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:36:54] T230980: Review switches ACL to connect from tools-bastion to dbproxy1019 - https://phabricator.wikimedia.org/T230980
[05:37:07] RECOVERY - Check systemd state on netmon1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:38:21] RECOVERY - Check systemd state on netmon2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:38:42] Operations, cloud-services-team, netops: Review switches ACL to connect from tools-bastion to dbproxy1019 - https://phabricator.wikimedia.org/T230980 (Marostegui) Open→Resolved And now it works! ` marostegui@tools-sgebastion-07:~$ telnet dbproxy1019.eqiad.wmnet 3306 Trying 10.64.37.28... Conn...
[05:47:14] Operations, ops-codfw: Degraded RAID on db2051 - https://phabricator.wikimedia.org/T231267 (ops-monitoring-bot)
[05:48:29] (CR) Volans: [C: +1] "LGTM, I'll just mention why we do that" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/532509 (owner: CRusnov)
[05:50:50] Operations, ops-codfw: Degraded RAID on db2051 - https://phabricator.wikimedia.org/T231267 (Marostegui) Open→Declined No need to replace this host - it is waiting to be decommissioned by #dc-ops {T230778}
[05:51:07] Operations, ops-codfw, DC-Ops, decommission: Decommission db2051.codfw.wmnet - https://phabricator.wikimedia.org/T230778 (Marostegui)
[05:57:40] (PS1) Marostegui: mariadb: Promote db2129 as codfw master for s6 [puppet] - https://gerrit.wikimedia.org/r/532518 (https://phabricator.wikimedia.org/T230106)
[06:11:54] <_joe_> !log updating reprepro sources for jessie-wikimedia
[06:11:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:13:39] (PS1) Vgutierrez: Release
8.0.5-1wm3 [debs/trafficserver] - https://gerrit.wikimedia.org/r/532520
[06:18:33] (CR) Giuseppe Lavagetto: [C: +2] Remove the service object for the default schema [software/conftool] - https://gerrit.wikimedia.org/r/527564 (owner: Giuseppe Lavagetto)
[06:21:13] (Merged) jenkins-bot: Remove the service object for the default schema [software/conftool] - https://gerrit.wikimedia.org/r/527564 (owner: Giuseppe Lavagetto)
[06:32:26] (CR) Volans: "A couple of questions and some nits inline" (8 comments) [dns] - https://gerrit.wikimedia.org/r/532502 (https://phabricator.wikimedia.org/T223291) (owner: CRusnov)
[06:33:35] PROBLEM - Widespread puppet agent failures- no resources reported on icinga1001 is CRITICAL: site=eqsin https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[06:34:37] (CR) KartikMistry: "Thanks for the patch. Schedule at: https://wikitech.wikimedia.org/w/index.php?title=Deployments&action=submit#Wednesday,_August_28" [mediawiki-config] - https://gerrit.wikimedia.org/r/509739 (https://phabricator.wikimedia.org/T220752) (owner: Vladis13)
[06:44:38] (CR) Volans: "Did a first pass, some general comment inline." (5 comments) [puppet] - https://gerrit.wikimedia.org/r/531943 (https://phabricator.wikimedia.org/T226810) (owner: Ayounsi)
[07:03:13] RECOVERY - Widespread puppet agent failures- no resources reported on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[07:08:53] (PS2) Giuseppe Lavagetto: Scandium: Add the protocol to the rt-client config [puppet] - https://gerrit.wikimedia.org/r/532331 (https://phabricator.wikimedia.org/T230166) (owner: Mobrovac)
[07:09:51] (CR) Giuseppe Lavagetto: [C: +2] Scandium: Add the protocol to the rt-client config [puppet] - https://gerrit.wikimedia.org/r/532331 (https://phabricator.wikimedia.org/T230166) (owner: Mobrovac)
[07:11:47] Operations, Traffic, Patch-For-Review: TLS handshake issues with ATS 8.0.5-1wm2 - https://phabricator.wikimedia.org/T231262 (Vgutierrez) from a tcpdump capture, it looks like ATS is actually dropping connections: ` 1007 153.027859 127.0.0.1 → 127.0.0.1 TCP 74 60211 → 443 [SYN] Seq=0 Win=43690...
[07:13:27] (PS1) Marostegui: db-codfw.php: Promote db2129 to s6 codfw master [mediawiki-config] - https://gerrit.wikimedia.org/r/532543 (https://phabricator.wikimedia.org/T230106)
[07:14:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db2129 weight to 0 before promoting it to codfw s6 master T230106', diff saved to https://phabricator.wikimedia.org/P8980 and previous config saved to /var/cache/conftool/dbconfig/20190827-071456-marostegui.json
[07:15:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:15:02] T230106: Switchover codfw primary database masters to new hosts - https://phabricator.wikimedia.org/T230106
[07:16:37] !log Switchover codfw s6 master from db2046 to db2129 T230106
[07:16:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:48] (CR) Marostegui: [C: +2] mariadb: Promote db2129 as codfw master for s6 [puppet] - https://gerrit.wikimedia.org/r/532518 (https://phabricator.wikimedia.org/T230106) (owner: Marostegui)
[07:18:57] (PS2) Marostegui: mariadb: Promote db2129 as codfw master for s6 [puppet] -
https://gerrit.wikimedia.org/r/532518 (https://phabricator.wikimedia.org/T230106)
[07:19:03] PROBLEM - BGP status on cr2-eqord is CRITICAL: BGP CRITICAL - AS2914/IPv4: Active, AS2914/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:25:57] !log marostegui@cumin1001 dbctl commit (dc=codfw): 'Promote db2129 to codfw s6 master T230106', diff saved to https://phabricator.wikimedia.org/P8981 and previous config saved to /var/cache/conftool/dbconfig/20190827-072556-marostegui.json
[07:26:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:26:15] T230106: Switchover codfw primary database masters to new hosts - https://phabricator.wikimedia.org/T230106
[07:26:42] (CR) Marostegui: [C: +2] db-codfw.php: Promote db2129 to s6 codfw master [mediawiki-config] - https://gerrit.wikimedia.org/r/532543 (https://phabricator.wikimedia.org/T230106) (owner: Marostegui)
[07:26:51] (CR) Volans: [C: -1] "Nothing major, some question, comment and nitpick inline."
(20 comments) [software/cumin] - https://gerrit.wikimedia.org/r/514840 (https://phabricator.wikimedia.org/T205900) (owner: CRusnov)
[07:27:40] (Merged) jenkins-bot: db-codfw.php: Promote db2129 to s6 codfw master [mediawiki-config] - https://gerrit.wikimedia.org/r/532543 (https://phabricator.wikimedia.org/T230106) (owner: Marostegui)
[07:27:56] (CR) jenkins-bot: db-codfw.php: Promote db2129 to s6 codfw master [mediawiki-config] - https://gerrit.wikimedia.org/r/532543 (https://phabricator.wikimedia.org/T230106) (owner: Marostegui)
[07:28:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2046, this host will be decommissioned T230106', diff saved to https://phabricator.wikimedia.org/P8982 and previous config saved to /var/cache/conftool/dbconfig/20190827-072847-marostegui.json
[07:28:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:28:57] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Promote db2129 as s6 codfw master T230106 (duration: 00m 46s)
[07:29:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:30] <_joe_> still promoting masters manually?
[07:29:33] <_joe_> as in with a deploy?
[07:30:09] _joe_: Until we get rid of the php file, we are still trying to get it in sync, at least for masters, as we are decommissioning lots of hosts
[07:30:13] <_joe_> or just keeping the file in sync?
[07:30:16] <_joe_> meh :P
[07:30:22] <_joe_> ok thanks
[07:30:29] PROBLEM - PHP opcache health on mwdebug2002 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[07:32:07] (PS1) Marostegui: db2046: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/532545 (https://phabricator.wikimedia.org/T228258)
[07:32:51] (CR) Marostegui: [C: +2] db2046: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/532545 (https://phabricator.wikimedia.org/T228258) (owner: Marostegui)
[07:37:00] Operations, Traffic, Patch-For-Review: TLS handshake issues with ATS 8.0.5-1wm2 - https://phabricator.wikimedia.org/T231262 (Vgutierrez) Further analysis of ats-tls metrics shows that connections were actually being dropped without being logged: ` vgutierrez@cp5001:~$ sudo -i traffic_ctl --run-root=/...
[07:38:51] (Abandoned) Vgutierrez: hiera: Enable ssl debugging for ATS on cp5001 [puppet] - https://gerrit.wikimedia.org/r/532513 (https://phabricator.wikimedia.org/T231262) (owner: Vgutierrez)
[07:39:05] (Abandoned) Vgutierrez: ATS: Allow logging specific debug tags to diags.log for localhost requests [puppet] - https://gerrit.wikimedia.org/r/532508 (https://phabricator.wikimedia.org/T231262) (owner: Vgutierrez)
[07:41:04] (PS1) Marostegui: db-codfw.php: Re-organize s6 codfw [mediawiki-config] - https://gerrit.wikimedia.org/r/532548 (https://phabricator.wikimedia.org/T230106)
[07:47:15] (PS2) Dzahn: Add services_proxy to wikitech [puppet] - https://gerrit.wikimedia.org/r/531686 (https://phabricator.wikimedia.org/T230994) (owner: DCausse)
[07:48:02] !log marostegui@cumin1001 dbctl commit (dc=codfw): 'Reorganize s6 codfw weights and roles T230106', diff saved to https://phabricator.wikimedia.org/P8983 and previous config saved to /var/cache/conftool/dbconfig/20190827-074802-marostegui.json
[07:48:09] (CR) Marostegui: [C: +2]
db-codfw.php: Re-organize s6 codfw [mediawiki-config] - https://gerrit.wikimedia.org/r/532548 (https://phabricator.wikimedia.org/T230106) (owner: Marostegui)
[07:48:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:10] T230106: Switchover codfw primary database masters to new hosts - https://phabricator.wikimedia.org/T230106
[07:48:57] (CR) Dzahn: [C: +2] Add services_proxy to wikitech [puppet] - https://gerrit.wikimedia.org/r/531686 (https://phabricator.wikimedia.org/T230994) (owner: DCausse)
[07:49:21] PROBLEM - OSPF status on cr2-eqord is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:49:23] (Merged) jenkins-bot: db-codfw.php: Re-organize s6 codfw [mediawiki-config] - https://gerrit.wikimedia.org/r/532548 (https://phabricator.wikimedia.org/T230106) (owner: Marostegui)
[07:49:23] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 54, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:49:38] (CR) jenkins-bot: db-codfw.php: Re-organize s6 codfw [mediawiki-config] - https://gerrit.wikimedia.org/r/532548 (https://phabricator.wikimedia.org/T230106) (owner: Marostegui)
[07:49:46] <_joe_> marostegui: are the uncommitted dbctl changes normal/expected?
[07:49:49] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:50:22] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Reorganize s6 codfw weights and roles T230106 (duration: 00m 44s)
[07:50:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:56] _joe_: I was doing a bit change, so it took me a while to commit
[07:53:05] but everything is committed now
[07:53:07] <_joe_> ok so expected :)
[07:53:21] yep
[07:53:31] maybe we should increase the alert trigger
[07:53:49] <_joe_> my point
[07:54:04] <_joe_> let's discuss this with chris once he's awake too
[07:54:43] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 56, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:55:09] PROBLEM - PHP opcache health on mwdebug2001 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[07:56:25] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:57:07] RECOVERY - OSPF status on cr2-eqord is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:02:12] (PS1) Vgutierrez: ATS: Allow configure connections_throttle [puppet] - https://gerrit.wikimedia.org/r/532555 (https://phabricator.wikimedia.org/T231262)
[08:04:01] (CR) Dzahn: "nginx now runs on labweb1001/1002.
it listens on high ports on localhost" [puppet] - https://gerrit.wikimedia.org/r/531686 (https://phabricator.wikimedia.org/T230994) (owner: DCausse)
[08:05:34] (CR) Ema: [C: +1] ATS: Allow configure connections_throttle (1 comment) [puppet] - https://gerrit.wikimedia.org/r/532555 (https://phabricator.wikimedia.org/T231262) (owner: Vgutierrez)
[08:06:40] (PS2) Vgutierrez: ATS: Allow configure connections_throttle [puppet] - https://gerrit.wikimedia.org/r/532555 (https://phabricator.wikimedia.org/T231262)
[08:07:42] (CR) jerkins-bot: [V: -1] ATS: Allow configure connections_throttle [puppet] - https://gerrit.wikimedia.org/r/532555 (https://phabricator.wikimedia.org/T231262) (owner: Vgutierrez)
[08:09:39] RECOVERY - PHP opcache health on mwdebug2001 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[08:10:06] (PS3) Vgutierrez: ATS: Allow configure connections_throttle [puppet] - https://gerrit.wikimedia.org/r/532555 (https://phabricator.wikimedia.org/T231262)
[08:10:41] RECOVERY - PHP opcache health on mwdebug2002 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[08:15:01] (CR) Vgutierrez: [C: +1] cache: reimage cp1075 as text_ats [puppet] - https://gerrit.wikimedia.org/r/531896 (https://phabricator.wikimedia.org/T227432) (owner: Ema)
[08:15:27] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[08:16:35] (PS4) Vgutierrez: ATS: Allow configure connections_throttle [puppet] -
https://gerrit.wikimedia.org/r/532555 (https://phabricator.wikimedia.org/T231262)
[08:18:33] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[08:18:39] !log depool cp1075 and reimage as text_ats T228629
[08:18:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:17] T228629: ATS Backends: Test live cache_text traffic - https://phabricator.wikimedia.org/T228629
[08:21:23] RECOVERY - BGP status on cr2-eqord is OK: BGP OK - up: 86, down: 0, shutdown: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:21:53] (PS3) Ema: cache: reimage cp1075 as text_ats [puppet] - https://gerrit.wikimedia.org/r/531896 (https://phabricator.wikimedia.org/T227432)
[08:22:25] (CR) Dzahn: [C: +1] "yea, this is just ::profile::backup::storage which is also in role::backup and that already has IPv6 enabled the same way, lgtm" [puppet] - https://gerrit.wikimedia.org/r/531233 (https://phabricator.wikimedia.org/T102099) (owner: Jbond)
[08:23:04] (PS5) Vgutierrez: ATS: Allow configure connections_throttle [puppet] - https://gerrit.wikimedia.org/r/532555 (https://phabricator.wikimedia.org/T231262)
[08:23:16] Operations, Icinga, observability: Have a link to the alert in the icinga alert email - https://phabricator.wikimedia.org/T231274 (mobrovac)
[08:23:26] (CR) Ema: [C: +2] cache: reimage cp1075 as text_ats [puppet] - https://gerrit.wikimedia.org/r/531896 (https://phabricator.wikimedia.org/T227432) (owner: Ema)
[08:23:53] (CR) Daimona Eaytoy: [C: -1] "> > The only thing" [puppet] -
https://gerrit.wikimedia.org/r/498773 (https://phabricator.wikimedia.org/T123978) (owner: Meshvogel)
[08:24:31] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:24:41] PROBLEM - OSPF status on cr2-eqord is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:24:43] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 54, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:24:51] (CR) Dzahn: "but it won't be in 2.15.14 which we are running?" [puppet] - https://gerrit.wikimedia.org/r/532391 (owner: Paladox)
[08:25:06] * volans checking maintenance calendar
[08:25:19] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:26:21] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[08:28:13] (PS6) Vgutierrez: ATS: Allow configuring connections_throttle [puppet] - https://gerrit.wikimedia.org/r/532555 (https://phabricator.wikimedia.org/T231262)
[08:28:49] Operations, Traffic, Goal, Patch-For-Review: ATS Backends: Test live cache_text traffic - https://phabricator.wikimedia.org/T228629 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp1075.eqiad.wmnet'] ` The log can be found in `/var/log/wmf...
[08:29:25] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 56, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:29:33] (03CR) 10Vgutierrez: [C: 03+2] ATS: Allow configuring connections_throttle [puppet] - 10https://gerrit.wikimedia.org/r/532555 (https://phabricator.wikimedia.org/T231262) (owner: 10Vgutierrez) [08:29:49] (03PS7) 10Vgutierrez: ATS: Allow configuring connections_throttle [puppet] - 10https://gerrit.wikimedia.org/r/532555 (https://phabricator.wikimedia.org/T231262) [08:30:01] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:30:47] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:30:57] RECOVERY - OSPF status on cr2-eqord is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:32:21] PROBLEM - BGP status on cr2-eqord is CRITICAL: BGP CRITICAL - AS2914/IPv6: Active, AS2914/IPv4: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:35:35] (03CR) 10Dzahn: [C: 03+2] "oops, thanks. 
that was an oversight indeed" [puppet] - 10https://gerrit.wikimedia.org/r/529470 (owner: 10Arlolra) [08:35:51] (03PS3) 10Dzahn: parsoid::testing: remove more parameter use_parsoid_php [puppet] - 10https://gerrit.wikimedia.org/r/529470 (owner: 10Arlolra) [08:36:14] * volans opening a task and contacting provider for the above [08:36:37] !log repooling cp5001 - T231262 [08:36:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:44] T231262: TLS handshake issues with ATS 8.0.5-1wm2 - https://phabricator.wikimedia.org/T231262 [08:37:44] (03PS1) 10Ema: cache: convert cp1075 to text_ats (hiera/conftool) [puppet] - 10https://gerrit.wikimedia.org/r/532561 (https://phabricator.wikimedia.org/T227432) [08:38:59] (03CR) 10Dzahn: "outdated or still holding?" [puppet] - 10https://gerrit.wikimedia.org/r/528433 (owner: 10Paladox) [08:39:08] (03CR) 10Vgutierrez: [C: 03+1] cache: convert cp1075 to text_ats (hiera/conftool) [puppet] - 10https://gerrit.wikimedia.org/r/532561 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [08:39:34] (03Abandoned) 10Alaa Sarhan: Use global $wgThumbLimits as default for repo and client [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528963 (owner: 10Alaa Sarhan) [08:40:32] (03Abandoned) 10Alaa Sarhan: Switch Property Terms migration to WRITE_NEW on test wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519211 (https://phabricator.wikimedia.org/T225053) (owner: 10Alaa Sarhan) [08:40:51] (03CR) 10Ema: [C: 03+2] cache: convert cp1075 to text_ats (hiera/conftool) [puppet] - 10https://gerrit.wikimedia.org/r/532561 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [08:41:57] PROBLEM - Check the Netbox report-s- puppetdb for fail status. 
on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [08:44:15] !log ema@cumin1001 START - Cookbook sre.hosts.downtime [08:44:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:52] (03CR) 10Ema: [C: 03+1] Release 8.0.5-1wm3 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/532520 (owner: 10Vgutierrez) [08:45:44] (03CR) 10Vgutierrez: [C: 03+2] Release 8.0.5-1wm3 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/532520 (owner: 10Vgutierrez) [08:46:12] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [08:46:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:44] PROBLEM - HHVM rendering on mw1274 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:49:20] PROBLEM - Apache HTTP on mw1274 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:49:46] RECOVERY - HHVM rendering on mw1274 is OK: HTTP OK: HTTP/1.1 200 OK - 75951 bytes in 0.590 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:50:26] RECOVERY - Apache HTTP on mw1274 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:53:37] 10Operations, 10Traffic, 10Patch-For-Review: TLS handshake issues with ATS 8.0.5-1wm2 - https://phabricator.wikimedia.org/T231262 (10Vgutierrez) 05Open→03Resolved After disable the connection throttling, cp5001 behaves as expected and no longer drops connections: ` vgutierrez@cp5001:~$ sudo -i traffic_ct... 
[08:53:41] 10Operations, 10Traffic, 10Patch-For-Review: Evaluate ATS TLS stack - https://phabricator.wikimedia.org/T220383 (10Vgutierrez) [08:55:43] (03CR) 10Dzahn: [C: 03+1] "i could confirm, on mwdebug1002, that fonts-noto-* packages are installed and running fc-list shows for example "Noto Sans CJK JP,Noto San" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528290 (owner: 10Viztor) [08:57:36] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review: ATS Backends: Test live cache_text traffic - https://phabricator.wikimedia.org/T228629 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp1075.eqiad.wmnet'] ` and were **ALL** successful. [09:00:55] (03PS1) 10Marostegui: mariadb: Productionize dbproxy1016 [puppet] - 10https://gerrit.wikimedia.org/r/532562 (https://phabricator.wikimedia.org/T202367) [09:02:20] 10Operations, 10netops: NTT Transit link flapping, now BGP session down - https://phabricator.wikimedia.org/T231278 (10Volans) p:05Triage→03High [09:02:28] (03PS2) 10Marostegui: mariadb: Productionize dbproxy1016 [puppet] - 10https://gerrit.wikimedia.org/r/532562 (https://phabricator.wikimedia.org/T202367) [09:03:14] PROBLEM - Check the Netbox report-s- librenms for fail status. on netmon1002 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [09:08:18] RECOVERY - Check the Netbox report-s- puppetdb for fail status. 
on netmon1002 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [09:09:45] (03PS1) 10Ema: cache: ATS storage configuration for cp1075 [puppet] - 10https://gerrit.wikimedia.org/r/532644 (https://phabricator.wikimedia.org/T227432) [09:09:56] !log mobrovac@deploy1001 Started deploy [cpjobqueue/deploy@c2bc1a3]: Increase cirrusSearchLinksUpdate concurrency to 150 - T231194 [09:10:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:01] T231194: Increase concurrency of the cirrusCheckerJob - https://phabricator.wikimedia.org/T231194 [09:11:05] !log mobrovac@deploy1001 Finished deploy [cpjobqueue/deploy@c2bc1a3]: Increase cirrusSearchLinksUpdate concurrency to 150 - T231194 (duration: 01m 09s) [09:11:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:47] (03CR) 10Ema: [C: 03+2] cache: ATS storage configuration for cp1075 [puppet] - 10https://gerrit.wikimedia.org/r/532644 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [09:15:36] (03PS1) 10Vgutierrez: ATS: make sure that the systemd service is enabled [puppet] - 10https://gerrit.wikimedia.org/r/532652 [09:16:06] (03CR) 10jerkins-bot: [V: 04-1] ATS: make sure that the systemd service is enabled [puppet] - 10https://gerrit.wikimedia.org/r/532652 (owner: 10Vgutierrez) [09:17:10] 10Operations, 10netops: NTT Transit link flapping, now BGP session down - https://phabricator.wikimedia.org/T231278 (10Volans) It seems that the session is misconfigured on their side: ` Aug 27 09:10:38 cr2-eqord rpd[13953]: bgp_process_open:4072: NOTIFICATION sent to 2001:418:0:5000::a34 (External AS 2914):... [09:17:58] (03PS2) 10Vgutierrez: ATS: make sure that the systemd service is enabled [puppet] - 10https://gerrit.wikimedia.org/r/532652 [09:20:22] cp1072 being worked on? 
[09:20:26] !log uploaded trafficserver-8.0.5-1wm3 to apt.wikimedia.org (stretch) - T221594 [09:20:28] mgmt is down and host changed key [09:20:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:32] T221594: Puppetize ATS TLS configuration for incoming traffic - https://phabricator.wikimedia.org/T221594 [09:20:51] mutante: not as far as I know, I'm working on cp1075 [09:21:15] !log upgrading trafficserver to version 8.0.5-1wm3 on cp5001 - T221594 [09:21:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:30] !log Remove grants for dbproxy1004 and dbproxy1009 from m4 hosts (db1107 and db1108) - T231280 [09:21:34] ema: ok, i'm looking [09:21:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:36] T231280: Remove grants for the old dbproxy hosts from the misc databases - https://phabricator.wikimedia.org/T231280 [09:21:50] mutante: thanks. It seems to be a spare [09:22:24] ema: oh, yea, there is a ticket for the decom. just needs more ACKing. not worth it [09:24:07] ACKNOWLEDGEMENT - SSH cp1072.mgmt on cp1072.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn https://phabricator.wikimedia.org/T229586 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:25:09] 10Operations, 10Traffic, 10netops, 10IPv6, 10Patch-For-Review: Fix IPv6 autoconf issues once and for all, across the fleet. - https://phabricator.wikimedia.org/T102099 (10jbond) >>! In T102099#5433561, @jcrespo wrote: > Hi, I am bit disconnected about the planning of deployment of this- Once all hosts (o... 
[09:26:16] RECOVERY - BGP status on cr2-eqord is OK: BGP OK - up: 88, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:26:51] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/531752 (owner: 10Ayounsi) [09:28:01] (03PS2) 10Jbond: mariadb::parsercache - eqiad: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531263 (https://phabricator.wikimedia.org/T102099) [09:28:36] (03CR) 10Jbond: [C: 03+2] mariadb::parsercache - eqiad: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531263 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [09:29:49] (03PS1) 10Dzahn: parsoid::testing: remove remnants for parsoid switch from Hiera [puppet] - 10https://gerrit.wikimedia.org/r/532656 [09:29:51] (03PS1) 10Dzahn: ganeti/icinga: allow 3 ganeti-noded processes before alerting [puppet] - 10https://gerrit.wikimedia.org/r/532657 [09:29:58] (03PS2) 10Jbond: mariadb::core_multiinstance - eqiad: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531173 (https://phabricator.wikimedia.org/T102099) [09:30:30] (03PS2) 10Dzahn: parsoid::testing: remove remnants for parsoid switch from Hiera [puppet] - 10https://gerrit.wikimedia.org/r/532656 [09:30:52] (03Abandoned) 10Dzahn: parsoid::testing: remove remnants for parsoid switch from Hiera [puppet] - 10https://gerrit.wikimedia.org/r/532656 (owner: 10Dzahn) [09:30:57] (03CR) 10Jbond: [C: 03+2] mariadb::core_multiinstance - eqiad: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531173 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [09:31:03] (03PS2) 10Dzahn: ganeti/icinga: allow 3 ganeti-noded processes before alerting [puppet] - 10https://gerrit.wikimedia.org/r/532657 [09:32:48] 10Operations, 10Patch-For-Review, 10cloud-services-team (Kanban): rack/setup/install cloudcontrol2001-dev & cloudvirt200[123]-dev - https://phabricator.wikimedia.org/T214448 (10Dzahn) FYI: "CRITICAL - 
degraded: The system is operational but one or more units failed." "CRITICAL: Status of the systemd unit gl... [09:36:05] 10Operations, 10Icinga, 10observability: Have a link to the alert in the icinga alert email - https://phabricator.wikimedia.org/T231274 (10fgiunchedi) Reporting from irc, the list of macros we can use in expansions is https://icinga.com/docs/icinga1/latest/en/macrolist.html and a link to the service page loo... [09:37:20] (03PS1) 10Giuseppe Lavagetto: aqs: convert to profile::lvs::realserver [puppet] - 10https://gerrit.wikimedia.org/r/532660 [09:37:22] (03PS3) 10Jbond: backup::offsite: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531233 (https://phabricator.wikimedia.org/T102099) [09:37:24] (03PS1) 10Giuseppe Lavagetto: k8s::master: switch to profile::lvs::realserver [puppet] - 10https://gerrit.wikimedia.org/r/532661 [09:37:26] (03PS1) 10Giuseppe Lavagetto: k8s::worker: switch to profile::lvs::realserver [puppet] - 10https://gerrit.wikimedia.org/r/532662 [09:37:28] (03PS1) 10Giuseppe Lavagetto: elasticsearch::cirrus: convert to profile::lvs::realserver [puppet] - 10https://gerrit.wikimedia.org/r/532663 [09:37:30] (03PS1) 10Giuseppe Lavagetto: Removing hiera file for role::eventbus::eventbus, unused [puppet] - 10https://gerrit.wikimedia.org/r/532664 [09:37:32] (03PS1) 10Giuseppe Lavagetto: maps: convert to profile::lvs::realserver [puppet] - 10https://gerrit.wikimedia.org/r/532665 [09:38:26] 10Operations, 10Traffic: Evaluate ATS TLS stack - https://phabricator.wikimedia.org/T220383 (10Vgutierrez) 05Open→03Resolved [09:38:32] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review, 10Performance-Team (Radar): Support TLSv1.3 - https://phabricator.wikimedia.org/T170567 (10Vgutierrez) [09:38:50] 10Operations, 10Traffic: Evaluate ATS TLS stack - https://phabricator.wikimedia.org/T220383 (10Vgutierrez) [09:38:53] 10Operations, 10Traffic: ATS lacks the possibility of reporting SSL stats to an origin server via HTTP Headers - 
https://phabricator.wikimedia.org/T228135 (10Vgutierrez) 05Open→03Resolved [09:39:02] !log Deploy grants for dbproxy1016 on m3 - T202367 [09:39:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:07] T202367: Productionize dbproxy101[2-7].eqiad.wmnet and dbproxy200[1-4] - https://phabricator.wikimedia.org/T202367 [09:40:03] (03PS1) 10Giuseppe Lavagetto: wdqs: convert to profile::lvs::realserver [puppet] - 10https://gerrit.wikimedia.org/r/532666 [09:40:05] (03PS1) 10Giuseppe Lavagetto: restbase: convert to use profile::lvs::realserver [puppet] - 10https://gerrit.wikimedia.org/r/532667 [09:40:07] (03PS1) 10Giuseppe Lavagetto: scb: convert to profile::lvs::realserver [puppet] - 10https://gerrit.wikimedia.org/r/532668 [09:40:09] (03PS1) 10Giuseppe Lavagetto: ores: convert to profile::lvs::realserver [puppet] - 10https://gerrit.wikimedia.org/r/532669 [09:40:11] (03PS1) 10Giuseppe Lavagetto: proton: convert to profile::lvs::realserver [puppet] - 10https://gerrit.wikimedia.org/r/532670 [09:40:13] (03PS1) 10Giuseppe Lavagetto: openldap,labweb: convert to profile::lvs::realserver [puppet] - 10https://gerrit.wikimedia.org/r/532671 [09:40:15] (03PS1) 10Giuseppe Lavagetto: profile::lvs::realserver: remove absenting of old restart script [puppet] - 10https://gerrit.wikimedia.org/r/532672 [09:40:17] (03PS1) 10Giuseppe Lavagetto: role::lvs::realserver: remove from puppet [puppet] - 10https://gerrit.wikimedia.org/r/532673 [09:41:49] 10Operations, 10serviceops, 10Patch-For-Review: upgrade and rename krypton & create its codfw equivalent - https://phabricator.wikimedia.org/T224247 (10Dzahn) p:05Normal→03High [09:42:29] (03CR) 10Marostegui: "Grants deployed: https://phabricator.wikimedia.org/T202367#5440592" [puppet] - 10https://gerrit.wikimedia.org/r/532562 (https://phabricator.wikimedia.org/T202367) (owner: 10Marostegui) [09:42:37] (03PS3) 10Marostegui: mariadb: Productionize dbproxy1016 [puppet] - 10https://gerrit.wikimedia.org/r/532562 
(https://phabricator.wikimedia.org/T202367) [09:42:45] (03CR) 10Jbond: [C: 03+2] backup::offsite: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531233 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [09:43:27] (03PS1) 10DannyS712: Set `$wgRelatedArticlesDescriptionSource` to `wikidata` in config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/532675 (https://phabricator.wikimedia.org/T231279) [09:44:21] 10Operations, 10Traffic: Track TLS related ATS metrics in prometheus - https://phabricator.wikimedia.org/T231286 (10Vgutierrez) [09:44:46] (03PS4) 10Marostegui: mariadb: Productionize dbproxy1016 [puppet] - 10https://gerrit.wikimedia.org/r/532562 (https://phabricator.wikimedia.org/T202367) [09:44:48] (03PS2) 10DannyS712: Set `$wgRelatedArticlesDescriptionSource` to `wikidata` in config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/532675 (https://phabricator.wikimedia.org/T231279) [09:45:16] (03CR) 10Filippo Giunchedi: [C: 03+1] "Please take a look, I think we're ok to remove the per-host CPU alerts now." [puppet] - 10https://gerrit.wikimedia.org/r/531142 (https://phabricator.wikimedia.org/T230396) (owner: 10Filippo Giunchedi) [09:45:58] 10Operations, 10netops: NTT Transit link flapping, now BGP session down - https://phabricator.wikimedia.org/T231278 (10Volans) 05Open→03Resolved a:03Volans It was a maintenance, tracked with GIN-1-2116159603, that was not present to the calendar because sent to noc@ and not the maint announce ML. We need... 
[09:48:12] 10Operations, 10Traffic: Investigate HTTP/2 limits on trafficserver - https://phabricator.wikimedia.org/T231287 (10Vgutierrez) [09:50:20] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [140.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [09:51:41] (03CR) 10Martineznovo: [C: 03+1] "Thanks for taking care of this. I'm not at home and downloading this repo was taking a long time on mobile" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/532675 (https://phabricator.wikimedia.org/T231279) (owner: 10DannyS712) [09:53:25] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize dbproxy1016 [puppet] - 10https://gerrit.wikimedia.org/r/532562 (https://phabricator.wikimedia.org/T202367) (owner: 10Marostegui) [09:59:30] (03PS2) 10Dzahn: add IP for miscweb1001 [dns] - 10https://gerrit.wikimedia.org/r/512446 (https://phabricator.wikimedia.org/T224247) [09:59:58] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/18048/aqs1004.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/532660 (owner: 10Giuseppe Lavagetto) [10:00:16] (03PS2) 10Giuseppe Lavagetto: aqs: convert to profile::lvs::realserver [puppet] - 10https://gerrit.wikimedia.org/r/532660 [10:06:57] 10Operations, 10Commons, 10MediaWiki-File-management, 10Traffic, 10media-storage: upload LB: retry swift 404s cross-cluster - https://phabricator.wikimedia.org/T231108 (10ema) p:05Triage→03Normal [10:10:35] (03PS1) 10Vgutierrez: ATS: Fix Client-IP on TLS log format [puppet] - 10https://gerrit.wikimedia.org/r/532681 [10:11:53] (03PS3) 10Dzahn: add IP for miscweb1001 [dns] - 10https://gerrit.wikimedia.org/r/512446 (https://phabricator.wikimedia.org/T224247) [10:12:17] !log cirrus: reindexing lost updates since 2019-08-12T10:00:00Z for wikitech (T230994) [10:12:28] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:28] T230994: labweb100[12]: Search backend error during get of .[array] after 0: unknown: No enabled connection - https://phabricator.wikimedia.org/T230994 [10:13:28] (03CR) 10Dzahn: [C: 03+2] add IP for miscweb1001 [dns] - 10https://gerrit.wikimedia.org/r/512446 (https://phabricator.wikimedia.org/T224247) (owner: 10Dzahn) [10:15:57] 10Operations, 10Traffic, 10netops, 10IPv6, 10Patch-For-Review: Fix IPv6 autoconf issues once and for all, across the fleet. - https://phabricator.wikimedia.org/T102099 (10jcrespo) > Sorry for the lack of clarity, once all servers have the mapped ipv6 address i plan to move this to the base profile with s... [10:17:17] (03PS2) 10Vgutierrez: ATS: Fix Client-IP on TLS log format [puppet] - 10https://gerrit.wikimedia.org/r/532681 [10:17:24] (03PS1) 10Dzahn: site: add miscweb role to miscweb1001 [puppet] - 10https://gerrit.wikimedia.org/r/532683 (https://phabricator.wikimedia.org/T224247) [10:18:01] 10Operations, 10serviceops, 10Patch-For-Review: upgrade and rename krypton & create its codfw equivalent - https://phabricator.wikimedia.org/T224247 (10Dzahn) [10:18:17] (03CR) 10Ema: [C: 03+1] ATS: Fix Client-IP on TLS log format [puppet] - 10https://gerrit.wikimedia.org/r/532681 (owner: 10Vgutierrez) [10:18:52] (03CR) 10Vgutierrez: [C: 03+2] ATS: Fix Client-IP on TLS log format [puppet] - 10https://gerrit.wikimedia.org/r/532681 (owner: 10Vgutierrez) [10:19:10] (03PS3) 10Vgutierrez: ATS: Fix Client-IP on TLS log format [puppet] - 10https://gerrit.wikimedia.org/r/532681 [10:20:25] 10Operations, 10DBA: Remove sarin and neodymium GRANTs from all the databases - https://phabricator.wikimedia.org/T229796 (10Marostegui) a:03Marostegui I am going to start removing `sarin` grants first [10:21:42] (03PS1) 10Mforns: analytics::refinery::job::data_purge.pp: fix geoeditors retention period [puppet] - 10https://gerrit.wikimedia.org/r/532684 (https://phabricator.wikimedia.org/T231017) [10:25:03] 
(03PS2) 10Mforns: analytics::refinery::job::data_purge.pp: fix geoeditors retention period [puppet] - 10https://gerrit.wikimedia.org/r/532684 (https://phabricator.wikimedia.org/T231017) [10:25:05] !log ganeti eqiad - creating new VM with same specs as krypton to replace it with a stretch instance and mirror miscweb2001. krypton to be removed (T224323, T105507, T224247) [10:25:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:12] T224323: ganeti VM request - miscweb2001 - equivalent of krypton - https://phabricator.wikimedia.org/T224323 [10:25:13] T224247: upgrade and rename krypton & create its codfw equivalent - https://phabricator.wikimedia.org/T224247 [10:25:13] T105507: request VM for misc. PHP applications - https://phabricator.wikimedia.org/T105507 [10:25:18] RECOVERY - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is OK: (C)130 ge (W)110 ge 109.4 https://phabricator.wikimedia.org/T202307 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [10:25:43] !log Remove grants from sarin from all the dbs, dbstore, parsercache, es, labsdb - T229796 [10:25:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:49] T229796: Remove sarin and neodymium GRANTs from all the databases - https://phabricator.wikimedia.org/T229796 [10:26:18] (03PS3) 10Mforns: analytics::refinery::job::data_purge.pp: fix geoeditors retention period [puppet] - 10https://gerrit.wikimedia.org/r/532684 (https://phabricator.wikimedia.org/T231017) [10:28:16] (03CR) 10Mforns: "Tested this in stat1007, and looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/532684 (https://phabricator.wikimedia.org/T231017) (owner: 10Mforns) [10:31:36] (03CR) 10Jbond: "looks good, see comments" (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/531943 (https://phabricator.wikimedia.org/T226810) (owner: 10Ayounsi) [10:32:05] (03CR) 10Marostegui: [C: 03+1] mariadb::core - eqiad: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531174 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [10:34:28] (03PS3) 10Giuseppe Lavagetto: aqs: convert to profile::lvs::realserver [puppet] - 10https://gerrit.wikimedia.org/r/532660 [10:37:20] (03PS2) 10Jbond: mariadb::core - eqiad: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531174 (https://phabricator.wikimedia.org/T102099) [10:38:38] (03CR) 10Jbond: [C: 03+2] mariadb::core - eqiad: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531174 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [10:40:04] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for mbsantos - https://phabricator.wikimedia.org/T227695 (10Nuria) Approved on my end if employment and nda have been stablished. @MSantos Please read https://office.wikimedia.org/wiki/Data_access_guidelines and https://wiki... [10:41:58] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for mbsantos - https://phabricator.wikimedia.org/T227695 (10Nuria) mmm.. actually hold on, referrer info is available in turnilo for the tile service: please see: https://turnilo.wikimedia.org/#webrequest_sampled_128 [10:44:02] (03CR) 10Joal: [C: 03+1] "thanks mforns :)" [puppet] - 10https://gerrit.wikimedia.org/r/532684 (https://phabricator.wikimedia.org/T231017) (owner: 10Mforns) [10:44:51] 10Operations, 10DBA: Remove sarin and neodymium GRANTs from all the databases - https://phabricator.wikimedia.org/T229796 (10Marostegui) `sarin` grants have been removed everywhere. 
[10:45:18] 10Operations, 10DBA: Remove sarin and neodymium GRANTs from all the databases - https://phabricator.wikimedia.org/T229796 (10Marostegui) [10:48:44] (03CR) 10Ema: [C: 03+1] ATS: make sure that the systemd service is enabled [puppet] - 10https://gerrit.wikimedia.org/r/532652 (owner: 10Vgutierrez) [10:49:31] (03CR) 10Jbond: [C: 03+1] "lgtm, thanks" [cookbooks] - 10https://gerrit.wikimedia.org/r/531897 (https://phabricator.wikimedia.org/T231066) (owner: 10Volans) [10:51:47] (03PS7) 10DannyS712: General cleanup of initialize settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/532280 (https://phabricator.wikimedia.org/T231178) [10:54:23] (03PS1) 10Dzahn: install_server: add miscweb1001 to DHCP [puppet] - 10https://gerrit.wikimedia.org/r/532687 (https://phabricator.wikimedia.org/T224247) [10:55:54] (03CR) 10Dzahn: [C: 03+2] install_server: add miscweb1001 to DHCP [puppet] - 10https://gerrit.wikimedia.org/r/532687 (https://phabricator.wikimedia.org/T224247) (owner: 10Dzahn) [10:56:02] (03PS2) 10Dzahn: install_server: add miscweb1001 to DHCP [puppet] - 10https://gerrit.wikimedia.org/r/532687 (https://phabricator.wikimedia.org/T224247) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190827T1100). [11:00:04] raynor: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:11] raynor: I'm happy to deploy your patch, unless you'd like to? [11:00:27] (03CR) 10Filippo Giunchedi: "See inline, LGTM overall. 
Added Traffic folks as heads up" (031 comment) [debs/prometheus-ipsec-exporter] - 10https://gerrit.wikimedia.org/r/532426 (https://phabricator.wikimedia.org/T230236) (owner: 10Herron) [11:00:55] o/ [11:01:28] (03CR) 10Filippo Giunchedi: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/531922 (https://phabricator.wikimedia.org/T225125) (owner: 10Mathew.onipe) [11:02:58] I have deployments rights, I can proceed by my own [11:03:29] Amir1 Lucas_WMDE awight Urbanecm ^ [11:03:30] great, good luck! [11:03:40] Go ahead [11:03:51] (03PS2) 10Pmiazga: Drop MobileWebUIActionsTracking sampling rate to 0.01% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/532422 (https://phabricator.wikimedia.org/T220016) [11:04:15] thx [11:04:20] (03CR) 10Pmiazga: [C: 03+2] Drop MobileWebUIActionsTracking sampling rate to 0.01% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/532422 (https://phabricator.wikimedia.org/T220016) (owner: 10Pmiazga) [11:04:24] (03CR) 10Nuria: [C: 03+1] "Nice" [puppet] - 10https://gerrit.wikimedia.org/r/532684 (https://phabricator.wikimedia.org/T231017) (owner: 10Mforns) [11:07:06] (03Merged) 10jenkins-bot: Drop MobileWebUIActionsTracking sampling rate to 0.01% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/532422 (https://phabricator.wikimedia.org/T220016) (owner: 10Pmiazga) [11:09:39] testing on mwdebug1002 [11:10:29] !log ganeti1001 - starting and OS install of new VM miscweb1001 [11:10:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:24] !log pmiazga@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:532422|Drop MobileWebUIActionsTracking sampling rate to 0.01% (T220016)]] (duration: 00m 46s) [11:11:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:31] T220016: Create, and deploy working MobileWebUIActionsTracking schema - https://phabricator.wikimedia.org/T220016 [11:15:40] synced [11:15:51] anyone wants to push sth more in current SWAT 
window? [11:15:54] !log depooling cp5001 [11:15:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:26] PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [11:17:52] (03PS3) 10Filippo Giunchedi: monitoring::host: rename critical to paging [puppet] - 10https://gerrit.wikimedia.org/r/528462 (https://phabricator.wikimedia.org/T228379) [11:18:35] !log EU SWAT finished [11:18:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:45] (03PS1) 10Marostegui: dbproxy1019: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/532692 (https://phabricator.wikimedia.org/T202367) [11:18:58] RECOVERY - High average POST latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [11:19:47] (03CR) 10Marostegui: [C: 03+2] dbproxy1019: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/532692 (https://phabricator.wikimedia.org/T202367) (owner: 10Marostegui) [11:21:08] (03CR) 10Filippo Giunchedi: [C: 03+1] "Good to go IMHO, PCC https://puppet-compiler.wmflabs.org/compiler1001/18053/" [puppet] - 10https://gerrit.wikimedia.org/r/528462 (https://phabricator.wikimedia.org/T228379) (owner: 10Filippo Giunchedi) [11:25:37] 10Operations, 10Traffic: Investigate HTTP/2 limits on trafficserver - https://phabricator.wikimedia.org/T231287 (10Vgutierrez) Triggering the issue is relatively easy browsing https://maps.wikimedia.org with Chrome 76: ` t=264968 [st=29427] HTTP2_SESSION_RECV_GOAWAY --> active_streams = 2... [11:27:59] (03CR) 10jenkins-bot: Drop MobileWebUIActionsTracking sampling rate to 0.01% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/532422 (https://phabricator.wikimedia.org/T220016) (owner: 10Pmiazga) [11:29:52] 10Operations, 10Traffic: Investigate HTTP/2 limits on trafficserver - https://phabricator.wikimedia.org/T231287 (10Vgutierrez) [11:36:12] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [11:40:04] (03PS1) 10DCausse: [cirrus] Stop generating new cirrusSearchChecker jobs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/532694 (https://phabricator.wikimedia.org/T231194) [11:41:06] any objections if I reopen the EU SWAT? 
[11:42:00] (03PS2) 10Dzahn: site: add miscweb role to miscweb1001 [puppet] - 10https://gerrit.wikimedia.org/r/532683 (https://phabricator.wikimedia.org/T224247) [11:42:19] (03CR) 10Dzahn: [C: 03+2] site: add miscweb role to miscweb1001 [puppet] - 10https://gerrit.wikimedia.org/r/532683 (https://phabricator.wikimedia.org/T224247) (owner: 10Dzahn) [11:43:11] !log reopening EU SWAT [11:43:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:03] (03CR) 10Gehel: [C: 04-1] "missing the @cee: token prefix for json syslog" [puppet] - 10https://gerrit.wikimedia.org/r/531922 (https://phabricator.wikimedia.org/T225125) (owner: 10Mathew.onipe) [11:45:42] (03PS6) 10Jbond: puppetmaster::frontend: update web conf to use RewriteRules instead of proxypass [puppet] - 10https://gerrit.wikimedia.org/r/528521 (https://phabricator.wikimedia.org/T228657) [11:45:53] (03CR) 10DCausse: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/532694 (https://phabricator.wikimedia.org/T231194) (owner: 10DCausse) [11:46:50] (03Merged) 10jenkins-bot: [cirrus] Stop generating new cirrusSearchChecker jobs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/532694 (https://phabricator.wikimedia.org/T231194) (owner: 10DCausse) [11:47:06] (03CR) 10jenkins-bot: [cirrus] Stop generating new cirrusSearchChecker jobs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/532694 (https://phabricator.wikimedia.org/T231194) (owner: 10DCausse) [11:49:24] !log dcausse@deploy1001 Synchronized wmf-config/CirrusSearch-production.php: T231194 [cirrus] Stop generating new cirrusSearchChecker jobs (duration: 00m 45s) [11:49:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:30] T231194: Increase concurrency of the cirrusCheckerJob - https://phabricator.wikimedia.org/T231194 [11:51:31] !log miscweb1001 - manually remove tin.eqiad.wmnet (!) from /srv/iegreview/iegreview-cache/.config and replace with deploy1001 after first puppet run. 
still existing bug that tin is not fully removed (T224247, T175288, T197470) [11:51:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:40] T224247: upgrade and rename krypton & create its codfw equivalent - https://phabricator.wikimedia.org/T224247 [11:51:41] T175288: setup/install/deploy deploy1001 as deployment server - https://phabricator.wikimedia.org/T175288 [11:51:41] T197470: find a way to systematically update the deployment server name across all repos - https://phabricator.wikimedia.org/T197470 [11:52:39] !log EU Swat done [11:52:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:56] !log miscweb1001 - a2dismod mpm_event ; a2enmod php7.0 ; systemctl restart apache2 (T224247, T196968) please also see https://gerrit.wikimedia.org/r/c/operations/puppet/+/451206 [11:55:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:04] T196968: Re-organize the apache configuration for MediaWiki in puppet - https://phabricator.wikimedia.org/T196968 [11:56:30] (03CR) 10Dzahn: "when setting up a new host miscweb this still needed manual intervention, see https://phabricator.wikimedia.org/T224247#5441002" [puppet] - 10https://gerrit.wikimedia.org/r/451206 (https://phabricator.wikimedia.org/T196968) (owner: 10Dzahn) [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190827T1200) [12:04:09] (03PS1) 10Dzahn: trafficserver/varnish: replace krypton with miscweb1001, rename director [puppet] - 10https://gerrit.wikimedia.org/r/532695 (https://phabricator.wikimedia.org/T224247) [12:04:37] (03CR) 10Marostegui: [C: 03+1] labs::db: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531240 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [12:10:13] cutting the branch, please don't restart gerrit ;) [12:10:20] (it has happened before) [12:14:00] (03CR) 10Volans: [C: 03+2] templates: 
add rendering of templates [software/homer] - 10https://gerrit.wikimedia.org/r/532453 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [12:15:19] !log pool cp1075 w/ ATS backend T228629 [12:15:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:25] T228629: ATS Backends: Test live cache_text traffic - https://phabricator.wikimedia.org/T228629 [12:17:06] (03Merged) 10jenkins-bot: templates: add rendering of templates [software/homer] - 10https://gerrit.wikimedia.org/r/532453 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [12:17:18] !log depool cp1075, confd is not watching the key "ats-be" [12:17:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:09] 10Operations, 10ops-eqiad: Degraded RAID on elastic1046 - https://phabricator.wikimedia.org/T228606 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on cumin1001.eqiad.wmnet for hosts: ` ['elastic1046.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/201908271218_gehel_1280... [12:19:47] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work): Degraded RAID on elastic1046 - https://phabricator.wikimedia.org/T228606 (10Gehel) [12:21:06] (03CR) 10Volans: [C: 03+2] actions: add generate action [software/homer] - 10https://gerrit.wikimedia.org/r/532454 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [12:22:08] (03CR) 10Jbond: [C: 03+2] puppetmaster::frontend: update web conf to use RewriteRules instead of proxypass [puppet] - 10https://gerrit.wikimedia.org/r/528521 (https://phabricator.wikimedia.org/T228657) (owner: 10Jbond) [12:23:09] (03PS1) 10Jbond: Revert "puppetmaster::frontend: update web conf to use RewriteRules instead of proxypass" [puppet] - 10https://gerrit.wikimedia.org/r/532699 [12:23:44] RECOVERY - SSH on elastic1046 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:24:10] zeljkof: heh, ok. 
no gerrit restart. thanks [12:24:20] (03CR) 10jenkins-bot: templates: add rendering of templates [software/homer] - 10https://gerrit.wikimedia.org/r/532453 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [12:24:21] (03CR) 10Gehel: [C: 04-1] elasticsearch: ship logs to syslog server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/531922 (https://phabricator.wikimedia.org/T225125) (owner: 10Mathew.onipe) [12:24:23] (03Merged) 10jenkins-bot: actions: add generate action [software/homer] - 10https://gerrit.wikimedia.org/r/532454 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [12:24:37] (03PS1) 10Ema: cache_text eqiad: read ats-be etcd keys [puppet] - 10https://gerrit.wikimedia.org/r/532700 (https://phabricator.wikimedia.org/T227432) [12:25:23] (03CR) 10jenkins-bot: actions: add generate action [software/homer] - 10https://gerrit.wikimedia.org/r/532454 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [12:28:07] (03CR) 10Filippo Giunchedi: elasticsearch: ship logs to syslog server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/531922 (https://phabricator.wikimedia.org/T225125) (owner: 10Mathew.onipe) [12:28:39] 10Operations, 10Continuous-Integration-Config: add ci test for admin module indentation - https://phabricator.wikimedia.org/T190766 (10hashar) 05Open→03Resolved I introduced the same faulty indentation and `modules/admin/data/data_test.py` does fail since it can not parse the yaml: ` ParserError: while par... 
[12:29:36] !log Rename table filejournal on enwiki on db1089 - T51195 [12:29:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:41] T51195: Drop filejournal table from WMF - https://phabricator.wikimedia.org/T51195 [12:29:58] (03PS2) 10Ema: cache_text eqiad: read ats-be etcd keys [puppet] - 10https://gerrit.wikimedia.org/r/532700 (https://phabricator.wikimedia.org/T227432) [12:30:31] (03CR) 10Ema: [C: 03+2] cache_text eqiad: read ats-be etcd keys [puppet] - 10https://gerrit.wikimedia.org/r/532700 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [12:32:47] (03PS1) 10Dzahn: site/install_server: remote krypton.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/532701 (https://phabricator.wikimedia.org/T224247) [12:33:50] mutante: thank you ;) [12:34:31] (03CR) 10Phamhi: [C: 03+2] tools-prometheus: add an allowance for ssh monitoring [puppet] - 10https://gerrit.wikimedia.org/r/532487 (owner: 10Bstorm) [12:36:04] !log pool cp1075 w/ ATS backend (for real) T228629 [12:36:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:10] T228629: ATS Backends: Test live cache_text traffic - https://phabricator.wikimedia.org/T228629 [12:36:13] (03CR) 10Gehel: [C: 04-1] elasticsearch: ship logs to syslog server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/531922 (https://phabricator.wikimedia.org/T225125) (owner: 10Mathew.onipe) [12:37:35] (03PS2) 10Jbond: labs::db: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531240 (https://phabricator.wikimedia.org/T102099) [12:38:09] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for mbsantos - https://phabricator.wikimedia.org/T227695 (10MSantos) Thanks @Nuria, @mpopov and @Mathew.onipe. >>! In T227695#5440789, @Nuria wrote: > mmm.. actually hold on, referrer info is available in turnilo for the til... 
[12:38:28] (03PS1) 10Dzahn: logstash: replace krypton with grafana1001 in collector ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/532702 (https://phabricator.wikimedia.org/T224247) [12:38:39] (03CR) 10Jbond: [C: 03+2] labs::db: add ipv6 mapped address [puppet] - 10https://gerrit.wikimedia.org/r/531240 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [12:39:25] (03PS2) 10Dzahn: site/install_server: remove krypton.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/532701 (https://phabricator.wikimedia.org/T224247) [12:40:56] PROBLEM - Disk space on elastic1025 is CRITICAL: DISK CRITICAL - free space: /srv 18812 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1025&var-datasource=eqiad+prometheus/ops [12:41:14] PROBLEM - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [12:45:01] 10Operations, 10Continuous-Integration-Config, 10Patch-For-Review: rspec-puppet fails with Could not find the daemon directory (tested [/etc/sv,/var/lib/service]) - https://phabricator.wikimedia.org/T203645 (10hashar) 05Open→03Resolved a:03hashar When someone encounters the issue, the module spec_helpe... 
[12:50:14] RECOVERY - Disk space on elastic1025 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1025&var-datasource=eqiad+prometheus/ops [12:53:58] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work): Degraded RAID on elastic1046 - https://phabricator.wikimedia.org/T228606 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['elastic1046.eqiad.wmnet'] ` Of which those **FAILED**: ` ['elastic1046.eqiad.wmnet'] ` [12:56:14] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work): Degraded RAID on elastic1046 - https://phabricator.wikimedia.org/T228606 (10Gehel) @Cmjohnson: it looks like the installer only sees a single disk, and thus can't partition. Could you check? Thanks! [12:56:25] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work): Degraded RAID on elastic1046 - https://phabricator.wikimedia.org/T228606 (10Gehel) a:05Gehel→03Cmjohnson [12:56:32] (03PS1) 10Ema: prometheus: fetch cache_text atsmtail@backend metrics [puppet] - 10https://gerrit.wikimedia.org/r/532705 (https://phabricator.wikimedia.org/T227432) [12:58:36] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for mbsantos - https://phabricator.wikimedia.org/T227695 (10Nuria) @MSantos As i mentioned above you do not need queries, turnilo actually has that data, please see: https://turnilo.wikimedia.org/#webrequest_sampled_128 [12:59:00] 10Operations, 10Discovery, 10Discovery-Search, 10Elasticsearch, 10Patch-For-Review: Icinga should alert on free disk space < 15% (now < 12%) on Elasticsearch hosts - https://phabricator.wikimedia.org/T130329 (10fgiunchedi) 05Resolved→03Open reopening as I think this is happening again, low disk space... 
[12:59:12] (03CR) 10Luke081515: [C: 03+1] [rowiki] Allow sysops to name patrollers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531956 (https://phabricator.wikimedia.org/T231099) (owner: 10Strainu) [12:59:38] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: fetch cache_text atsmtail@backend metrics [puppet] - 10https://gerrit.wikimedia.org/r/532705 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [13:00:04] zeljkof: #bothumor I ♥ Unicode. All rise for MediaWiki train - European version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190827T1300). [13:00:10] (03CR) 10Ema: [C: 03+2] prometheus: fetch cache_text atsmtail@backend metrics [puppet] - 10https://gerrit.wikimedia.org/r/532705 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [13:04:09] (03PS10) 10Jhedden: openstack: Add codfw1dev nova API and metadata to haproxy [puppet] - 10https://gerrit.wikimedia.org/r/530580 (https://phabricator.wikimedia.org/T223907) [13:05:12] (03PS1) 10Filippo Giunchedi: prometheus: bump logstash rate of ingestion threshold [puppet] - 10https://gerrit.wikimedia.org/r/532707 (https://phabricator.wikimedia.org/T228878) [13:08:47] (03PS11) 10Jhedden: openstack: Add codfw1dev nova API and metadata to haproxy [puppet] - 10https://gerrit.wikimedia.org/r/530580 (https://phabricator.wikimedia.org/T223907) [13:11:15] (03CR) 10CDanis: [C: 03+1] monitoring: alert on availability over two minutes [puppet] - 10https://gerrit.wikimedia.org/r/532335 (https://phabricator.wikimedia.org/T228379) (owner: 10Filippo Giunchedi) [13:12:03] (03CR) 10CDanis: [C: 03+1] nrpe::monitor_service: Make notes_url optional for ensure=absent [puppet] - 10https://gerrit.wikimedia.org/r/529590 (owner: 10Alex Monk) [13:12:05] (03CR) 10Jhedden: "great suggestions. thanks for the review, Arturo! I've made the changes."
(037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/530580 (https://phabricator.wikimedia.org/T223907) (owner: 10Jhedden) [13:12:54] (03CR) 10CDanis: [C: 03+1] mediawiki: remove per-host high CPU alerts [puppet] - 10https://gerrit.wikimedia.org/r/531142 (https://phabricator.wikimedia.org/T230396) (owner: 10Filippo Giunchedi) [13:14:18] (03CR) 10CDanis: "Hmm. Since we're here, I'm trying to think of a name that encompasses "paging everyone" or "paging all SRE" or "paging the default pager " [puppet] - 10https://gerrit.wikimedia.org/r/528462 (https://phabricator.wikimedia.org/T228379) (owner: 10Filippo Giunchedi) [13:17:23] 10Operations, 10Traffic, 10observability, 10User-fgiunchedi: Per-backend ATS Prometheus metrics - https://phabricator.wikimedia.org/T227668 (10fgiunchedi) Per-backend metrics are in place now via mtail, specifically: * request count: by backend, method, and status * total time spent took by requests: by b... [13:18:45] I'm preparing for train, but it's blocked, so there will be no deployments [13:19:20] well, in unlikely event that all blockers are resolved in the next less than two hours, then there will be train :) [13:20:53] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for mbsantos - https://phabricator.wikimedia.org/T227695 (10MSantos) @Nuria `maps200[1-4].codfw.wmnet` and `maps100[1-4].eqiad.wmnet` don't seem to be available in turnilo, also tiles are requested from the domain `maps.wikim... [13:25:24] PROBLEM - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [13:27:23] (03CR) 10CDanis: [C: 03+1] "LG but a question: there are going to be more Grafana hosts in the future (e.g. 
I will probably test grafana 6.x on a to-be-created grafan" [puppet] - 10https://gerrit.wikimedia.org/r/532702 (https://phabricator.wikimedia.org/T224247) (owner: 10Dzahn) [13:29:42] 10Operations: setup/install gerrit1001 - https://phabricator.wikimedia.org/T231046 (10Dzahn) Thanks @RobH it was right to assign this to me directly. I'll do that. [13:33:54] (03CR) 10Dzahn: "yea, i agree. it would be best to avoid any host names inside the class. it should be a parameter of the profile just like it already has " [puppet] - 10https://gerrit.wikimedia.org/r/532702 (https://phabricator.wikimedia.org/T224247) (owner: 10Dzahn) [13:34:01] (03PS1) 10Zfilipin: Group0 to 1.34.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/532710 [13:35:20] (03CR) 10Dzahn: "looks like this just happens to work right now (storing dashboards?) because the next ferm rule below opens the same port" [puppet] - 10https://gerrit.wikimedia.org/r/532702 (https://phabricator.wikimedia.org/T224247) (owner: 10Dzahn) [13:36:40] (03CR) 10Dzahn: [C: 03+2] logstash: replace krypton with grafana1001 in collector ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/532702 (https://phabricator.wikimedia.org/T224247) (owner: 10Dzahn) [13:36:50] (03PS2) 10Dzahn: logstash: replace krypton with grafana1001 in collector ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/532702 (https://phabricator.wikimedia.org/T224247) [13:43:05] (03PS1) 10Aklapper: Phabricator monthly email: Cover how to get list of most active task authors [puppet] - 10https://gerrit.wikimedia.org/r/532711 (https://phabricator.wikimedia.org/T231320) [13:43:33] !log repool cp5001 - T231287 [13:43:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:41] T231287: Investigate HTTP/2 limits on trafficserver - https://phabricator.wikimedia.org/T231287 [13:45:29] (03CR) 10Aklapper: "Hmm, do I have to escape the quotation marks?" 
[puppet] - 10https://gerrit.wikimedia.org/r/532711 (https://phabricator.wikimedia.org/T231320) (owner: 10Aklapper) [13:46:12] 10Operations, 10observability, 10Patch-For-Review, 10User-fgiunchedi: Extract metrics from logs - https://phabricator.wikimedia.org/T147923 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Going to resolve this generic task as we have mtail deployed in production and multiple users [13:47:21] !log zfilipin@deploy1001 Pruned MediaWiki: 1.34.0-wmf.15 (duration: 06m 44s) [13:47:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:07] (03PS2) 10Dzahn: Phabricator monthly email: Cover how to get list of most active task authors [puppet] - 10https://gerrit.wikimedia.org/r/532711 (https://phabricator.wikimedia.org/T231320) (owner: 10Aklapper) [13:50:16] (03CR) 10Dzahn: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/532711 (https://phabricator.wikimedia.org/T231320) (owner: 10Aklapper) [13:51:05] (03CR) 10jerkins-bot: [V: 04-1] Phabricator monthly email: Cover how to get list of most active task authors [puppet] - 10https://gerrit.wikimedia.org/r/532711 (https://phabricator.wikimedia.org/T231320) (owner: 10Aklapper) [13:51:19] (03CR) 10jerkins-bot: [V: 04-1] Phabricator monthly email: Cover how to get list of most active task authors [puppet] - 10https://gerrit.wikimedia.org/r/532711 (https://phabricator.wikimedia.org/T231320) (owner: 10Aklapper) [13:52:09] !log zfilipin@deploy1001 Pruned MediaWiki: 1.34.0-wmf.16 [keeping static files] (duration: 01m 35s) [13:52:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:23] (03CR) 10Dzahn: "you should not have to quote them since we use heredoc syntax, but something else fails here:" [puppet] - 10https://gerrit.wikimedia.org/r/532711 (https://phabricator.wikimedia.org/T231320) (owner: 10Aklapper) [13:53:19] (03PS3) 10Dzahn: Phabricator monthly email: Cover how to get list of most active task authors [puppet] - 
10https://gerrit.wikimedia.org/r/532711 (https://phabricator.wikimedia.org/T231320) (owner: 10Aklapper) [13:54:22] (03CR) 10jerkins-bot: [V: 04-1] Phabricator monthly email: Cover how to get list of most active task authors [puppet] - 10https://gerrit.wikimedia.org/r/532711 (https://phabricator.wikimedia.org/T231320) (owner: 10Aklapper) [13:56:58] 10Operations, 10Traffic, 10Patch-For-Review, 10User-fgiunchedi: Prometheus varnish metric churn due to VCL reloads - https://phabricator.wikimedia.org/T150479 (10fgiunchedi) There is still some churn due to the fact that multiple VCLs are loaded at the same time, and we're generating new uuids via `reload-... [13:57:19] 10Operations, 10Traffic, 10Patch-For-Review, 10User-fgiunchedi: Prometheus varnish metric churn due to VCL reloads - https://phabricator.wikimedia.org/T150479 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi [13:58:06] (03PS4) 10Dzahn: Phabricator monthly email: Cover how to get list of most active task authors [puppet] - 10https://gerrit.wikimedia.org/r/532711 (https://phabricator.wikimedia.org/T231320) (owner: 10Aklapper) [14:01:06] (03CR) 10Dzahn: [C: 03+2] Phabricator monthly email: Cover how to get list of most active task authors [puppet] - 10https://gerrit.wikimedia.org/r/532711 (https://phabricator.wikimedia.org/T231320) (owner: 10Aklapper) [14:02:57] (03CR) 10Bstorm: [C: 04-1] "I feel like this perpetuates the confusion or even deepens it. 
There are SMS users that are not in the sms contact group (for subteam pag" [puppet] - 10https://gerrit.wikimedia.org/r/528462 (https://phabricator.wikimedia.org/T228379) (owner: 10Filippo Giunchedi) [14:03:08] !log zfilipin@deploy1001 Started scap: testwiki to php-1.34.0-wmf.20 and rebuild l10n cache [14:03:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:16] (03PS2) 10Herron: change user to root [debs/prometheus-ipsec-exporter] - 10https://gerrit.wikimedia.org/r/532426 (https://phabricator.wikimedia.org/T230236) [14:05:44] 10Operations, 10CirrusSearch, 10Discovery-Search (Current work): labweb100[12]: Search backend error during get of .[array] after 0: unknown: No enabled connection - https://phabricator.wikimedia.org/T230994 (10debt) 05Open→03Resolved [14:06:58] (03CR) 10Bstorm: [C: 04-1] "I will say it is good to clarify the confusion about the icinga state, though. I could probably be convinced :) I'm just hoping for a s" [puppet] - 10https://gerrit.wikimedia.org/r/528462 (https://phabricator.wikimedia.org/T228379) (owner: 10Filippo Giunchedi) [14:07:08] (03PS3) 10Herron: change user to root [debs/prometheus-ipsec-exporter] - 10https://gerrit.wikimedia.org/r/532426 (https://phabricator.wikimedia.org/T230236) [14:07:51] (03CR) 10Herron: [V: 03+2 C: 03+2] change user to root (031 comment) [debs/prometheus-ipsec-exporter] - 10https://gerrit.wikimedia.org/r/532426 (https://phabricator.wikimedia.org/T230236) (owner: 10Herron) [14:17:04] (03PS3) 10CDanis: dbctl: always validate vs JSON schema [software/conftool] - 10https://gerrit.wikimedia.org/r/531972 [14:19:30] (03CR) 10CDanis: dbctl: always validate vs JSON schema (033 comments) [software/conftool] - 10https://gerrit.wikimedia.org/r/531972 (owner: 10CDanis) [14:22:20] PROBLEM - Check systemd state on cp1081 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:22:28] PROBLEM - statsv Varnishkafka log producer on cp1081 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [14:23:00] PROBLEM - Webrequests Varnishkafka log producer on cp1081 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [14:30:24] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 54.70, 26.45, 16.96 https://wikitech.wikimedia.org/wiki/Application_servers [14:30:32] PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 60.22, 29.23, 18.63 https://wikitech.wikimedia.org/wiki/Application_servers [14:30:36] PROBLEM - High CPU load on API appserver on mw1278 is CRITICAL: CRITICAL - load average: 63.77, 31.81, 21.21 https://wikitech.wikimedia.org/wiki/Application_servers [14:31:28] PROBLEM - High CPU load on API appserver on mw1277 is CRITICAL: CRITICAL - load average: 74.45, 34.79, 21.88 https://wikitech.wikimedia.org/wiki/Application_servers [14:31:42] PROBLEM - High CPU load on API appserver on mw1288 is CRITICAL: CRITICAL - load average: 64.60, 30.18, 19.98 https://wikitech.wikimedia.org/wiki/Application_servers [14:31:56] PROBLEM - High CPU load on API appserver on mw1280 is CRITICAL: CRITICAL - load average: 73.61, 37.21, 22.50 https://wikitech.wikimedia.org/wiki/Application_servers [14:32:10] RECOVERY - High CPU load on API appserver on mw1278 is OK: OK - load average: 25.66, 27.24, 20.59 https://wikitech.wikimedia.org/wiki/Application_servers [14:32:24] PROBLEM - High CPU load on API appserver on mw1286 is CRITICAL: CRITICAL - load average: 60.16, 34.60, 22.05 https://wikitech.wikimedia.org/wiki/Application_servers [14:33:04] RECOVERY - High CPU load on API appserver 
on mw1277 is OK: OK - load average: 26.54, 29.12, 21.10 https://wikitech.wikimedia.org/wiki/Application_servers [14:33:16] RECOVERY - High CPU load on API appserver on mw1288 is OK: OK - load average: 31.16, 30.04, 21.07 https://wikitech.wikimedia.org/wiki/Application_servers [14:33:30] RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 15.05, 22.31, 17.26 https://wikitech.wikimedia.org/wiki/Application_servers [14:33:32] RECOVERY - High CPU load on API appserver on mw1280 is OK: OK - load average: 26.31, 30.94, 21.69 https://wikitech.wikimedia.org/wiki/Application_servers [14:33:40] RECOVERY - High CPU load on API appserver on mw1233 is OK: OK - load average: 15.65, 22.78, 18.14 https://wikitech.wikimedia.org/wiki/Application_servers [14:33:56] RECOVERY - High CPU load on API appserver on mw1286 is OK: OK - load average: 23.67, 29.02, 21.22 https://wikitech.wikimedia.org/wiki/Application_servers [14:33:56] !log zfilipin@deploy1001 Finished scap: testwiki to php-1.34.0-wmf.20 and rebuild l10n cache (duration: 30m 48s) [14:34:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:28] (03PS8) 10BBlack: anycast recdns: enable for codfw clients [puppet] - 10https://gerrit.wikimedia.org/r/526788 (https://phabricator.wikimedia.org/T228190) [14:38:33] !log depool cp5001 - T231287 [14:38:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:38] T231287: Investigate HTTP/2 limits on trafficserver - https://phabricator.wikimedia.org/T231287 [14:38:52] (03CR) 10Zfilipin: [C: 03+2] Group0 to 1.34.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/532710 (owner: 10Zfilipin) [14:39:14] !log cp1081: restart crashed services varnishkafka-{statsv,webrequest}.service [14:39:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:34] RECOVERY - statsv Varnishkafka log producer on cp1081 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S 
/etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [14:39:52] (03Merged) 10jenkins-bot: Group0 to 1.34.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/532710 (owner: 10Zfilipin) [14:40:06] RECOVERY - Webrequests Varnishkafka log producer on cp1081 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [14:40:12] (03CR) 10jenkins-bot: Group0 to 1.34.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/532710 (owner: 10Zfilipin) [14:40:58] RECOVERY - Check systemd state on cp1081 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:41:44] (03CR) 10BryanDavis: [C: 04-1] "I think there are a few parameter typos to fix" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/532487 (owner: 10Bstorm) [14:43:10] 10Operations, 10Traffic: Investigate HTTP/2 limits on trafficserver - https://phabricator.wikimedia.org/T231287 (10Vgutierrez) Further analysis shows that actually ATS is rate limiting PRIORITY frames even when they are disabled: ` proxy.config.http2.stream_priority_enabled: 0 proxy.config.http2.max_priority_f... 
[14:44:16] PROBLEM - Host mw1280 is DOWN: PING CRITICAL - Packet loss = 100% [14:44:25] (03CR) 10Jhedden: tools-prometheus: add an allowance for ssh monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/532487 (owner: 10Bstorm) [14:45:15] !log zfilipin@deploy1001 rebuilt and synchronized wikiversions files: group0 to 1.34.0-wmf.20 [14:45:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:20] (03PS10) 10Mathew.onipe: lvs: allow access to wdqs lvs on port 8888 [puppet] - 10https://gerrit.wikimedia.org/r/529053 (https://phabricator.wikimedia.org/T176875) [14:45:22] (03PS4) 10Mathew.onipe: elasticsearch: ship logs to local syslog server [puppet] - 10https://gerrit.wikimedia.org/r/531922 (https://phabricator.wikimedia.org/T225125) [14:46:58] <_joe_> uhm mw1280 is unresponsive [14:47:11] onimisionipe: seems like you're on duty :) scap just said: `14:45:15 ['/usr/bin/scap', 'pull', '--no-php-restart', '--no-update-l10n', 'mw1268.eqiad.wmnet', 'mw1314.eqiad.wmnet', 'mw2255.codfw.wmnet', 'mw2290.codfw.wmnet', 'mw2216.codfw.wmnet', 'mw1251.eqiad.wmnet', 'mw2188.codfw.wmnet', 'mw1320.eqiad.wmnet', 'mw1285.eqiad.wmnet'] on mw1280.eqiad.wmnet returned [255]: ssh: connect to host mw1280.eqiad.wmnet port 22: [14:47:11] Connection timed out` [14:47:11] <_joe_> zeljkof: can you hold a sec? 
[14:47:24] (03CR) 10Mathew.onipe: elasticsearch: ship logs to local syslog server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/531922 (https://phabricator.wikimedia.org/T225125) (owner: 10Mathew.onipe) [14:47:31] <_joe_> yes, mw1280 seems to have gone down [14:47:42] _joe_: done with train, just noticed 1280 error [14:47:53] 10Operations, 10Analytics, 10Traffic: varnishkafka statsv and webrequest crashed on cp1081 - https://phabricator.wikimedia.org/T231331 (10ema) [14:48:02] <_joe_> you will need to resync I think [14:48:04] 10Operations, 10Analytics, 10Traffic: varnishkafka statsv and webrequest crashed on cp1081 - https://phabricator.wikimedia.org/T231331 (10ema) p:05Triage→03Normal [14:48:12] _joe_: now, or later? [14:48:38] (03CR) 10BBlack: [C: 03+2] anycast recdns: enable for codfw clients [puppet] - 10https://gerrit.wikimedia.org/r/526788 (https://phabricator.wikimedia.org/T228190) (owner: 10BBlack) [14:48:38] zeljkof: later, the host is currently down [14:48:53] (03CR) 10Bstorm: tools-prometheus: add an allowance for ssh monitoring (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/532487 (owner: 10Bstorm) [14:48:53] ema: ok, I'm around, let me know when I should do it [14:48:59] !log deploying anycast recdns resolv.conf setting to all codfw - T228190 [14:49:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:05] T228190: Roll out Anycast RecDNS to more servers - https://phabricator.wikimedia.org/T228190 [14:49:14] zeljkof: ack! [14:49:28] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for mbsantos - https://phabricator.wikimedia.org/T227695 (10Nuria) @MSantos Maps request are available in the dataset i linked to, here they are split by referrer: https://bit.ly/327pZde Probably some time with @mpopov will... 
[14:49:35] <_joe_> !log powercycling mw1280 [14:49:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:52] is moritzm "out of the office"? [14:49:53] zeljkof: you are in good hands already :) [14:50:04] onimisionipe: :) [14:50:17] !log oblivian@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=mw1280.eqiad.wmnet [14:50:32] "extended afk" [14:50:33] ? [14:50:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:25] <_joe_> zeljkof: let's see if it comes back [14:51:27] urandom: yes he's out of the "office" [14:52:00] RECOVERY - Host mw1280 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [14:52:10] <_joe_> there it goes [14:52:30] (03CR) 10Dzahn: "i got an email when testing it. it was just delayed" [puppet] - 10https://gerrit.wikimedia.org/r/532711 (https://phabricator.wikimedia.org/T231320) (owner: 10Aklapper) [14:52:33] :_joe_ should I run scap again? (now?) [14:52:39] <_joe_> !log running scap pull on mw1280 [14:52:50] <_joe_> zeljkof: I'm just not sure about the other hosts in that list [14:52:56] <_joe_> did scap complete on those? [14:52:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:27] _joe_: hm, I'm not sure :/ [14:53:38] _joe_: do you need help? [14:53:40] well, I don't think running scap again will break anything, right? [14:53:42] <_joe_> can you paste the output somewhere? 
[14:53:57] <_joe_> cdanis: if you want to help so that you know what to look at in this case [14:53:58] _joe_: sure, I'll create a phab paste [14:54:05] (03PS4) 10Bstorm: tools-prometheus: add an allowance for ssh monitoring [puppet] - 10https://gerrit.wikimedia.org/r/532487 [14:54:24] <_joe_> zeljkof: yeah it's probably the safest solution [14:54:39] <_joe_> cdanis: trying to understand what happened to mw1280 [14:54:47] <_joe_> zeljkof: 1 sec though [14:54:57] _joe_: here it is https://phabricator.wikimedia.org/P8986 [14:55:01] <_joe_> cdanis: so, right now mw1280 is pooled=inactive [14:55:10] <_joe_> so if zeljkof runs scap, it won't sync to it [14:55:18] aye [14:55:32] sync-apaches: 100% (ok: 266; fail: 1; left: 0) [14:55:33] <_joe_> so now I'm setting it to pooled=no [14:55:41] <_joe_> yeah it's only that one [14:55:42] so I think scap worked fine on the other hosts [14:55:45] <_joe_> zeljkof: no need to scap again [14:55:51] ok, thanks [14:56:06] yes, scap seems to say it only had trouble with mw1280 [14:56:42] <_joe_> cdanis: so now I'm running on the host "pool" [14:56:50] <_joe_> after running scap pull [14:56:53] yepyep [14:57:17] <_joe_> interestingly, scap needs to run as non-root and confctl needs root :P [14:57:53] I need a short break, but I'll be around, in case of any trouble with the train [15:00:57] 10Operations, 10Analytics, 10Traffic: varnishkafka statsv and webrequest crashed on cp1081 - https://phabricator.wikimedia.org/T231331 (10Nuria) CPU at 100%: https://grafana.wikimedia.org/d/000000377/host-overview?refresh=5m&orgId=1&var-server=cp1081&var-datasource=eqiad%20prometheus%2Fops&var-cluster=cache... 
[15:03:40] (03PS5) 10Mathew.onipe: elasticsearch: ship logs to local syslog server [puppet] - 10https://gerrit.wikimedia.org/r/531922 (https://phabricator.wikimedia.org/T225125) [15:05:50] (03PS12) 10Jhedden: openstack: Add codfw1dev nova API and metadata to haproxy [puppet] - 10https://gerrit.wikimedia.org/r/530580 (https://phabricator.wikimedia.org/T223907) [15:07:30] (03PS2) 10Giuseppe Lavagetto: k8s::master: switch to profile::lvs::realserver [puppet] - 10https://gerrit.wikimedia.org/r/532661 [15:08:13] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/18050/argon.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/532661 (owner: 10Giuseppe Lavagetto) [15:12:02] <_joe_> jenkins, behave would you [15:12:07] <_joe_> I got root and can be mean [15:13:27] <_joe_> apparently jenkins doesn't get intimidated that easy [15:13:42] (03CR) 10Bstorm: "There! With the fix from Jason in place: https://puppet-compiler.wmflabs.org/compiler1001/18064/" [puppet] - 10https://gerrit.wikimedia.org/r/532487 (owner: 10Bstorm) [15:14:03] (03CR) 10Bstorm: tools-prometheus: add an allowance for ssh monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/532487 (owner: 10Bstorm) [15:16:32] (03CR) 10Jhedden: [C: 03+1] tools-prometheus: add an allowance for ssh monitoring [puppet] - 10https://gerrit.wikimedia.org/r/532487 (owner: 10Bstorm) [15:17:15] (03PS5) 10Bstorm: tools-prometheus: add an allowance for ssh monitoring [puppet] - 10https://gerrit.wikimedia.org/r/532487 [15:17:23] (03PS3) 10Ottomata: modules::turnilo::templates::config.yaml.erb add edit_hourly [puppet] - 10https://gerrit.wikimedia.org/r/532467 (https://phabricator.wikimedia.org/T230963) (owner: 10Mforns) [15:17:36] (03CR) 10Ottomata: [C: 03+2] modules::turnilo::templates::config.yaml.erb add edit_hourly [puppet] - 10https://gerrit.wikimedia.org/r/532467 (https://phabricator.wikimedia.org/T230963) (owner: 10Mforns) [15:17:38] (03CR) 10Ottomata: [V: 03+2 C: 03+2] 
modules::turnilo::templates::config.yaml.erb add edit_hourly [puppet] - 10https://gerrit.wikimedia.org/r/532467 (https://phabricator.wikimedia.org/T230963) (owner: 10Mforns) [15:17:59] (03PS4) 10Ottomata: analytics::refinery::job::data_purge.pp: fix geoeditors retention period [puppet] - 10https://gerrit.wikimedia.org/r/532684 (https://phabricator.wikimedia.org/T231017) (owner: 10Mforns) [15:18:18] (03CR) 10Ottomata: [V: 03+2 C: 03+2] analytics::refinery::job::data_purge.pp: fix geoeditors retention period [puppet] - 10https://gerrit.wikimedia.org/r/532684 (https://phabricator.wikimedia.org/T231017) (owner: 10Mforns) [15:19:55] (03PS6) 10Bstorm: tools-prometheus: add an allowance for ssh monitoring [puppet] - 10https://gerrit.wikimedia.org/r/532487 [15:19:58] (03PS1) 10Vgutierrez: Release 8.0.5-1wm4 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/532723 (https://phabricator.wikimedia.org/T231287) [15:21:37] (03CR) 10Bstorm: [C: 03+2] tools-prometheus: add an allowance for ssh monitoring [puppet] - 10https://gerrit.wikimedia.org/r/532487 (owner: 10Bstorm) [15:22:12] (03CR) 10Volans: [C: 03+1] "LGTM" (031 comment) [software/conftool] - 10https://gerrit.wikimedia.org/r/531972 (owner: 10CDanis) [15:23:13] (03CR) 10CDanis: [C: 03+2] dbctl: always validate vs JSON schema [software/conftool] - 10https://gerrit.wikimedia.org/r/531972 (owner: 10CDanis) [15:24:13] (03PS2) 10CRusnov: profile::netbox: Fix swift proxy content-disposition [puppet] - 10https://gerrit.wikimedia.org/r/532509 (https://phabricator.wikimedia.org/T209182) [15:24:48] (03CR) 10CRusnov: "> Patch Set 1: Code-Review+1" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/532509 (https://phabricator.wikimedia.org/T209182) (owner: 10CRusnov) [15:24:53] (03CR) 10jerkins-bot: [V: 04-1] Release 8.0.5-1wm4 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/532723 (https://phabricator.wikimedia.org/T231287) (owner: 10Vgutierrez) [15:24:58] (03PS3) 10CRusnov: profile::netbox: Fix swift proxy 
content-disposition [puppet] - 10https://gerrit.wikimedia.org/r/532509 (https://phabricator.wikimedia.org/T209182) [15:27:19] All clear for a quick config deploy? [15:27:26] (03PS2) 10Vgutierrez: Release 8.0.5-1wm4 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/532723 (https://phabricator.wikimedia.org/T231287) [15:28:33] (03CR) 10CRusnov: [C: 03+2] profile::netbox: Fix swift proxy content-disposition [puppet] - 10https://gerrit.wikimedia.org/r/532509 (https://phabricator.wikimedia.org/T209182) (owner: 10CRusnov) [15:28:37] Taking silence as assent. [15:28:40] (03CR) 10Jforrester: [C: 03+2] Set `$wgRelatedArticlesDescriptionSource` to `wikidata` in config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/532675 (https://phabricator.wikimedia.org/T231279) (owner: 10DannyS712) [15:29:00] (03Merged) 10jenkins-bot: dbctl: always validate vs JSON schema [software/conftool] - 10https://gerrit.wikimedia.org/r/531972 (owner: 10CDanis) [15:35:48] (03Merged) 10jenkins-bot: Set `$wgRelatedArticlesDescriptionSource` to `wikidata` in config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/532675 (https://phabricator.wikimedia.org/T231279) (owner: 10DannyS712) [15:36:55] (03CR) 10jenkins-bot: Set `$wgRelatedArticlesDescriptionSource` to `wikidata` in config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/532675 (https://phabricator.wikimedia.org/T231279) (owner: 10DannyS712) [15:41:28] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: T231279 Set to (duration: 00m 54s) [15:41:31] !log That was T231279 Set `$wgRelatedArticlesDescriptionSource` to `wikidata` [15:41:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:41] T231279: Set $wgRelatedArticlesDescriptionSource = 'wikidata' in mediawiki-config - https://phabricator.wikimedia.org/T231279 [15:41:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:00] (03PS3) 10CDanis: dbctl: initial support for hostsByName 
[software/conftool] - 10https://gerrit.wikimedia.org/r/531973 (https://phabricator.wikimedia.org/T229676) [15:53:16] (03PS1) 10Urbanecm: Enable partial blocks on ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/532733 (https://phabricator.wikimedia.org/T231298) [16:00:04] godog and _joe_: It is that lovely time of the day again! You are hereby commanded to deploy Puppet SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190827T1600). [16:00:05] Amir1: A patch you scheduled for Puppet SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:47] jouncebot: next [16:00:47] In 0 hour(s) and 59 minute(s): Services – Graphoid / Parsoid / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190827T1700) [16:04:20] PROBLEM - Check systemd state on notebook1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:05:52] RECOVERY - Check systemd state on notebook1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:07:04] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Kanban, 10netops: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (10Ottomata) > OO, when we reimage these, let's use Buster! :) I take it back, use Stretch. Buster ships with J... [16:07:15] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Kanban, 10netops: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. 
- https://phabricator.wikimedia.org/T225128 (10Ottomata) [16:08:13] (03PS4) 10CDanis: dbctl: initial support for hostsByName [software/conftool] - 10https://gerrit.wikimedia.org/r/531973 (https://phabricator.wikimedia.org/T229676) [16:09:10] (03CR) 10CDanis: dbctl: initial support for hostsByName (031 comment) [software/conftool] - 10https://gerrit.wikimedia.org/r/531973 (https://phabricator.wikimedia.org/T229676) (owner: 10CDanis) [16:09:34] (03CR) 10CDanis: [C: 03+2] dbctl: initial support for hostsByName [software/conftool] - 10https://gerrit.wikimedia.org/r/531973 (https://phabricator.wikimedia.org/T229676) (owner: 10CDanis) [16:14:37] (03CR) 10jerkins-bot: [V: 04-1] dbctl: initial support for hostsByName [software/conftool] - 10https://gerrit.wikimedia.org/r/531973 (https://phabricator.wikimedia.org/T229676) (owner: 10CDanis) [16:15:27] o/ [16:15:37] (03CR) 10CDanis: [C: 03+2] "recheck" [software/conftool] - 10https://gerrit.wikimedia.org/r/531973 (https://phabricator.wikimedia.org/T229676) (owner: 10CDanis) [16:18:26] (03Merged) 10jenkins-bot: dbctl: initial support for hostsByName [software/conftool] - 10https://gerrit.wikimedia.org/r/531973 (https://phabricator.wikimedia.org/T229676) (owner: 10CDanis) [16:22:40] godog: Can you take a look at the patch at the puppet swat? [16:22:45] or anyone [16:23:08] Amir1: doh, sorry about that -- totally missed puppet swat [16:23:34] godog: All good, I would be surprised if any one doesn't miss it since it's so barely used [16:24:04] yeah, also the way I have my irc client setup it doesn't highlight my nick [16:24:13] if a bot mentions me that is [16:25:12] Amir1: anyways, I don't feel comfortable merging that as puppet swat, no +1s and potentially big side effects aiui if sth goes wrong [16:26:02] godog: I understand, how can I move this forward? 
We looked at it in depth with Alex and Krinkle at Wikimania [16:26:19] probably best suited to be merged in collaboration with service ops folks [16:26:37] noted [16:27:57] (03PS2) 10Ayounsi: Fastnetmon, add notification script [puppet] - 10https://gerrit.wikimedia.org/r/531943 (https://phabricator.wikimedia.org/T226810) [16:28:12] PROBLEM - Check systemd state on notebook1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:28:35] (03CR) 10Ayounsi: "Addressed!" (0312 comments) [puppet] - 10https://gerrit.wikimedia.org/r/531943 (https://phabricator.wikimedia.org/T226810) (owner: 10Ayounsi) [16:29:46] RECOVERY - Check systemd state on notebook1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:31:30] godog: is there anyone from serviceops around? [16:32:12] * Krinkle staging on mwdebug1002 [16:32:21] Rolling out https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/TwoColConflict/+/532721/ to join this week's branch [16:36:17] Amir1: I think joe is already off for the day but mutante might be around [16:36:34] Amir1: 16:36:20 sync-file failed: /srv/mediawiki-staging/php-1.34.0-wmf.20/extensions/TwoColConflict/.eslintrc.json is an invalid JSON file [16:36:45] what [16:36:51] how that happened [16:36:52] " // TODO recheck with the old interface code gone" [16:37:04] cdanis: thanks [16:37:16] Don't know how that passed before? [16:37:19] JSON not allowing comments 🤦 [16:37:48] Going for a narrower sync instead to bypass the issue [16:37:55] We are not touching that file [16:38:08] Krinkle: yeah, just the extension.json [16:38:36] !log krinkle@deploy1001 Synchronized php-1.34.0-wmf.20/extensions/TwoColConflict/extension.json: d6b5d441b, T229791 (duration: 00m 55s) [16:38:39] Amir1: not sure why eslint is tolerating it (or maybe it isn't?) 
afaik to do this you need to rename it to .yaml or .js (with exports=) instead [16:38:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:42] T229791: TwoColConflict and RevisionSlider resourceloader modules can be packed into one or two modules to save load time. - https://phabricator.wikimedia.org/T229791 [16:38:47] https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/TwoColConflict/+/514227/ [16:40:42] Krinkle: hahaha, the old code is already gone, we can drop the checks and the comment (we dropped it in Wikimania if you remember) [16:42:13] Amir1: Hm.. which old code? [17:00:04] cscott, arlolra, subbu, halfak, and accraze: How many deployers does it take to do Services – Graphoid / Parsoid / Citoid / ORES deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190827T1700). [17:03:14] Krinkle: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/TwoColConflict/+/530565 [17:03:50] Amir1: ah, didn't know about that. Guess I missed that at Wikimania? 
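Editor's note on the 16:36:34 sync-file failure: strict JSON has no comment syntax, so a `//` comment inside a `.json` file is rejected by a strict parser. The sketch below illustrates this with Python's standard `json` module. The file contents are hypothetical (only the comment text is quoted from the chat), and this is not the check that scap itself runs.

```python
import json

# Hypothetical file bodies; only the TODO comment text is quoted from the log.
with_comment = '{\n  "rules": {} // TODO recheck with the old interface code gone\n}'
without_comment = '{\n  "rules": {}\n}'


def is_valid_json(text):
    """Return True if text parses as strict JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


print(is_valid_json(with_comment))     # the // comment makes the parse fail
print(is_valid_json(without_comment))  # same content without the comment parses
```

Some tools read their `.json` config files with a more lenient parser, which may be how a comment like this goes unnoticed until a strict validator sees the file.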
[17:04:00] !log krinkle@deploy1001 Synchronized php-1.34.0-wmf.20/extensions/Echo: 34084279089f (duration: 00m 55s) [17:04:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:12] yeah, that also dropped some modules [17:18:33] 10Operations, 10Elasticsearch, 10Wikimedia-Logstash, 10observability, and 2 others: Migrate Elasticsearch from deprecated Gelf logstash input to rsyslog Kafka logging pipeline - https://phabricator.wikimedia.org/T225125 (10Mathew.onipe) a:03Mathew.onipe [17:19:36] 10Operations, 10Elasticsearch, 10SRE-tools, 10Discovery-Search (Current work): cookbook sre.elasticsearch.rolling-restart failed with cluster relforge - https://phabricator.wikimedia.org/T229807 (10Gehel) a:03Gehel [17:20:38] 10Operations, 10Elasticsearch, 10SRE-tools, 10Discovery-Search (Current work): cookbook sre.elasticsearch.rolling-restart failed with cluster relforge - https://phabricator.wikimedia.org/T229807 (10Gehel) 05Open→03Resolved [17:26:27] 10Operations, 10Discovery-Search (Current work): Run jstack / jmap / etc... with PrivateTmp=true - https://phabricator.wikimedia.org/T230774 (10Gehel) a:05Gehel→03Mathew.onipe Needs to be documented on https://phabricator.wikimedia.org/project/view/1227/ [17:29:26] (03CR) 10Aklapper: "Thanks Daniel for the quick review, fixing, merge!" [puppet] - 10https://gerrit.wikimedia.org/r/532711 (https://phabricator.wikimedia.org/T231320) (owner: 10Aklapper) [17:30:35] oh wow, the commit merged after all. 
I almost forgot about it [17:33:40] ugh, no it still hasn't [17:37:52] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 53.33% of data above the critical threshold [140.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [17:39:40] * Krinkle staging on mwdebug1002 [17:41:50] !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@1869f79]: Fix definition endpoint TypeError (T230503) [17:41:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:58] T230503: Cannot read property 'length' of undefined for definition - https://phabricator.wikimedia.org/T230503 [17:42:45] !log krinkle@deploy1001 Synchronized php-1.34.0-wmf.20/includes/password/PasswordPolicyChecks.php: 098755622f7 (duration: 00m 54s) [17:42:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:30] PROBLEM - PHP opcache health on mwdebug1002 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [17:46:30] !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@1869f79]: Fix definition endpoint TypeError (T230503) (duration: 04m 39s) [17:46:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:08] Krinkle: Still busy in prod? [17:54:14] I don't think so. Let me double check if there is a patch in CI or landed meanwhile that I forgot about [17:54:19] Sure. [17:55:03] James_F: clear [17:55:08] Thanks. [17:55:11] jouncebot: next [17:55:11] In 5 hour(s) and 4 minute(s): Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190827T2300) [17:55:19] I'm stealing the conch for some UBNs. 
[18:38:15] (03Abandoned) 10Jforrester: DNM: Disable the videojs TMH beta feature due to resource issues [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530187 (owner: 10Jforrester) [18:40:06] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [18:43:52] (03CR) 10Dmaza: [C: 03+1] Enable partial blocks on ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/532733 (https://phabricator.wikimedia.org/T231298) (owner: 10Urbanecm) [18:48:18] apergos: I'm actually running that script right now, will ping / post on ticket when it is done and the first point is in graphite [18:53:47] I need to deploy 2 apache config patches, which will require disabling puppet on mw servers for a while (https://gerrit.wikimedia.org/r/c/operations/puppet/+/526755 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/526757). I'll start in ~10 minutes if no one objects [18:53:51] Finally. 
[18:54:13] (03PS5) 10Gehel: Add L and M to allowed statement starts [puppet] - 10https://gerrit.wikimedia.org/r/526755 (owner: 10Smalyshev) [18:54:42] gehel: I'm here if anything is needed [18:55:46] SMalyshev: I'll need your help in testing on mwdebug [18:56:14] gehel: sure tell me when [18:56:18] (03CR) 10Reedy: Add L and M to allowed statement starts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/526755 (owner: 10Smalyshev) [18:56:33] !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.20/skins/MinervaNeue/skin.json: T231358 Fix userSandbox image path (duration: 00m 53s) [18:56:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:40] T231358: File '/srv/mediawiki/php-1.34.0-wmf.20/skins/MinervaNeue/resources/resources/skins.minerva.personalMenu.icons/userSandbox.svg' does not exist - https://phabricator.wikimedia.org/T231358 [18:57:39] gehel: ^ Seems odd to have LM only in upper case when Q and P are in both [18:57:48] (03CR) 10Gehel: [C: 03+1] Add L and M to allowed statement starts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/526755 (owner: 10Smalyshev) [18:57:52] Reedy: yeah it's historic [18:58:01] !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.20/includes/api/ApiQueryImageInfo.php: T231340 T231353 BadFileLookup::isBadFile() expects null, not false for the API (duration: 00m 53s) [18:58:07] we used to have some old code generate lowercases, it doesn't do that anymore [18:58:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:08] T231353: BadFileLookup.php: Argument 2 passed to MediaWiki\BadFileLookup::isBadFile() must implement interface MediaWiki\Linker\LinkTarget or be null, boolean given - https://phabricator.wikimedia.org/T231353 [18:58:08] T231340: TraditionalImageGallery.php: Argument 2 passed to MediaWiki\BadFileLookup::isBadFile() must implement interface MediaWiki\Linker\LinkTarget, bool given - https://phabricator.wikimedia.org/T231340 [18:58:12] 
Aha, ok. Yeah a comment is nice otherwise it looks like a bug :) [18:58:14] so for L/M no lowercases [18:58:18] Reedy: I answered in the CR, I'll add an inline comment to make it better [18:58:24] Reedy: I think it says so in the CR :) [18:58:54] (so q and p actually were a bug but we support them anyway :) [18:59:28] !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.20/includes/gallery/ImageGalleryBase.php: T231340 T231353 BadFileLookup::isBadFile() expects null, not false for galleries (duration: 00m 53s) [18:59:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:03] (03PS6) 10Gehel: Add L and M to allowed statement starts [puppet] - 10https://gerrit.wikimedia.org/r/526755 (owner: 10Smalyshev) [19:01:12] SMalyshev: can you check the comment ^ [19:03:04] sounds good [19:03:12] (03CR) 10Smalyshev: [C: 03+1] Add L and M to allowed statement starts [puppet] - 10https://gerrit.wikimedia.org/r/526755 (owner: 10Smalyshev) [19:04:16] (03PS1) 10Dmaza: Remove unusued setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/532769 [19:08:58] !log starting deployment of Apache config for lexemes / SDoC - T222321 [19:09:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:04] T222321: Make /entity/ alias work for Commons - https://phabricator.wikimedia.org/T222321 [19:11:44] (03PS2) 10Jforrester: Remove unusued wgEnableBlockNoticeStats setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/532769 (owner: 10Dmaza) [19:11:44] (03CR) 10Jforrester: "Fixed title per https://www.mediawiki.org/wiki/Gerrit/Commit_message_guidelines" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/532769 (owner: 10Dmaza) [19:12:16] (03CR) 10Jforrester: "I can just deploy this right now, if you want?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/532769 (owner: 10Dmaza) [19:13:02] (03CR) 10Gehel: [C: 03+2] Add L and M to allowed statement starts [puppet] - 10https://gerrit.wikimedia.org/r/526755 (owner: 10Smalyshev) [19:16:40] (03PS1) 10Zoranzoki21: Disable search engine indexing in some namespaces of Icelandic Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/532771 (https://phabricator.wikimedia.org/T231179) [19:16:47] SMalyshev: you can validate on mwdebug1001 [19:17:55] testing [19:18:01] the L/M thing right? [19:18:05] yep [19:19:02] hmm doesn't seem to do anything [19:19:43] wait maybe it's not on commons [19:19:45] let me see [19:20:21] SMalyshev: what's the URL you're testing? [19:21:23] https://www.wikidata.org/entity/statement/L40053-5aa77d7a-4c9e-ba1c-255b-3c8e4ab60d5d [19:21:49] redirect with Q works but with L doesn't [19:22:11] maybe that happens on some other place than mwdebug? [19:22:23] is that the only form? the regex we have expects ([QqPpLM]\d+) so only digits after the L [19:22:53] not UUID (or what looks like UUID) [19:22:54] ([QqPp]\d+).* [19:22:55] not only digits [19:22:56] it's the same as Q [19:23:14] the Q one works (just replace L with Q and see) [19:23:20] but the L one does not [19:24:09] I'm rolling back until I understand [19:24:58] I suspect it somehow does not get to mwdebug... [19:24:58] (03PS1) 10Gehel: Revert "Add L and M to allowed statement starts" [puppet] - 10https://gerrit.wikimedia.org/r/532772 [19:25:16] gehel: did Q work for you? this means the patch was not applied [19:25:26] yep, Q works [19:25:53] we're probably just not testing what we think we're testing [19:26:29] possibly. Anything I can do here? 
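Editor's note on the 19:22 regex exchange: the question was whether a pattern of the form `([QqPpLM]\d+)` followed by `.*` would fail on a statement ID that has a UUID-like suffix after the digits. A minimal sketch in Python suggests it would not. The `/entity/statement/` prefix and the anchors are assumptions made for illustration; only the character class, `\d+` and the trailing `.*` come from the chat, and the real Apache rewrite rule may differ.

```python
import re

# Hypothetical pattern: the prefix and anchors are assumed, the capture group
# and trailing .* follow what was quoted in the channel.
STATEMENT_RE = re.compile(r"^/entity/statement/([QqPpLM]\d+).*$")


def entity_id(path):
    """Return the captured entity ID, or None if the path does not match."""
    match = STATEMENT_RE.match(path)
    return match.group(1) if match else None


l_path = "/entity/statement/L40053-5aa77d7a-4c9e-ba1c-255b-3c8e4ab60d5d"
q_path = l_path.replace("/L", "/Q", 1)  # same path with Q in place of L

print(entity_id(l_path))          # L40053
print(entity_id(q_path))          # Q40053
print(entity_id(l_path.lower()))  # None: lowercase l is not in the class
```

Under these assumptions the UUID-like suffix is absorbed by `.*`, so the L form matches the same way the Q form does. That is consistent with the later suspicion in the channel that the test request was not reaching the patched configuration.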
[19:26:44] nope, I need to do some digging first [19:26:57] ok tell me if I can help with anything [19:27:15] There's probably something obvious, different between wikidata and "normal" wikis [19:30:46] quite possible, wouldn't be surprised [19:31:17] (03CR) 10Gehel: [V: 03+2 C: 03+2] Revert "Add L and M to allowed statement starts" [puppet] - 10https://gerrit.wikimedia.org/r/532772 (owner: 10Gehel) [19:33:00] (03CR) 10Dmaza: "> Patch Set 2:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/532769 (owner: 10Dmaza) [19:34:10] PROBLEM - ElasticSearch shard size check - 9243 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - commonswiki_content_1556235298(60.333333333333336gb) https://wikitech.wikimedia.org/wiki/Search%23If_it_has_been_indexed [19:39:24] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [140.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [19:45:42] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler={proxy:fcgi://127.0.0.1:9000,proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluste [19:45:42] ethod=GET [19:50:24] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [19:50:31] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1063 - https://phabricator.wikimedia.org/T231199 (10Cmjohnson) @Marostegui Replaced the disk with one of the few remaining used spares. I did notice 2 more disks are starting to fail....you may want to speed up the decom process. [19:51:15] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T230575 (10Cmjohnson) a:05Cmjohnson→03wiki_willy The reason for the task being declined. I verified that the failed disk is indeed 1.9TB but is a SSD. The original order and showing on the disk caddy label is for... [19:54:26] 10Operations, 10Cassandra, 10Core Platform Team Workboards (Clinic Duty Team), 10User-Eevans: Revisit default settings for c-foreach-restart - https://phabricator.wikimedia.org/T198787 (10Eevans) >>! In T198787#5439421, @WDoranWMF wrote: > @Eevans Who could look at this, would this be a good task for @Clar... [19:55:11] 10Operations, 10Cassandra, 10Core Platform Team Workboards (Clinic Duty Team), 10User-Eevans: Revisit default settings for c-foreach-restart - https://phabricator.wikimedia.org/T198787 (10Eevans) [19:55:24] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1021 with 10G interfaces - https://phabricator.wikimedia.org/T229873 (10Cmjohnson) a:05Cmjohnson→03Jclark-ctr @Jclark-ctr Can you run 10G DAC cables in rack B4. Connect to the 10G ports on the serve... 
[19:55:58] (03CR) 10Jbond: "thanks looks good to me just one minor nit" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/531943 (https://phabricator.wikimedia.org/T226810) (owner: 10Ayounsi) [19:56:09] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1022 with 10G interfaces - https://phabricator.wikimedia.org/T229872 (10Cmjohnson) a:05Cmjohnson→03Jclark-ctr @Jclark-ctr Can you run 10G DAC cables in rack B7. Connect to the 10G ports on the serve... [19:57:30] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1023 with 10G interfaces - https://phabricator.wikimedia.org/T229871 (10Cmjohnson) @Andrew This server will require a physical move to B2, B4 or B7. I will do this one last, working on cabling 1021/10... [20:00:21] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Kanban, 10netops: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (10Cmjohnson) a:05Cmjohnson→03Jclark-ctr @Jclark-ctr Can you move these servers as evenly as you can into r... [20:04:12] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 10 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214 (10LGoto) [20:13:03] 10Operations, 10ops-eqiad: Degraded RAID on sulfur - https://phabricator.wikimedia.org/T229134 (10wiki_willy) @Volans - hey Riccardo, not sure if you're the right person for this, but thought I'd try asking you. Is there a different output we can get for this alert, to help us isolate the disk issue a bit mor... 
[20:15:16] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 9 others: RFC: Use content hash based image / thumb URLs - https://phabricator.wikimedia.org/T149847 (10LGoto) [20:16:04] 10Operations, 10ops-eqiad: Degraded RAID on helium - https://phabricator.wikimedia.org/T224794 (10wiki_willy) @Jclark-ctr - can we resolve this task? Thanks, Willy [20:16:53] 10Operations, 10ops-eqiad: Degraded RAID on sulfur - https://phabricator.wikimedia.org/T229134 (10Volans) @wiki_willy The failed because of host unreachable, but is this still a commissioned host? I cannot see the records in the DNS repo, just the management ones are there. Also see its decom task: T224475 [20:23:00] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [20:28:53] 10Operations, 10ops-eqiad: Degraded RAID on sulfur - https://phabricator.wikimedia.org/T229134 (10wiki_willy) @Volans - ah that makes. Thanks, let's just resolve out this task then. [20:30:05] (03CR) 10Jforrester: [C: 03+2] Remove unusued wgEnableBlockNoticeStats setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/532769 (owner: 10Dmaza) [20:30:11] (03PS3) 10Jforrester: Remove unusued wgEnableBlockNoticeStats setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/532769 (owner: 10Dmaza) [20:30:19] (03CR) 10Jforrester: [C: 03+2] "…" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/532769 (owner: 10Dmaza) [20:32:20] PROBLEM - Check the Netbox report-s- puppetdb for fail status. 
on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [20:34:41] (03Merged) 10jenkins-bot: Remove unusued wgEnableBlockNoticeStats setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/532769 (owner: 10Dmaza) [20:35:49] (03Abandoned) 10Mholloway: MachineVision (beta): Update handler services to support label lookups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530605 (owner: 10Mholloway) [20:36:36] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Remove unusued wgEnableBlockNoticeStats setting (duration: 00m 54s) [20:36:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:47] (03CR) 10jenkins-bot: Remove unusued wgEnableBlockNoticeStats setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/532769 (owner: 10Dmaza) [20:48:52] 10Operations, 10ops-eqiad: Degraded RAID on sulfur - https://phabricator.wikimedia.org/T229134 (10wiki_willy) 05Open→03Resolved [20:50:44] PROBLEM - Check systemd state on notebook1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:54:54] RECOVERY - MegaRAID on db1063 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [20:55:53] (03PS1) 10CDanis: swiftrepl: bring close to as-is in production [software] - 10https://gerrit.wikimedia.org/r/532793 [20:56:21] (03PS2) 10CDanis: swiftrepl: bring close to as-is in production [software] - 10https://gerrit.wikimedia.org/r/532793 (https://phabricator.wikimedia.org/T231110) [21:00:42] PROBLEM - Check systemd state on notebook1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:11:06] !log disable both sides of the reline link between knams and esams - T230448 [21:11:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:12] T230448: Aug 28th: turn off 1/3 esams-knams lasers in advance of Relined PA-988002 maintenance - https://phabricator.wikimedia.org/T230448 [21:15:59] (03PS14) 10Thcipriani: Scap: scap_source correct gid [puppet] - 10https://gerrit.wikimedia.org/r/361796 [21:18:10] 10Operations: Updating DNS records - https://phabricator.wikimedia.org/T231387 (10AAlikhan) [21:19:35] 10Operations, 10WMF-Legal, 10serviceops: Move old transparency report pages to historical URLs and setup redirect - https://phabricator.wikimedia.org/T230638 (10Varnent) As I understand it - Legal would like the existing microsite located at transparency.wikimedia.org to be relocated to transparency.wikimedi... [21:21:06] 10Operations, 10Traffic, 10Readers-Web-Backlog (Needs Product Owner Decisions): [Bug] iPadOS 13 shows the desktop version of Safari with a broken layout - https://phabricator.wikimedia.org/T229875 (10ovasileva) [21:24:57] 10Operations: Updating DNS records - https://phabricator.wikimedia.org/T231387 (10BBlack) 05Open→03Stalled Holding on this until early next week, as we have too many decision-makers on vacation this week, and there are policy and security implications to granting DKIM for `@wikimedia.org` to a third party vi... [21:25:32] (03CR) 10BryanDavis: "legoktm: this apparently rotted in gerrit for a year :(" [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/453666 (https://phabricator.wikimedia.org/T169451) (owner: 10Legoktm) [21:27:32] (03CR) 10BryanDavis: [C: 03+1] "Obsoleted by I5a4062cddb0f6b5e0a5c16b25cc08e3e7ddbc150 ?" 
[docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/495291 (https://phabricator.wikimedia.org/T216712) (owner: 10Legoktm) [21:28:03] (03CR) 10BryanDavis: [C: 04-1] php72: Switch from thirdparty/php72 to component/php72 [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/495291 (https://phabricator.wikimedia.org/T216712) (owner: 10Legoktm) [21:28:35] (03CR) 10BryanDavis: [C: 03+2] jessie: Work around removal of jessie-backports [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/527652 (owner: 10BryanDavis) [21:28:44] (03CR) 10BryanDavis: [C: 03+2] locales-extended: Add support for Korean [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/527653 (https://phabricator.wikimedia.org/T130532) (owner: 10BryanDavis) [21:29:13] (03Merged) 10jenkins-bot: jessie: Work around removal of jessie-backports [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/527652 (owner: 10BryanDavis) [21:29:17] (03Merged) 10jenkins-bot: locales-extended: Add support for Korean [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/527653 (https://phabricator.wikimedia.org/T130532) (owner: 10BryanDavis) [21:30:08] 10Operations, 10serviceops: Error pulling image from docker registry - https://phabricator.wikimedia.org/T231388 (10Clarakosi) [21:30:12] 10Operations, 10WMF-Legal, 10serviceops: Move old transparency report pages to historical URLs and setup redirect - https://phabricator.wikimedia.org/T230638 (10BBlack) @Varnent: For the redirects: just the main https://transparency.wikimedia.org/ URL? Or also the sub-pages like https://transparency.wikimed... 
[21:30:24] 10Operations, 10WMF-Legal, 10serviceops: Move old transparency report pages to historical URLs and setup redirect - https://phabricator.wikimedia.org/T230638 (10BBlack) 05Stalled→03Open [21:34:06] (03PS3) 10Ayounsi: Fastnetmon, add notification script [puppet] - 10https://gerrit.wikimedia.org/r/531943 (https://phabricator.wikimedia.org/T226810) [21:34:32] 10Operations, 10Traffic, 10serviceops: Error pulling image from docker registry - https://phabricator.wikimedia.org/T231388 (10Jdforrester-WMF) Tagging in Traffic; this is the server (cp1075) running ATS not Varnish, right? [21:34:40] (03CR) 10jerkins-bot: [V: 04-1] Fastnetmon, add notification script [puppet] - 10https://gerrit.wikimedia.org/r/531943 (https://phabricator.wikimedia.org/T226810) (owner: 10Ayounsi) [21:36:41] (03CR) 10Ayounsi: Fastnetmon, add notification script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/531943 (https://phabricator.wikimedia.org/T226810) (owner: 10Ayounsi) [21:37:00] (03CR) 10Ayounsi: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/531943 (https://phabricator.wikimedia.org/T226810) (owner: 10Ayounsi) [21:37:20] (03CR) 10jerkins-bot: [V: 04-1] Fastnetmon, add notification script [puppet] - 10https://gerrit.wikimedia.org/r/531943 (https://phabricator.wikimedia.org/T226810) (owner: 10Ayounsi) [21:41:43] 10Operations, 10Traffic, 10serviceops: Error pulling image from docker registry - https://phabricator.wikimedia.org/T231388 (10BBlack) a:03ema Assigning to @ema to investigate (yes, this is the live test server for ATS backends for these servers). Most likely the problem is specific to ATS<->docker-regist... 
[21:42:13] (03CR) 10BryanDavis: [C: 03+2] Apply black formatting [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/528177 (owner: 10Bstorm) [21:44:43] (03Merged) 10jenkins-bot: Apply black formatting [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/528177 (owner: 10Bstorm) [21:44:50] 10Operations, 10Traffic, 10serviceops: Error pulling image from docker registry - https://phabricator.wikimedia.org/T231388 (10greg) p:05Triage→03Unbreak! This is blocking CI runs. [21:47:11] 10Operations, 10Traffic, 10serviceops: Error pulling image from docker registry - https://phabricator.wikimedia.org/T231388 (10ayounsi) Note that it's breaking Jenkins on the Puppet repo (goes straight to -1). https://integration.wikimedia.org/ci/job/operations-puppet-tests-stretch-docker/20234/console [21:50:35] (03PS1) 10Cwhite: add the option of passing a custom metrics context manager to EndpointRequest [software/service-checker] - 10https://gerrit.wikimedia.org/r/532807 [21:50:55] confctl select name=cp1075.eqiad.wmnet,service=ats-be set/pooled=no [21:51:02] heh, meant to log that :) [21:51:07] !log bblack@cumin1001 conftool action : set/pooled=no; selector: name=cp1075.eqiad.wmnet,service=ats-be [21:51:13] (03CR) 10Cwhite: [C: 03+1] mediawiki: remove per-host high CPU alerts [puppet] - 10https://gerrit.wikimedia.org/r/531142 (https://phabricator.wikimedia.org/T230396) (owner: 10Filippo Giunchedi) [21:51:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:15] it self-logs anyways! [21:51:35] (03CR) 10Cwhite: [C: 03+1] prometheus: bump logstash rate of ingestion threshold [puppet] - 10https://gerrit.wikimedia.org/r/532707 (https://phabricator.wikimedia.org/T228878) (owner: 10Filippo Giunchedi) [21:51:58] 10Operations, 10Traffic, 10serviceops: Error pulling image from docker registry - https://phabricator.wikimedia.org/T231388 (10BBlack) Depooled cp1075 `ats-be` service via confctl, can someone retry and confirm mitigated? 
[21:53:15] (03CR) 10Ayounsi: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/531943 (https://phabricator.wikimedia.org/T226810) (owner: 10Ayounsi) [21:53:38] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T230575 (10wiki_willy) @Bstorm - I was able to confirm we originally ordered this machine to include 1.6tb drives via https://phabricator.wikimedia.org/T155075 , but wasn't able to find any other tasks that showed whe... [21:58:44] 10Operations, 10Traffic, 10serviceops: Error pulling image from docker registry - https://phabricator.wikimedia.org/T231388 (10Clarakosi) >>! In T231388#5443941, @BBlack wrote: > Depooled cp1075 `ats-be` service via confctl, can someone retry and confirm mitigated? It works! [21:59:38] 10Operations, 10Traffic, 10serviceops: Error pulling image from docker registry - https://phabricator.wikimedia.org/T231388 (10BBlack) Please leave this open for now so @ema can look at a more-permanent fixup tomorrow! [22:00:30] 10Operations, 10Traffic, 10serviceops: Error pulling image from docker registry - https://phabricator.wikimedia.org/T231388 (10Jdforrester-WMF) p:05Unbreak!→03Normal De-prioritising. [22:11:25] (03CR) 10Volans: [C: 03+1] "Thanks for all the fixes, a suggestion and a nit inline, but LGTM." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/531943 (https://phabricator.wikimedia.org/T226810) (owner: 10Ayounsi) [22:11:30] 10Operations, 10WMF-Communications: Updating DNS records - https://phabricator.wikimedia.org/T231387 (10Varnent) [22:22:42] I'm conching. 
[22:29:04] (03PS4) 10Ayounsi: Fastnetmon, add notification script [puppet] - 10https://gerrit.wikimedia.org/r/531943 (https://phabricator.wikimedia.org/T226810) [22:29:56] !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.20/extensions/VisualEditor/lib/ve/src/ce/nodes/ve.ce.GeneratedContentNode.js: T231381 Follow-up I196f5bd88: Fix typo (set node=this) (duration: 00m 57s) [22:30:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:30:02] T231381: [Regression wmf.20] Uncaught TypeError: Cannot read property 'getRoot' of undefined appears when adding new citation/chemical and math formula - https://phabricator.wikimedia.org/T231381 [22:31:32] (03CR) 10Ayounsi: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/18068/netflow1001.eqiad.wmnet/" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/531943 (https://phabricator.wikimedia.org/T226810) (owner: 10Ayounsi) [22:31:41] (03PS5) 10Ayounsi: Fastnetmon, add notification script [puppet] - 10https://gerrit.wikimedia.org/r/531943 (https://phabricator.wikimedia.org/T226810) [22:35:29] Conch released. [23:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: That opportune time is upon us again. Time for a Evening SWAT (Max 6 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190827T2300). [23:00:04] Jdlrobson: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:02:17] o/ [23:02:20] Jdlrobson: I can SWAT today! [23:02:25] Urbanecm: thanks! 
:D [23:02:34] (was going to write a msg in another chan, thanks for showing :)) [23:05:26] Jdlrobson: +2'ed your backports, going to deploy few config patches while waiting on CI [23:07:08] Jdlrobson: why doesn't https://gerrit.wikimedia.org/r/c/mediawiki/skins/MinervaNeue/+/532756 do the same change as https://gerrit.wikimedia.org/r/c/mediawiki/skins/MinervaNeue/+/532755 through? [23:07:37] (03CR) 10Urbanecm: [C: 03+2] Whitelist *.wikimedia.cz in wgCopyUploadsDomains for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/532438 (https://phabricator.wikimedia.org/T231247) (owner: 10Urbanecm) [23:07:45] (03PS2) 10Urbanecm: Whitelist *.wikimedia.cz in wgCopyUploadsDomains for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/532438 (https://phabricator.wikimedia.org/T231247) [23:07:50] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/532438 (https://phabricator.wikimedia.org/T231247) (owner: 10Urbanecm) [23:13:21] nice catch Urbanecm they should be the same [23:13:24] one was an older version [23:13:38] would have had same effect but probably better to use the merged version [23:13:38] PROBLEM - Disk space on elastic1025 is CRITICAL: DISK CRITICAL - free space: /srv 27315 MB (5% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1025&var-datasource=eqiad+prometheus/ops [23:13:49] Urbanecm: have amended https://gerrit.wikimedia.org/r/#/c/mediawiki/skins/MinervaNeue/+/532756/ [23:14:05] (03Merged) 10jenkins-bot: Whitelist *.wikimedia.cz in wgCopyUploadsDomains for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/532438 (https://phabricator.wikimedia.org/T231247) (owner: 10Urbanecm) [23:14:26] (03CR) 10jenkins-bot: Whitelist *.wikimedia.cz in wgCopyUploadsDomains for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/532438 (https://phabricator.wikimedia.org/T231247) (owner: 
10Urbanecm) [23:14:26] thanks Jdlrobson, +2'ed [23:16:00] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 1687ec9: Whitelist *.wikimedia.cz in wgCopyUploadsDomains for commonswiki (T231247) (duration: 00m 54s) [23:16:03] (03PS2) 10Urbanecm: Enable partial blocks on ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/532733 (https://phabricator.wikimedia.org/T231298) [23:16:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:16:07] T231247: Whitelist *.wikimedia.cz in wgCopyUploadsDomains for commonswiki - https://phabricator.wikimedia.org/T231247 [23:16:10] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/532733 (https://phabricator.wikimedia.org/T231298) (owner: 10Urbanecm) [23:17:16] PROBLEM - PHP opcache health on mwdebug1001 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [23:18:13] (03Merged) 10jenkins-bot: Enable partial blocks on ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/532733 (https://phabricator.wikimedia.org/T231298) (owner: 10Urbanecm) [23:19:26] (03CR) 10jenkins-bot: Enable partial blocks on ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/532733 (https://phabricator.wikimedia.org/T231298) (owner: 10Urbanecm) [23:21:56] RECOVERY - PHP opcache health on mwdebug1001 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [23:22:25] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 3704bb7: Enable partial blocks on ruwiki (T231298) (duration: 00m 54s) [23:22:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:31] T231298: Enable Partial blocks on Russian Wikipedia - https://phabricator.wikimedia.org/T231298 [23:24:23] (03PS8) 10Urbanecm: General cleanup of initialize settings [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/532280 (https://phabricator.wikimedia.org/T231178) (owner: 10DannyS712) [23:25:18] (03CR) 10Urbanecm: [C: 04-1] "Per Krinkle" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/532280 (https://phabricator.wikimedia.org/T231178) (owner: 10DannyS712) [23:26:32] Jdlrobson: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MobileFrontend/+/532753 is on mwdebug1002, please test and let me knwo [23:26:34] *know [23:26:44] on it [23:28:05] Urbanecm: you can sync [23:28:32] thanks Jdlrobson [23:29:10] RECOVERY - Disk space on elastic1025 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1025&var-datasource=eqiad+prometheus/ops [23:30:18] !log urbanecm@deploy1001 Synchronized php-1.34.0-wmf.20/extensions/MobileFrontend/resources/dist/: SWAT: a109b25: Build assets reflecting edit change (duration: 00m 55s) [23:30:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:30:29] (03PS9) 10DannyS712: General cleanup of initialize settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/532280 (https://phabricator.wikimedia.org/T231178) [23:30:38] done Jdlrobson. 
The other one is still not merged, will ping you once it's on mwdebug [23:30:56] sounds good [23:31:00] (03CR) 10DannyS712: "Resolved Krinkle's note in PS9" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/532280 (https://phabricator.wikimedia.org/T231178) (owner: 10DannyS712) [23:32:15] (03PS2) 10Urbanecm: [sqwikiquote] Enable WikiLove and SandboxLink [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530381 (https://phabricator.wikimedia.org/T230390) (owner: 10Jforrester) [23:32:29] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530381 (https://phabricator.wikimedia.org/T230390) (owner: 10Jforrester) [23:34:18] (03Merged) 10jenkins-bot: [sqwikiquote] Enable WikiLove and SandboxLink [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530381 (https://phabricator.wikimedia.org/T230390) (owner: 10Jforrester) [23:35:08] (03PS14) 10Urbanecm: Update HD logo for wikisource using default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529175 (owner: 10Viztor) [23:36:05] !log Run mwscript extensions/WikimediaMaintenance/createExtensionTables.php sqwikiquote wikilove (T230390) [23:36:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:36:10] T230390: Activate WikiLove and SandboxLink extensions for sq.wikiquote - https://phabricator.wikimedia.org/T230390 [23:39:11] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 1422870: [sqwikiquote] Enable WikiLove and SandboxLink (T230390) (duration: 00m 54s) [23:39:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:39:59] Jdlrobson: https://gerrit.wikimedia.org/r/c/mediawiki/skins/MinervaNeue/+/532756 is on mwdebug1002, please test and let me know. [23:41:31] on it [23:41:55] and confirmed Urbanecm ! (easy one!) 
[23:42:04] thanks Jdlrobson [23:43:44] (03CR) 10jenkins-bot: [sqwikiquote] Enable WikiLove and SandboxLink [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530381 (https://phabricator.wikimedia.org/T230390) (owner: 10Jforrester) [23:44:07] !log urbanecm@deploy1001 Synchronized php-1.34.0-wmf.20/skins/MinervaNeue/: SWAT: 4d04797: Restore contributions icon to non-AMC menu (T231363) (duration: 00m 54s) [23:44:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:44:13] T231363: Contributions icon is missing from main menu in non-AMC mode for logged in users - https://phabricator.wikimedia.org/T231363 [23:44:31] Jdlrobson: synced. [23:45:05] !log Evening SWAT done [23:45:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:46:14] thanks for all your help Urbanecm ! [23:50:04] happy to help Jdlrobson