[00:13:01] PROBLEM - dump of m1 in eqiad on db2093 is CRITICAL: dump for m1 at eqiad taken more than 8 days ago: Most recent backup 2020-06-16 00:00:01 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[00:22:05] PROBLEM - Check systemd state on an-launcher1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:23:06] (CR) Mstyles: sdoc gui custom config (4 comments) [puppet] - https://gerrit.wikimedia.org/r/606297 (https://phabricator.wikimedia.org/T251514) (owner: Mstyles)
[00:41:56] (PS1) Reedy: Use strucuted logging fields for xff logs [mediawiki-config] - https://gerrit.wikimedia.org/r/607389
[01:05:17] PROBLEM - Disk space on an-launcher1001 is CRITICAL: DISK CRITICAL - free space: / 3440 MB (3% inode=95%): /tmp 3440 MB (3% inode=95%): /var/tmp 3440 MB (3% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-launcher1001&var-datasource=eqiad+prometheus/ops
[01:08:17] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 64, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:08:53] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 133, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:37:26] (PS2) Reedy: Use structured logging fields for xff logs [mediawiki-config] - https://gerrit.wikimedia.org/r/607389
[02:50:13] PROBLEM - dump of m5 in eqiad on db2093 is CRITICAL: dump for m5 at eqiad taken more than 8 days ago: Most recent backup 2020-06-16 02:44:42 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[02:55:12] Operations, LDAP-Access-Requests: NDA for superset access request from WMDE
employee danshick - https://phabricator.wikimedia.org/T254442 (Dzahn) @KFrancis Thank you for the update! It's possible that someone else will continue this ticket because we have a rotating duty to handle access requests. Either...
[03:10:07] PROBLEM - Disk space on an-launcher1001 is CRITICAL: DISK CRITICAL - free space: / 3343 MB (3% inode=95%): /tmp 3343 MB (3% inode=95%): /var/tmp 3343 MB (3% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-launcher1001&var-datasource=eqiad+prometheus/ops
[04:39:21] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 101 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[04:45:23] Operations, DBA: db1088 crashed - https://phabricator.wikimedia.org/T255927 (Marostegui) No, a BBU failure should not trigger a host reboot. Unfortunately, this is something we've seen with HP hosts over the years. Dell has also shown (sometimes) similar behaviours, which ended up with RAID controllers r...
[04:50:15] (PS1) Marostegui: Revert "dbproxy1012,1014: Place db1097 as standby host." [puppet] - https://gerrit.wikimedia.org/r/607406
[04:50:38] (CR) Marostegui: "This has worked fine, so reverting to the original config where db1117 is the standby host." [puppet] - https://gerrit.wikimedia.org/r/607406 (owner: Marostegui)
[04:52:01] (CR) Marostegui: [C: +2] Revert "dbproxy1012,1014: Place db1097 as standby host."
[puppet] - https://gerrit.wikimedia.org/r/607406 (owner: Marostegui)
[04:53:03] !log Reload haproxy on dbproxy1012 and dbproxy1014
[04:53:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:02:37] !log marostegui@cumin2001 dbctl commit (dc=all): 'Depool db1085 for MCR schema change', diff saved to https://phabricator.wikimedia.org/P11643 and previous config saved to /var/cache/conftool/dbconfig/20200624-050235-marostegui.json
[05:02:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:03:23] !log Remove revision triggers from db1125:3316
[05:03:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:13:18] Operations, DBA, CAS-SSO, User-jbond: Request new database for idp-test.wikimedia.org - https://phabricator.wikimedia.org/T256120 (Marostegui) It should be ok to give them `CREATE TABLE` and even some need `CREATE TEMPORARY TABLE`, I think those two should be fine if they are needed. I would even...
[05:14:56] !log Remove grants from dbproxy1008 - T231280 T255406
[05:15:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:15:02] T255406: decommission dbproxy1008.eqiad.wmnet - https://phabricator.wikimedia.org/T255406
[05:15:02] T231280: Remove grants for the old dbproxy hosts from the misc databases - https://phabricator.wikimedia.org/T231280
[05:17:05] Operations, ops-eqiad, DBA, decommission-hardware: decommission dbproxy1008.eqiad.wmnet - https://phabricator.wikimedia.org/T255406 (Marostegui) dbproxy1008 grants removed from m3 (and also checked all the other mX sections): ` root@cumin2001:/home/marostegui# ./section m3 | while read host port;...
[05:19:40] (PS1) Marostegui: production-m3.sql: Remove grants for dbproxy1008 [puppet] - https://gerrit.wikimedia.org/r/607407 (https://phabricator.wikimedia.org/T255406)
[05:27:09] (CR) Marostegui: [C: +2] production-m3.sql: Remove grants for dbproxy1008 [puppet] - https://gerrit.wikimedia.org/r/607407 (https://phabricator.wikimedia.org/T255406) (owner: Marostegui)
[05:28:11] PROBLEM - dump of m3 in eqiad on db2093 is CRITICAL: dump for m3 at eqiad taken more than 8 days ago: Most recent backup 2020-06-16 05:20:35 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[05:28:24] Operations, ops-eqiad, DBA, decommission-hardware, Patch-For-Review: decommission dbproxy1008.eqiad.wmnet - https://phabricator.wikimedia.org/T255406 (Marostegui)
[05:30:08] (PS1) Marostegui: report_users: Remove dbproxy1008 [software] - https://gerrit.wikimedia.org/r/607408 (https://phabricator.wikimedia.org/T255406)
[05:32:26] (PS1) Marostegui: mariadb: Remove dbproxy1008 [puppet] - https://gerrit.wikimedia.org/r/607409 (https://phabricator.wikimedia.org/T255406)
[05:33:25] !log marostegui@cumin2001 START - Cookbook sre.hosts.decommission
[05:33:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:33:49] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:34:09] (CR) Marostegui: [C: +2] report_users: Remove dbproxy1008 [software] - https://gerrit.wikimedia.org/r/607408 (https://phabricator.wikimedia.org/T255406) (owner: Marostegui)
[05:34:28] !log marostegui@cumin2001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[05:34:28] (CR) Marostegui: [C: +2] mariadb: Remove dbproxy1008 [puppet] - https://gerrit.wikimedia.org/r/607409 (https://phabricator.wikimedia.org/T255406)
(owner: Marostegui)
[05:34:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:34:33] Operations, ops-eqiad, DBA, decommission-hardware, Patch-For-Review: decommission dbproxy1008.eqiad.wmnet - https://phabricator.wikimedia.org/T255406 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin2001 for hosts: `dbproxy1008.eqiad.wmnet` - dbproxy1008.eqiad...
[05:35:26] Operations, ops-eqiad, DBA, decommission-hardware, Patch-For-Review: decommission dbproxy1008.eqiad.wmnet - https://phabricator.wikimedia.org/T255406 (Marostegui)
[05:35:37] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:37:04] (PS1) Marostegui: templates: Remove dbproxy1008 production entries [dns] - https://gerrit.wikimedia.org/r/607410 (https://phabricator.wikimedia.org/T255406)
[05:41:01] (PS1) Marostegui: check-microcode.py: Remove dbproxy1008 [puppet] - https://gerrit.wikimedia.org/r/607411 (https://phabricator.wikimedia.org/T255406)
[05:41:06] (CR) Marostegui: [C: +2] templates: Remove dbproxy1008 production entries [dns] - https://gerrit.wikimedia.org/r/607410 (https://phabricator.wikimedia.org/T255406) (owner: Marostegui)
[05:42:09] (CR) jerkins-bot: [V: -1] check-microcode.py: Remove dbproxy1008 [puppet] - https://gerrit.wikimedia.org/r/607411 (https://phabricator.wikimedia.org/T255406) (owner: Marostegui)
[05:44:37] (CR) Marostegui: "recheck" [puppet] - https://gerrit.wikimedia.org/r/607411 (https://phabricator.wikimedia.org/T255406) (owner: Marostegui)
[05:53:06] (CR) Marostegui: "can you paste a final puppet compiler url result here too?
(for the record)" [puppet] - https://gerrit.wikimedia.org/r/606708 (https://phabricator.wikimedia.org/T255409) (owner: Kormat)
[05:55:24] Operations, ops-eqiad, DBA, decommission-hardware, Patch-For-Review: decommission dbproxy1008.eqiad.wmnet - https://phabricator.wikimedia.org/T255406 (Marostegui)
[05:59:47] !log disable peering BGP sessions on AMS-IX - T253970
[05:59:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:59:52] T253970: No LACP info for cr2-esams:ae2 - https://phabricator.wikimedia.org/T253970
[06:02:01] Operations, ops-eqiad, decommission-hardware: decommission dbproxy1008.eqiad.wmnet - https://phabricator.wikimedia.org/T255406 (Marostegui)
[06:02:55] Operations, ops-eqiad, DC-Ops, decommission-hardware: decommission dbproxy1008.eqiad.wmnet - https://phabricator.wikimedia.org/T255406 (Marostegui) This is ready for #dc-ops.
[06:04:00] (CR) Nikerabbit: [C: +1] Set proper language code for some wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/607235 (https://phabricator.wikimedia.org/T250810) (owner: DCausse)
[06:05:22] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS6939/IPv6: Idle - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[06:08:10] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 90 probes of 564 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[06:10:58] Operations, Wikimedia-Mailing-lists: Requesting for new mailing list - https://phabricator.wikimedia.org/T256193 (Diptanshu.D)
[06:12:20] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 91, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:13:50] RECOVERY -
IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 49 probes of 564 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[06:16:26] RECOVERY - BGP status on cr2-esams is OK: BGP OK - up: 15, down: 1, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[06:17:26] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 93, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:19:34] (PS5) Nikerabbit: Remove TranslationNotifications user settings 1/2 [mediawiki-config] - https://gerrit.wikimedia.org/r/603167 (https://phabricator.wikimedia.org/T144780) (owner: DannyS712)
[06:19:36] (PS1) Nikerabbit: Remove TranslationNotifications user settings 2/2 [mediawiki-config] - https://gerrit.wikimedia.org/r/607414 (https://phabricator.wikimedia.org/T144780)
[06:20:14] Operations, Wikimedia-Mailing-lists: Request for new mailing list for ILAE English Wikipedia project - https://phabricator.wikimedia.org/T256193 (Aklapper)
[06:28:48] !log enable peering BGP sessions on AMS-IX - T253970
[06:28:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:28:52] T253970: No LACP info for cr2-esams:ae2 - https://phabricator.wikimedia.org/T253970
[06:29:47] Operations, netops: No LACP info for cr2-esams:ae2 - https://phabricator.wikimedia.org/T253970 (ayounsi) Open→Resolved a:ayounsi LACP is now up and running.
[06:53:39] !log draining ganeti1009 for eventual reboot
[06:53:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:58:08] RECOVERY - Disk space on an-launcher1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-launcher1001&var-datasource=eqiad+prometheus/ops
[06:59:03] Operations, ops-codfw, netops: codfw: rack/setup new srx300 (mr1) - https://phabricator.wikimedia.org/T255577 (ayounsi) Indeed, my bad. `mr1-codfw# replace pattern ge-0/0/7 with ge-0/0/5` Done.
[06:59:14] (PS2) Muehlenhoff: check-microcode.py: Remove dbproxy1008 [puppet] - https://gerrit.wikimedia.org/r/607411 (https://phabricator.wikimedia.org/T255406) (owner: Marostegui)
[07:05:11] (CR) Muehlenhoff: [C: +2] check-microcode.py: Remove dbproxy1008 [puppet] - https://gerrit.wikimedia.org/r/607411 (https://phabricator.wikimedia.org/T255406) (owner: Marostegui)
[07:06:00] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 58 probes of 567 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:09:30] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 58 probes of 567 (alerts on 50) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:11:40] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 49 probes of 567 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:15:20] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 47 probes of 567 (alerts
on 50) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:16:14] Operations, ops-eqiad, DC-Ops, decommission-hardware: decommission dbproxy1008.eqiad.wmnet - https://phabricator.wikimedia.org/T255406 (Marostegui) a:Marostegui→wiki_willy
[07:16:20] PROBLEM - Query Service HTTP Port on wdqs1007 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[07:21:10] RECOVERY - Check systemd state on an-launcher1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:22:37] !log restarting blazegraph on wdqs1007
[07:22:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:23:32] RECOVERY - Query Service HTTP Port on wdqs1007 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.042 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[07:24:30] PROBLEM - WDQS high update lag on wdqs1007 is CRITICAL: 1.162e+05 ge 4.32e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[07:30:02] (PS1) Marostegui: install_server: Reimage dbproxy1021 to buster [puppet] - https://gerrit.wikimedia.org/r/607436 (https://phabricator.wikimedia.org/T255408)
[07:31:03] (CR) Marostegui: [C: +2] install_server: Reimage dbproxy1021 to buster [puppet] - https://gerrit.wikimedia.org/r/607436 (https://phabricator.wikimedia.org/T255408) (owner: Marostegui)
[07:32:59] ACKNOWLEDGEMENT - WDQS high update lag on wdqs1007 is CRITICAL: 1.164e+05 ge 4.32e+04 Gehel lag catching up after restart of blazegraph https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag
https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[07:42:03] (PS10) Elukey: Introduce profile::mariadb::misc::analytics [puppet] - https://gerrit.wikimedia.org/r/553742 (https://phabricator.wikimedia.org/T234826)
[07:47:31] (CR) Elukey: "new pcc https://puppet-compiler.wmflabs.org/compiler1001/23410/" [puppet] - https://gerrit.wikimedia.org/r/553742 (https://phabricator.wikimedia.org/T234826) (owner: Elukey)
[07:53:17] (PS3) Privacybatm: transferpy: Use logging package instead of print statements [software/transferpy] - https://gerrit.wikimedia.org/r/607028 (https://phabricator.wikimedia.org/T255999)
[07:54:58] (CR) Privacybatm: transferpy: Use logging package instead of print statements (1 comment) [software/transferpy] - https://gerrit.wikimedia.org/r/607028 (https://phabricator.wikimedia.org/T255999) (owner: Privacybatm)
[07:59:52] (PS1) Elukey: Reimage db1108 to Debian Buster [puppet] - https://gerrit.wikimedia.org/r/607438 (https://phabricator.wikimedia.org/T234826)
[08:00:44] !log disable puppet in eqiad to unblock puppetdb1002 VM migration
[08:00:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:01:15] (CR) Marostegui: [C: +1] Reimage db1108 to Debian Buster [puppet] - https://gerrit.wikimedia.org/r/607438 (https://phabricator.wikimedia.org/T234826) (owner: Elukey)
[08:01:20] thanks :)
[08:01:31] elukey check the above !log from moritzm as that will affect you :)
[08:01:31] Operations, LDAP-Access-Requests: NDA for superset access request from WMDE employee danshick - https://phabricator.wikimedia.org/T254442 (danshick-wmde) Signed. Thank you all!
[08:01:44] (CR) Elukey: [C: +2] Reimage db1108 to Debian Buster [puppet] - https://gerrit.wikimedia.org/r/607438 (https://phabricator.wikimedia.org/T234826) (owner: Elukey)
[08:02:11] yes yes I am only merging, will wait for Moritz to finish
[08:02:25] well I'll also wait to puppet-merge
[08:02:28] just in case :)
[08:02:30] elukey: should be 5 minutes max, I'll ping you
[08:02:51] np I am not in a rush, a good excuse for a coffee (you know these italians)
[08:03:42] happy to provide coffee breaks along :-)
[08:04:29] !log marostegui@cumin2001 START - Cookbook sre.hosts.downtime
[08:04:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:31] !log marostegui@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[08:04:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:06:42] !log re-enable puppet in eqiad
[08:06:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:49] elukey: done
[08:08:06] Operations, LDAP-Access-Requests, WMF-Legal: Add Guergana Tzatchkova to the ldap/wmde group - https://phabricator.wikimedia.org/T256201 (guergana.tzatchkova)
[08:13:36] (CR) Jcrespo: [C: +1] "Check the grammar of description:" [software/transferpy] - https://gerrit.wikimedia.org/r/602754 (https://phabricator.wikimedia.org/T253736) (owner: Privacybatm)
[08:18:38] moritzm: ack thanks!
[08:21:25] (PS2) Muehlenhoff: Reduce TTL for IDP CNAMEs to 5 minutes [dns] - https://gerrit.wikimedia.org/r/607299
[08:22:15] (PS1) Kormat: Revert "mariadb: Silence notifications for db1088" [puppet] - https://gerrit.wikimedia.org/r/607440 (https://phabricator.wikimedia.org/T255927)
[08:22:17] (PS1) Elukey: Revert "Reimage db1108 to Debian Buster" [puppet] - https://gerrit.wikimedia.org/r/607441
[08:22:46] :(
[08:23:55] (CR) Elukey: [C: +2] "While doing the last checks I remembered about the 'staging' database, that seems to be used by one old ReportUpdater thing. I'll follow u" [puppet] - https://gerrit.wikimedia.org/r/607441 (owner: Elukey)
[08:24:30] (CR) Muehlenhoff: [C: +2] Reduce TTL for IDP CNAMEs to 5 minutes [dns] - https://gerrit.wikimedia.org/r/607299 (owner: Muehlenhoff)
[08:24:47] (CR) Kormat: [C: +2] Revert "mariadb: Silence notifications for db1088" [puppet] - https://gerrit.wikimedia.org/r/607440 (https://phabricator.wikimedia.org/T255927) (owner: Kormat)
[08:27:08] RECOVERY - dump of m5 in eqiad on db2093 is OK: Last dump for m5 at eqiad (db1117.eqiad.wmnet:3325) taken on 2020-06-24 07:33:37 (14 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[08:31:06] (CR) Muehlenhoff: "I don't think we really need this, the repo sees little activity and nothing reacts on it automatically.
The sole purpose of that profile " [puppet] - https://gerrit.wikimedia.org/r/607281 (owner: Muehlenhoff)
[08:31:21] !log kormat@cumin1001 dbctl commit (dc=all): 'Pool db1088 @ 20% into s6 T255927', diff saved to https://phabricator.wikimedia.org/P11645 and previous config saved to /var/cache/conftool/dbconfig/20200624-083120-kormat.json
[08:31:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:25] T255927: db1088 crashed - https://phabricator.wikimedia.org/T255927
[08:33:18] (CR) Kormat: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/606708 (https://phabricator.wikimedia.org/T255409) (owner: Kormat)
[08:37:18] (CR) Marostegui: [C: +1] "Compiler looks good" [puppet] - https://gerrit.wikimedia.org/r/606708 (https://phabricator.wikimedia.org/T255409) (owner: Kormat)
[08:40:30] !log prune remaining nginx packages on mw* servers T255565
[08:40:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:39] T255565: Remaining nginx packages on some mw servers - https://phabricator.wikimedia.org/T255565
[08:42:54] (CR) Marostegui: [C: +1] "Looks good now, the files for each instances are now created and the basedir is also fixed with my yesterday's merge." [puppet] - https://gerrit.wikimedia.org/r/553742 (https://phabricator.wikimedia.org/T234826) (owner: Elukey)
[08:43:08] * elukey dances
[08:45:41] (CR) Jcrespo: "I haven't reviewed, but I just wanted to say thank you! <3" [puppet] - https://gerrit.wikimedia.org/r/553742 (https://phabricator.wikimedia.org/T234826) (owner: Elukey)
[08:47:20] (PS1) Marostegui: install_server: Reimage db2120 to Buster [puppet] - https://gerrit.wikimedia.org/r/607447 (https://phabricator.wikimedia.org/T250666)
[08:47:54] (CR) Marostegui: [C: +2] install_server: Reimage db2120 to Buster [puppet] - https://gerrit.wikimedia.org/r/607447 (https://phabricator.wikimedia.org/T250666) (owner: Marostegui)
[08:49:30] (CR) Kormat: [C: +2] mariadb: Add 2 profiles to allow finer-grained cumin selection [puppet] - https://gerrit.wikimedia.org/r/606708 (https://phabricator.wikimedia.org/T255409) (owner: Kormat)
[08:52:42] Operations, SRE-Access-Requests: Requesting access to centralauth database for Jennifer Wang - https://phabricator.wikimedia.org/T255836 (ema)
[08:53:56] RECOVERY - dump of m1 in eqiad on db2093 is OK: Last dump for m1 at eqiad (db1117.eqiad.wmnet:3321) taken on 2020-06-24 07:34:41 (22 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[08:59:18] (PS10) Privacybatm: transferpy: Package transferpy [software/transferpy] - https://gerrit.wikimedia.org/r/602754 (https://phabricator.wikimedia.org/T253736)
[09:01:01] (CR) Privacybatm: "> Patch Set 9: Code-Review+1" [software/transferpy] - https://gerrit.wikimedia.org/r/602754 (https://phabricator.wikimedia.org/T253736) (owner: Privacybatm)
[09:01:24] Operations: Create ssh keypair for integration/docroot deployment with scap - https://phabricator.wikimedia.org/T256138 (ema) p:Triage→Medium a:ema
[09:01:56] RECOVERY - dump of m3 in eqiad on db2093 is OK: Last dump for m3 at eqiad (db1117.eqiad.wmnet:3323) taken on 2020-06-24 07:34:41 (57 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[09:02:14] (CR) Privacybatm: [C: +1] "> Patch Set 10:" [software/transferpy] - https://gerrit.wikimedia.org/r/602754 (https://phabricator.wikimedia.org/T253736) (owner:
Privacybatm)
[09:04:46] (CR) Jcrespo: [C: +2] transferpy: Package transferpy [software/transferpy] - https://gerrit.wikimedia.org/r/602754 (https://phabricator.wikimedia.org/T253736) (owner: Privacybatm)
[09:10:38] !log marostegui@cumin2001 START - Cookbook sre.hosts.downtime
[09:10:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:44] (CR) Volans: "Glad to see efforts to automate complex procedures! Some comment/question/doubt inline." (18 comments) [cookbooks] - https://gerrit.wikimedia.org/r/606736 (https://phabricator.wikimedia.org/T244499) (owner: Elukey)
[09:13:14] !log marostegui@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:13:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:28] Operations, LDAP-Access-Requests, WMF-Legal: Add Guergana Tzatchkova to the ldap/wmde group - https://phabricator.wikimedia.org/T256201 (ema) p:Triage→Medium
[09:14:56] (PS3) Hashar: ci: switch integration.wikimedia.org to scap DocumentRoot [puppet] - https://gerrit.wikimedia.org/r/607076 (https://phabricator.wikimedia.org/T149924)
[09:22:22] Operations, LDAP-Access-Requests: NDA for superset access request from WMDE employee danshick - https://phabricator.wikimedia.org/T254442 (ema) @KFrancis: let me know when Dan is added to the NDA and MOU spreadsheet so that I can carry on with this request. Thanks!
[09:23:23] (CR) Hnowlan: [C: +1] EventBus: Emit kafka purges for everything [mediawiki-config] - https://gerrit.wikimedia.org/r/607298 (https://phabricator.wikimedia.org/T250781) (owner: Ppchelko)
[09:29:15] (PS6) JMeybohm: WIP: chartmuseum: Add initial module, profile and role [puppet] - https://gerrit.wikimedia.org/r/606956 (https://phabricator.wikimedia.org/T253843)
[09:30:52] (CR) JMeybohm: WIP: chartmuseum: Add initial module, profile and role (1 comment) [puppet] - https://gerrit.wikimedia.org/r/606956 (https://phabricator.wikimedia.org/T253843) (owner: JMeybohm)
[09:35:31] (PS2) Arturo Borrero Gonzalez: toolforge: mailrelay: enforce ratelimiting [puppet] - https://gerrit.wikimedia.org/r/607320 (https://phabricator.wikimedia.org/T175964)
[09:36:25] !log kormat@cumin1001 dbctl commit (dc=all): 'Pool db1088 @ 50% into s6 T255927', diff saved to https://phabricator.wikimedia.org/P11647 and previous config saved to /var/cache/conftool/dbconfig/20200624-093624-kormat.json
[09:36:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:36:29] T255927: db1088 crashed - https://phabricator.wikimedia.org/T255927
[09:39:13] (CR) Alexandros Kosiaris: WIP: chartmuseum: Add initial module, profile and role (1 comment) [puppet] - https://gerrit.wikimedia.org/r/606956 (https://phabricator.wikimedia.org/T253843) (owner: JMeybohm)
[09:39:51] (CR) Kormat: "Please ping me when this is merged, i'd like to do a follow-up to add a couple of new profiles." [puppet] - https://gerrit.wikimedia.org/r/553742 (https://phabricator.wikimedia.org/T234826) (owner: Elukey)
[09:40:48] (PS1) Legoktm: Add initial puppetization for libraryupgrader [puppet] - https://gerrit.wikimedia.org/r/607452 (https://phabricator.wikimedia.org/T173478)
[09:41:10] kormat: 5 euros!
:P
[09:41:18] kidding, will do
[09:41:23] hehe
[09:41:25] thanks :)
[09:41:28] (PS3) Marostegui: mariadb: Promote db1097 to m1 master [puppet] - https://gerrit.wikimedia.org/r/606953 (https://phabricator.wikimedia.org/T254556)
[09:41:58] (CR) jerkins-bot: [V: -1] Add initial puppetization for libraryupgrader [puppet] - https://gerrit.wikimedia.org/r/607452 (https://phabricator.wikimedia.org/T173478) (owner: Legoktm)
[09:43:33] (PS2) Legoktm: Add initial puppetization for libraryupgrader [puppet] - https://gerrit.wikimedia.org/r/607452 (https://phabricator.wikimedia.org/T173478)
[09:45:36] (PS1) Gergő Tisza: Help panel home screen menu item fixes [extensions/GrowthExperiments] (wmf/1.35.0-wmf.38) - https://gerrit.wikimedia.org/r/607453 (https://phabricator.wikimedia.org/T255254)
[09:50:31] !log volans@cumin1001 START - Cookbook sre.dns.netbox
[09:50:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:39] !log kormat@cumin1001 dbctl commit (dc=all): 'Pool db1088 @ 75% into s6 T255927', diff saved to https://phabricator.wikimedia.org/P11648 and previous config saved to /var/cache/conftool/dbconfig/20200624-095338-kormat.json
[09:53:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:43] T255927: db1088 crashed - https://phabricator.wikimedia.org/T255927
[09:53:47] (CR) Volans: [C: +1] "Fair enough. LGTM" [puppet] - https://gerrit.wikimedia.org/r/607281 (owner: Muehlenhoff)
[09:55:15] (PS4) Kormat: mariadb: Add monitoring for lag spikes (v2) [puppet] - https://gerrit.wikimedia.org/r/607039 (https://phabricator.wikimedia.org/T253120)
[09:55:18] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:55:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:58:54] (CR) Elukey: "Thanks a lot for all the comments, going to fix and re-send another version!"
(15 comments) [cookbooks] - https://gerrit.wikimedia.org/r/606736 (https://phabricator.wikimedia.org/T244499) (owner: Elukey)
[09:59:52] Operations, Thumbor, Wikimedia-SVG-rendering, Upstream: Update librsvg to ≥2.42.3 - https://phabricator.wikimedia.org/T193352 (toorich) Stalled→Open Because of a lot of thumbnail error, I think now can re-open this task.
[10:00:03] (PS1) Privacybatm: transferpy: Add reference to deb package in the documentation [software/transferpy] - https://gerrit.wikimedia.org/r/607456 (https://phabricator.wikimedia.org/T253736)
[10:00:18] Operations, Thumbor, Wikimedia-SVG-rendering, Upstream: librsvg misinterpret quoted font family names that contain whitespaces - https://phabricator.wikimedia.org/T64987 (toorich)
[10:00:20] Operations, Thumbor, Wikimedia-SVG-rendering: Incorrect text positioning in SVG rasterization (scale/transform; font-size; kerning) - https://phabricator.wikimedia.org/T36947 (toorich)
[10:00:22] Operations, ops-eqiad, DBA, decommission-hardware: decommission dbproxy1003.eqiad.wmnet - https://phabricator.wikimedia.org/T256216 (Marostegui)
[10:00:27] Operations, ops-eqiad, DBA, decommission-hardware: decommission dbproxy1003.eqiad.wmnet - https://phabricator.wikimedia.org/T256216 (Marostegui)
[10:01:06] !log Production management IP allocation must be done from Netbox from now on, see https://wikitech.wikimedia.org/wiki/DNS/Netbox#Cutoff_dates
[10:01:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:01:53] Operations, Thumbor, Wikimedia-SVG-rendering, Upstream: Update librsvg to ≥2.42.3 - https://phabricator.wikimedia.org/T193352 (toorich) Because of a lot of thumbnail error, I think now we can re-open this task.
[10:04:43] (PS1) Jbond: add wikiodough alias [labs/private] - https://gerrit.wikimedia.org/r/607458
[10:06:14] (PS2) Jbond: add wikiodough alias [labs/private] - https://gerrit.wikimedia.org/r/607458
[10:06:41] (CR) Jbond: [V: +2 C: +2] add wikiodough alias [labs/private] - https://gerrit.wikimedia.org/r/607458 (owner: Jbond)
[10:08:56] (PS1) Jbond: role::wikidough: test dot notation with alias interpolation token [puppet] - https://gerrit.wikimedia.org/r/607460
[10:10:09] (CR) jerkins-bot: [V: -1] role::wikidough: test dot notation with alias interpolation token [puppet] - https://gerrit.wikimedia.org/r/607460 (owner: Jbond)
[10:11:49] (PS2) Jbond: role::wikidough: test dot notation with alias interpolation token [puppet] - https://gerrit.wikimedia.org/r/607460
[10:13:57] Puppet, User-jbond: Investigate hiera lookup dot notation - https://phabricator.wikimedia.org/T256221 (jbond) p:Triage→Medium
[10:16:30] (PS3) Jbond: role::wikidough: test dot notation with alias interpolation token [puppet] - https://gerrit.wikimedia.org/r/607460 (https://phabricator.wikimedia.org/T256221)
[10:19:23] (CR) Muehlenhoff: [C: +2] Add pwstore (dummy) profile [puppet] - https://gerrit.wikimedia.org/r/607281 (owner: Muehlenhoff)
[10:20:06] Operations, Traffic, serviceops, affects-Kiwix-and-openZIM: ETAG response headers not always with double-quotes - https://phabricator.wikimedia.org/T256217 (ema) p:Triage→Medium
[10:20:07] (Abandoned) Jbond: role::wikidough: test dot notation with alias interpolation token [puppet] - https://gerrit.wikimedia.org/r/607460 (https://phabricator.wikimedia.org/T256221) (owner: Jbond)
[10:21:01] Puppet, Patch-For-Review, User-jbond: Investigate hiera lookup dot notation - https://phabricator.wikimedia.org/T256221 (jbond) related change: https://gerrit.wikimedia.org/r/c/labs/private/+/607458
[10:21:17] (PS1) Jbond: Revert "add wikiodough alias"
[labs/private] - 10https://gerrit.wikimedia.org/r/607462 (https://phabricator.wikimedia.org/T256221) [10:22:23] (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "add wikiodough alias" [labs/private] - 10https://gerrit.wikimedia.org/r/607462 (https://phabricator.wikimedia.org/T256221) (owner: 10Jbond) [10:24:07] (03PS1) 10Muehlenhoff: Fix permissions for pwstore dir [puppet] - 10https://gerrit.wikimedia.org/r/607463 [10:25:50] (03CR) 10Volans: [C: 03+1] "Right!" [puppet] - 10https://gerrit.wikimedia.org/r/607463 (owner: 10Muehlenhoff) [10:27:05] (03CR) 10Muehlenhoff: [C: 03+2] Fix permissions for pwstore dir [puppet] - 10https://gerrit.wikimedia.org/r/607463 (owner: 10Muehlenhoff) [10:28:31] (03PS1) 10Marostegui: dbproxy1003: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/607464 (https://phabricator.wikimedia.org/T256216) [10:29:12] (03CR) 10Marostegui: [C: 03+2] dbproxy1003: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/607464 (https://phabricator.wikimedia.org/T256216) (owner: 10Marostegui) [10:30:08] jbond42: you've got changes pending to merge in puppet [10:30:33] 10Operations, 10ops-eqiad, 10DBA, 10decommission-hardware, 10Patch-For-Review: decommission dbproxy1003.eqiad.wmnet - https://phabricator.wikimedia.org/T256216 (10Marostegui) [10:30:41] marostegui: thanks merging [10:34:50] 10Operations, 10Thumbor, 10Wikimedia-SVG-rendering, 10Upstream: Update librsvg to ≥2.42.3 - https://phabricator.wikimedia.org/T193352 (10Aklapper) 05Open→03Stalled @toorich: This has nothing to do with "lots of thumbnail errors". See previous comments that this is [stalled](https://www.mediawiki.org/wi... [10:35:14] 10Operations, 10Thumbor, 10Wikimedia-SVG-rendering, 10Upstream: librsvg misinterpret quoted font family names that contain whitespaces - https://phabricator.wikimedia.org/T64987 (10Aklapper) [10:35:15] FYI, I'll temporarily power down gerrit1002 to unblock a reboot of a Ganeti virtualisation node. 
gerrit1002 it's the VM powering the test instance gerrit-test.wikimedia.org, it should be back up in approx half an hour [10:35:16] 10Operations, 10Thumbor, 10Wikimedia-SVG-rendering: Incorrect text positioning in SVG rasterization (scale/transform; font-size; kerning) - https://phabricator.wikimedia.org/T36947 (10Aklapper) [10:36:22] (03CR) 10Jcrespo: [C: 03+2] transferpy: Add reference to deb package in the documentation [software/transferpy] - 10https://gerrit.wikimedia.org/r/607456 (https://phabricator.wikimedia.org/T253736) (owner: 10Privacybatm) [10:36:24] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [10:36:25] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:36:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:04] !log Stop haproxy on dbproxy1003 T256216 [10:38:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:08] T256216: decommission dbproxy1003.eqiad.wmnet - https://phabricator.wikimedia.org/T256216 [10:38:41] 10Operations, 10ops-eqiad, 10DBA, 10decommission-hardware, 10Patch-For-Review: decommission dbproxy1003.eqiad.wmnet - https://phabricator.wikimedia.org/T256216 (10Marostegui) haproxy stopped, let's give it a few days to make sure nothing breaks. 
[10:39:05] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [10:39:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:44] PROBLEM - Host etcd1001 is DOWN: PING CRITICAL - Packet loss = 100% [10:42:23] ^ ganeti1009 reboot, expected [10:43:01] (03CR) 10Jbond: [C: 03+1] "lgtm unrelated comment" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/607368 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [10:45:02] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [10:45:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:32] RECOVERY - Host etcd1001 is UP: PING OK - Packet loss = 0%, RTA = 1.19 ms [10:50:37] 10Operations, 10DBA, 10CAS-SSO, 10User-jbond: Request new database for idp-test.wikimedia.org - https://phabricator.wikimedia.org/T256120 (10Marostegui) As per our IRC chat, I have changed the database to `cas_test` [10:52:00] (03PS1) 10JMeybohm: profile: thanos::swift::frontend add account for chartmuseum [puppet] - 10https://gerrit.wikimedia.org/r/607467 (https://phabricator.wikimedia.org/T256020) [10:52:22] (03CR) 10jerkins-bot: [V: 04-1] profile: thanos::swift::frontend add account for chartmuseum [puppet] - 10https://gerrit.wikimedia.org/r/607467 (https://phabricator.wikimedia.org/T256020) (owner: 10JMeybohm) [10:53:11] 10Operations, 10DBA, 10CAS-SSO, 10User-jbond: Request new database for idp-test.wikimedia.org - https://phabricator.wikimedia.org/T256120 (10jcrespo) Please ping me when db is setup but before closing this ticket to make sure backups are correctly configured, as this seems to be an important database to no... [10:59:49] (03PS4) 10MSantos: charts for push-notification service [deployment-charts] - 10https://gerrit.wikimedia.org/r/602390 (https://phabricator.wikimedia.org/T250493) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. 
Rise for European mid-day backport window(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200624T1100). [11:00:04] awight, Nikerabbit, and tgr: A patch you scheduled for European mid-day backport window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:23] I can deploy today. Nikerabbit do you care if I merge your config changes, or would you prefer to do it yourself? [11:01:05] (03CR) 10Awight: [C: 03+2] "BACON" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605255 (https://phabricator.wikimedia.org/T254458) (owner: 10Andrew-WMDE) [11:02:02] (03Merged) 10jenkins-bot: TwoColConflict: Talk page small deployment CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605255 (https://phabricator.wikimedia.org/T254458) (owner: 10Andrew-WMDE) [11:02:56] (03CR) 10Awight: [C: 03+2] "The redundant default in CommonSettings-labs.php can be removed as well." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605255 (https://phabricator.wikimedia.org/T254458) (owner: 10Andrew-WMDE) [11:07:02] 10Puppet, 10Patch-For-Review, 10User-jbond: Investigate hiera lookup dot notation - https://phabricator.wikimedia.org/T256221 (10jbond) [11:08:57] awight: sorry I'm late [11:09:01] !log awight@deploy1001 Synchronized wmf-config/CommonSettings.php: BACON: [[gerrit:605255|TwoColConflict: Talk page small deployment CommonSettings.php (T254458)]] (duration: 01m 17s) [11:09:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:06] T254458: Deploy talk page edit conflict interface to small set of wikis - https://phabricator.wikimedia.org/T254458 [11:09:19] Nikerabbit: Perfect timing, actually :-) Your patches are up now. 
[11:09:29] (03PS1) 10JMeybohm: thanos::swift add chartmuseum account key [labs/private] - 10https://gerrit.wikimedia.org/r/607468 (https://phabricator.wikimedia.org/T256020) [11:10:17] awight: cool, do you know what is the process of updating PrivateSettings.php? does it need to be committed? or just synced as usual? [11:11:22] Nikerabbit: oof, unfortunately I can't answer... [11:11:45] awight: ok, well let's do the normal changes first [11:12:08] 10Operations, 10DBA, 10CAS-SSO, 10User-jbond: Request new database for idp-test.wikimedia.org - https://phabricator.wikimedia.org/T256120 (10Marostegui) @jbond users created with access from the requested hosts: ` +--------------------------------------------------------------------------------------------... [11:12:19] Nikerabbit: Just to be sure, you're doing the deployment? [11:12:44] (:check-mark:, I see u active on the server!) [11:13:30] awight: umm sure, let me pull up the documentation [11:13:39] Nikerabbit: yeah, its a separate repo [11:13:51] awight: I can +2 the first one already? [11:14:00] Nikerabbit: I'm happy to continue deploying and do your patches! [11:14:09] Nikerabbit: Yes, you're clear now. [11:15:10] (03CR) 10Nikerabbit: [C: 03+2] "For deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603167 (https://phabricator.wikimedia.org/T144780) (owner: 10DannyS712) [11:15:18] 10Operations, 10DBA, 10CAS-SSO, 10User-jbond: Request new database for idp-test.wikimedia.org - https://phabricator.wikimedia.org/T256120 (10Marostegui) Once this is tested and ready to move to production m1, I will work on the .sql files to keep track of the new grants for the dbproxies IPs. @jbond rememb... 
[11:16:06] (03Restored) 10Jbond: build.gradle: add memcached support to cas blob [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/592659 (https://phabricator.wikimedia.org/T233931) (owner: 10Jbond) [11:16:49] 10Operations, 10serviceops: Remaining nginx packages on some mw servers - https://phabricator.wikimedia.org/T255565 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff This is complete [11:16:53] awight: reasonable to skip deploying to a debug server for manual testing? [11:17:03] 10Operations, 10SRE-swift-storage, 10serviceops, 10Patch-For-Review: Access to the thanos-swift cluster for ChartMuseum - https://phabricator.wikimedia.org/T256020 (10JMeybohm) Commit in private is `e427c266f2d6ac0a937bf5d972b759933a9f9a18` [11:17:09] (03CR) 10Jbond: [V: 03+2 C: 03+2] "the deb in proiduction is based of this so merging" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/592659 (https://phabricator.wikimedia.org/T233931) (owner: 10Jbond) [11:17:25] (03CR) 10Ssingh: ">" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/607368 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [11:17:35] Nikerabbit: That's entirely up to you, if you think it'll be useful or not. 
[11:18:08] (03CR) 10Ssingh: ">" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/607368 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [11:18:32] (03CR) 10Ssingh: [C: 03+2] wikidough: update dnsdist web server listen address [puppet] - 10https://gerrit.wikimedia.org/r/607368 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [11:18:54] awight: imho the only thing to watch out is for undefined variable warnings (have confirmed with code search that these variables are not in use) [11:20:11] (03PS1) 10Jbond: JPA: add jpa support for u2f tokens [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/607469 (https://phabricator.wikimedia.org/T256113) [11:20:15] +1 that makes sense, especially if it's non-fatal [11:21:23] tgr: gotcha, so commit it locally and then sync [11:21:36] yeah [11:21:42] * awight learns [11:22:14] (03PS6) 10Nikerabbit: Remove TranslationNotifications user settings 1/2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603167 (https://phabricator.wikimedia.org/T144780) (owner: 10DannyS712) [11:22:38] (03CR) 10Nikerabbit: [C: 03+2] "For deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603167 (https://phabricator.wikimedia.org/T144780) (owner: 10DannyS712) [11:23:07] It was saying Merge conflict and it was not submitted, so I rebased and +2ed again (I hope that was correct thing to do) [11:23:13] Nikerabbit: historically, you also needed to touch symlinks before the sync. Not sure if that's still the case or was HHVM-specific. 
[11:23:18] (03PS2) 10Nikerabbit: Remove TranslationNotifications user settings 2/2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607414 (https://phabricator.wikimedia.org/T144780) [11:23:29] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/607469 (https://phabricator.wikimedia.org/T256113) (owner: 10Jbond) [11:23:31] (03Merged) 10jenkins-bot: Remove TranslationNotifications user settings 1/2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/603167 (https://phabricator.wikimedia.org/T144780) (owner: 10DannyS712) [11:24:06] tgr: hmm no idea which symlink [11:24:18] PrivateSettings.php [11:25:16] ah [11:25:19] will do that [11:26:01] (touch -h because otherwise it would just touch the symlink target) [11:27:11] oh, good to know [11:28:05] !log nikerabbit@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [config] 603167 Remove TranslationNotifications user settings 1/2 (duration: 01m 03s) [11:28:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:21] logs look clean, going forward with second patch [11:30:12] (03CR) 10Nikerabbit: [C: 03+2] Remove TranslationNotifications user settings 2/2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607414 (https://phabricator.wikimedia.org/T144780) (owner: 10Nikerabbit) [11:30:49] tgr: oh actually, I am not sure where the symlink actually is [11:31:08] (03Merged) 10jenkins-bot: Remove TranslationNotifications user settings 2/2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607414 (https://phabricator.wikimedia.org/T144780) (owner: 10Nikerabbit) [11:31:18] /srv/mediawiki-staging/wmf-config/PrivateSettings.php, I think [11:31:52] no such file... I assume it's being loaded without symlink these days? 
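Aside on the `touch -h` remark above: a minimal sketch, in Python, of the difference between touching a symlink and touching its target. The file names are hypothetical and live in a throwaway temp directory; this assumes a platform where `os.utime` supports `follow_symlinks=False` (Linux does), and it is not the actual deploy procedure.

```python
import os
import tempfile
import time


def touch_symlink(path):
    """Update the timestamps of the symlink itself, roughly like `touch -h`."""
    os.utime(path, follow_symlinks=False)


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical stand-ins for a real file and a symlink pointing at it.
    target = os.path.join(tmp, "target.php")
    link = os.path.join(tmp, "link.php")
    open(target, "w").close()
    os.symlink(target, link)

    # Backdate both the target and the link by an hour.
    old = time.time() - 3600
    os.utime(target, (old, old))
    os.utime(link, (old, old), follow_symlinks=False)

    # Touch only the link; the target keeps its old mtime.
    touch_symlink(link)

    link_mtime = os.lstat(link).st_mtime    # mtime of the link itself
    target_mtime = os.stat(target).st_mtime  # mtime of the file it points to
    link_touched = link_mtime > target_mtime
```

Without `follow_symlinks=False` (or without `-h` on the command line), the call follows the link and updates the target instead, which is the pitfall tgr points out.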
[11:33:00] T126306 was the relevant task [11:33:00] T126306: Eliminate symlinks in mediawiki-config (as much as possible) - https://phabricator.wikimedia.org/T126306 [11:34:00] it's not mentioned there, at least [11:34:35] oh, it actually is: https://phabricator.wikimedia.org/T126306#3616529 [11:35:23] cool [11:35:59] !log nikerabbit@deploy1001 Synchronized private/readme.php: [config] 607414 Remove TranslationNotifications user settings 2/2 (duration: 01m 04s) [11:36:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:47] now doing privatesettings [11:40:21] !log nikerabbit@deploy1001 Synchronized private/PrivateSettings.php: Remove TranslationNotifications user settings 3/2 (duration: 01m 06s) [11:40:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:16] nothing relevant on the logs so far. tgr, awight: I think you can continue now [11:42:45] tgr: May I deploy your backport? [11:42:58] awight: yes, thanks! [11:43:57] ack [11:44:07] tgr, awight: thanks for the help and support :) [11:44:30] (03CR) 10Awight: [C: 03+2] "BACON" [extensions/GrowthExperiments] (wmf/1.35.0-wmf.38) - 10https://gerrit.wikimedia.org/r/607453 (https://phabricator.wikimedia.org/T255254) (owner: 10Gergő Tisza) [11:52:12] (03PS1) 10Jbond: apereo_cas: add support to store u2f using JPA [puppet] - 10https://gerrit.wikimedia.org/r/607475 (https://phabricator.wikimedia.org/T256113) [11:52:14] (03PS1) 10Jbond: idp_test: enable u2f jpa [puppet] - 10https://gerrit.wikimedia.org/r/607476 (https://phabricator.wikimedia.org/T256120) [11:52:30] (03PS3) 10Arturo Borrero Gonzalez: toolforge: mailrelay: enforce ratelimiting [puppet] - 10https://gerrit.wikimedia.org/r/607320 (https://phabricator.wikimedia.org/T175964) [11:52:32] (03Merged) 10jenkins-bot: Help panel home screen menu item fixes [extensions/GrowthExperiments] (wmf/1.35.0-wmf.38) - 10https://gerrit.wikimedia.org/r/607453 (https://phabricator.wikimedia.org/T255254) (owner: 
10Gergő Tisza) [11:53:23] (03CR) 10jerkins-bot: [V: 04-1] apereo_cas: add support to store u2f using JPA [puppet] - 10https://gerrit.wikimedia.org/r/607475 (https://phabricator.wikimedia.org/T256113) (owner: 10Jbond) [11:55:17] (03PS1) 10Ssingh: wikidough: improve naming of hiera keys and class variables [puppet] - 10https://gerrit.wikimedia.org/r/607477 [11:55:19] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: mailrelay: enforce ratelimiting [puppet] - 10https://gerrit.wikimedia.org/r/607320 (https://phabricator.wikimedia.org/T175964) (owner: 10Arturo Borrero Gonzalez) [11:57:03] (03PS2) 10Jbond: apereo_cas: add support to store u2f using JPA [puppet] - 10https://gerrit.wikimedia.org/r/607475 (https://phabricator.wikimedia.org/T256113) [11:57:53] 10Operations, 10serviceops, 10Sustainability (Incident Prevention): Increase capacity of the sessionstore dedicated kubernetes nodes - https://phabricator.wikimedia.org/T256236 (10akosiaris) [11:58:00] 10Operations, 10serviceops, 10Sustainability (Incident Prevention): Increase capacity of the sessionstore dedicated kubernetes nodes - https://phabricator.wikimedia.org/T256236 (10akosiaris) p:05Triage→03High [11:58:09] (03CR) 10jerkins-bot: [V: 04-1] apereo_cas: add support to store u2f using JPA [puppet] - 10https://gerrit.wikimedia.org/r/607475 (https://phabricator.wikimedia.org/T256113) (owner: 10Jbond) [11:59:46] tgr: Your change is live on mwdebug1001 [11:59:53] (03PS1) 10Ssingh: wikidough: update key name in wikidough.yaml [labs/private] - 10https://gerrit.wikimedia.org/r/607480 [12:00:09] (03CR) 10Ssingh: [V: 03+2 C: 03+2] wikidough: update key name in wikidough.yaml [labs/private] - 10https://gerrit.wikimedia.org/r/607480 (owner: 10Ssingh) [12:00:36] (03PS3) 10Jbond: apereo_cas: add support to store u2f using JPA [puppet] - 10https://gerrit.wikimedia.org/r/607475 (https://phabricator.wikimedia.org/T256113) [12:01:44] (03PS4) 10Jbond: apereo_cas: add support to store u2f using JPA [puppet] - 
10https://gerrit.wikimedia.org/r/607475 (https://phabricator.wikimedia.org/T256113) [12:01:46] (03CR) 10jerkins-bot: [V: 04-1] apereo_cas: add support to store u2f using JPA [puppet] - 10https://gerrit.wikimedia.org/r/607475 (https://phabricator.wikimedia.org/T256113) (owner: 10Jbond) [12:01:52] awight: works, thanks! [12:02:01] great, continuing [12:02:07] 10Operations, 10serviceops, 10Sustainability (Incident Prevention): Increase capacity of the sessionstore dedicated kubernetes nodes - https://phabricator.wikimedia.org/T256236 (10akosiaris) Currently, sessionstore sets a limit of 400Mi and 2.5 CPUs[1]. Memory wise, the nodes have 4GB RAM and 6 CPUs. The eas... [12:02:08] Sorry about the delay :-) [12:03:10] !log awight@deploy1001 Synchronized php-1.35.0-wmf.38/extensions/GrowthExperiments: BACON: [[gerrit:607453|Help panel home screen menu item fixes (T255254)]] (duration: 01m 06s) [12:03:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:14] T255254: Newcomer tasks: Fix back-then-close issue when using guidance with editing - https://phabricator.wikimedia.org/T255254 [12:03:38] (03PS2) 10Jbond: idp_test: enable u2f jpa [puppet] - 10https://gerrit.wikimedia.org/r/607476 (https://phabricator.wikimedia.org/T256120) [12:04:02] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/607476 (https://phabricator.wikimedia.org/T256120) (owner: 10Jbond) [12:04:25] !log EU vegan BACON cooked [12:04:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:00] !log akosiaris@cumin1001 conftool action : set/pooled=no; selector: name=kubernetes2005.codfw.wmnet [12:05:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:04] !log akosiaris@cumin1001 conftool action : set/pooled=no; selector: name=kubernetes2006.codfw.wmnet [12:05:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:01] PROBLEM - High average GET latency for 
mw requests on api_appserver in codfw on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-m [12:08:27] (03CR) 10Jbond: [V: 03+2 C: 03+2] JPA: add jpa support for u2f tokens [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/607469 (https://phabricator.wikimedia.org/T256113) (owner: 10Jbond) [12:09:51] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [12:10:52] !log depool kubernetes2005,kubernetes2006 for CPU capacity increase [12:10:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:02] !log depool kubernetes2005,kubernetes2006 for CPU capacity increase T256236 [12:11:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:06] T256236: Increase capacity of the sessionstore dedicated kubernetes nodes - https://phabricator.wikimedia.org/T256236 [12:14:28] !log reboot kubernetes2005,6 for CPU capacity increase T256236 [12:14:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:17] (03CR) 10Muehlenhoff: apereo_cas: add support to store u2f using JPA (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/607475 (https://phabricator.wikimedia.org/T256113) (owner: 10Jbond) [12:17:53] !log depool/drain/reboot/pool kubernetes1005,6 for CPU capacity increase T256236 [12:17:56] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:56] T256236: Increase capacity of the sessionstore dedicated kubernetes nodes - https://phabricator.wikimedia.org/T256236 [12:18:13] (03CR) 10Muehlenhoff: apereo_cas: add support to store u2f using JPA (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/607475 (https://phabricator.wikimedia.org/T256113) (owner: 10Jbond) [12:19:11] !log akosiaris@cumin1001 conftool action : set/pooled=yes; selector: name=kubernetes2006.codfw.wmnet [12:19:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:14] !log akosiaris@cumin1001 conftool action : set/pooled=yes; selector: name=kubernetes2005.codfw.wmnet [12:19:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:22] !log akosiaris@cumin1001 conftool action : set/pooled=no; selector: name=kubernetes1005.eqiad.wmnet [12:19:25] !log akosiaris@cumin1001 conftool action : set/pooled=no; selector: name=kubernetes1006.eqiad.wmnet [12:19:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:44] (03PS5) 10Jbond: apereo_cas: add support to store u2f using JPA [puppet] - 10https://gerrit.wikimedia.org/r/607475 (https://phabricator.wikimedia.org/T256113) [12:23:05] (03CR) 10Jbond: "updated thanks" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/607475 (https://phabricator.wikimedia.org/T256113) (owner: 10Jbond) [12:23:42] !log akosiaris@cumin1001 conftool action : set/pooled=yes; selector: name=kubernetes1006.eqiad.wmnet [12:23:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:46] !log akosiaris@cumin1001 conftool action : set/pooled=yes; selector: name=kubernetes1005.eqiad.wmnet [12:23:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:00] (03PS5) 10MSantos: charts for push-notification service 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/602390 (https://phabricator.wikimedia.org/T250493) [12:28:55] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/607475 (https://phabricator.wikimedia.org/T256113) (owner: 10Jbond) [12:31:38] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/607476 (https://phabricator.wikimedia.org/T256120) (owner: 10Jbond) [12:32:52] (03PS1) 10Filippo Giunchedi: pontoon: fix storeconfigs type and hiera defaults [puppet] - 10https://gerrit.wikimedia.org/r/607491 [12:34:39] (03CR) 10Kormat: [C: 03+1] pontoon: fix storeconfigs type and hiera defaults [puppet] - 10https://gerrit.wikimedia.org/r/607491 (owner: 10Filippo Giunchedi) [12:35:24] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: fix storeconfigs type and hiera defaults [puppet] - 10https://gerrit.wikimedia.org/r/607491 (owner: 10Filippo Giunchedi) [12:38:11] (03CR) 10Majavah: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606733 (https://phabricator.wikimedia.org/T198673) (owner: 10Majavah) [12:41:07] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service: [EPIC] Deploy push-notifications service to production - https://phabricator.wikimedia.org/T256237 (10MSantos) [12:41:33] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service: [EPIC] Deploy push-notifications service to production - https://phabricator.wikimedia.org/T256237 (10MSantos) [12:42:01] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service: [EPIC] Deploy push-notifications service to production - https://phabricator.wikimedia.org/T256237 (10MSantos) [12:48:14] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service: [EPIC] Deploy push-notifications service to production - https://phabricator.wikimedia.org/T256237 (10MSantos) [12:50:30] (03PS2) 10Arturo Borrero Gonzalez: toolforge: mailrelay: 
collect exim metrics using prometheus [puppet] - 10https://gerrit.wikimedia.org/r/607324 (https://phabricator.wikimedia.org/T175964) [12:51:07] (03CR) 10Arturo Borrero Gonzalez: "tested this already by live-hacking the puppetmaster" [puppet] - 10https://gerrit.wikimedia.org/r/607324 (https://phabricator.wikimedia.org/T175964) (owner: 10Arturo Borrero Gonzalez) [12:51:46] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: mailrelay: collect exim metrics using prometheus [puppet] - 10https://gerrit.wikimedia.org/r/607324 (https://phabricator.wikimedia.org/T175964) (owner: 10Arturo Borrero Gonzalez) [12:53:13] (03PS1) 10Jbond: idp: enable memcached on production idp servers [puppet] - 10https://gerrit.wikimedia.org/r/607492 (https://phabricator.wikimedia.org/T256113) [12:54:07] PROBLEM - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is CRITICAL: 287.4 ge 210 https://phabricator.wikimedia.org/T202307 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [12:54:39] (03PS2) 10Arturo Borrero Gonzalez: toolforge: mailrelay: introduce SRS to correctly envelope forwarded emails [puppet] - 10https://gerrit.wikimedia.org/r/607279 (https://phabricator.wikimedia.org/T120225) [12:54:57] (03CR) 10Jbond: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/23420/" [puppet] - 10https://gerrit.wikimedia.org/r/607492 (https://phabricator.wikimedia.org/T256113) (owner: 10Jbond) [12:55:48] (03PS1) 10Jcrespo: mariadb-backups: Productionize db1145 after cloning db1102 into it [puppet] - 10https://gerrit.wikimedia.org/r/607493 (https://phabricator.wikimedia.org/T252512) [12:55:59] RECOVERY - Maps tiles generation on icinga1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [12:58:25] Nikerabbit: I think your deploy created a bunch of additional logging output? 
https://sal.toolforge.org/log/yYIh5nIBj_Bg1xd3AILK correlates with https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&refresh=5m [12:59:39] !log update metamonitoring to use icinga-extmon.wikimedia.org [12:59:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:04] brennen and hashar: My dear minions, it's time we take the moon! Just kidding. Time for Mediawiki train - American+European Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200624T1300). [13:02:36] (03PS6) 10Elukey: hadoop - Add change-distro.py and stop-cluster.py [cookbooks] - 10https://gerrit.wikimedia.org/r/606736 (https://phabricator.wikimedia.org/T244499) [13:04:38] Nikerabbit: looks like mostly DBReplication warnings on the jobrunners? https://logstash.wikimedia.org/goto/ee829814670e4142588ab98ad36bf2b3 [13:04:39] (03CR) 10Jcrespo: [C: 03+2] mariadb-backups: Productionize db1145 after cloning db1102 into it [puppet] - 10https://gerrit.wikimedia.org/r/607493 (https://phabricator.wikimedia.org/T252512) (owner: 10Jcrespo) [13:04:41] (03CR) 10jerkins-bot: [V: 04-1] hadoop - Add change-distro.py and stop-cluster.py [cookbooks] - 10https://gerrit.wikimedia.org/r/606736 (https://phabricator.wikimedia.org/T244499) (owner: 10Elukey) [13:05:27] (03PS7) 10Elukey: hadoop - Add change-distro.py and stop-cluster.py [cookbooks] - 10https://gerrit.wikimedia.org/r/606736 (https://phabricator.wikimedia.org/T244499) [13:05:54] ah snap I didn't see the -1 from jenkins, another one coming I guess [13:06:56] it fails for older py versions because of fstrings [13:07:32] (03CR) 10jerkins-bot: [V: 04-1] hadoop - Add change-distro.py and stop-cluster.py [cookbooks] - 10https://gerrit.wikimedia.org/r/606736 (https://phabricator.wikimedia.org/T244499) (owner: 10Elukey) [13:07:41] jbond42: tendril is down for me, could that be because of your idp change? [13:07:52] volans: hello hello, should we remove the py35 tox config? 
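Aside on the f-string failure mentioned above: f-strings were added in Python 3.6, so a file containing one fails to parse on 3.5. A minimal sketch of the two spellings follows; the function names and message text are hypothetical and not taken from the cookbooks change.

```python
def describe_fstring(name, count):
    # f-string syntax: a SyntaxError when the file is parsed by Python 3.5.
    return f"{name}: {count} hosts"


def describe_format(name, count):
    # str.format produces the same text and also parses on Python 3.5.
    return "{}: {} hosts".format(name, count)


same_output = describe_fstring("cluster", 3) == describe_format("cluster", 3)
```

Because the error is raised at parse time, guarding the f-string behind a runtime version check in the same file does not help; the options are to avoid the syntax or to drop the old interpreter from the test matrix.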
[13:07:57] (from cookbooks) [13:08:35] (03PS3) 10Jbond: idp_test: enable u2f jpa [puppet] - 10https://gerrit.wikimedia.org/r/607476 (https://phabricator.wikimedia.org/T256120) [13:08:52] marostegui: no, not yet, I have yet to merge anything for that [13:08:53] elukey: oh boy, my bad, I didn't manage to send the CR yesterday but I thought I did [13:08:57] let me send it right away [13:09:01] was just about to though, so will hold off [13:09:13] jbond42: it seems to get stuck at idp.wikimedia.org (https://tendril.wikimedia.org/report/slow_queries?host=^db&user=wikiuser&schema=wik&hours=1) [13:09:15] (03PS3) 10Ppchelko: EventBus: Emit kafka purges for everything [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607298 (https://phabricator.wikimedia.org/T250781) [13:11:43] marostegui: I'm really not sure, but if it's causing problems you can delete anything related to idp and we can recreate it later; as I said, I have not merged anything yet [13:12:13] jbond42: no, the database creation shouldn't mess with that I don't think [13:12:20] but also, it seems to get stuck at idp.wikimedia.org [13:12:31] not sure if something has changed, definitely nothing has changed at tendril's level [13:13:00] cas-icinga and cas-logstash stopped working for me a couple of hours ago [13:13:21] ahh ok let me look at that then [13:13:59] kormat: marostegui: can you try clearing cookies and try again.
i made a change yesterday which may cause issues [13:14:13] yeah I had a redirect loop on one of my computers that was solved by clearing idp cookies [13:14:18] also i didn;t think cas-logstash ever worked [13:15:06] (03PS1) 10Alexandros Kosiaris: Introduce kubernetes[12]01[56] [dns] - 10https://gerrit.wikimedia.org/r/607495 (https://phabricator.wikimedia.org/T256236) [13:15:06] jbond42: that's possible :) [13:15:14] i just tested it to see if my issue was icinga-specific [13:15:21] ahh ack [13:15:48] yeah, clearing cookies fixed it [13:15:48] thanks [13:15:55] great thanks [13:15:55] (03CR) 10jerkins-bot: [V: 04-1] Introduce kubernetes[12]01[56] [dns] - 10https://gerrit.wikimedia.org/r/607495 (https://phabricator.wikimedia.org/T256236) (owner: 10Alexandros Kosiaris) [13:16:03] * kormat guards cookies protectively [13:16:22] 🍪 [13:18:12] yeah, cas-logstash is broken and we can't really use it until we switch a Kibana release with SAML support in the FLOSS version [13:18:21] I'll make a patch to remove it [13:18:29] (03PS1) 10Volans: Remove support for Python 3.5 and 3.6 [software/spicerack] - 10https://gerrit.wikimedia.org/r/607496 [13:18:31] (03PS1) 10Volans: actions: refactor for more recent Python versions [software/spicerack] - 10https://gerrit.wikimedia.org/r/607497 [13:18:33] (03PS1) 10Volans: Add type hints for variables and attributes [software/spicerack] - 10https://gerrit.wikimedia.org/r/607498 [13:18:46] jbond42: removing the cookie for idp.wm.o and cas-icinga.wm.o fixed the issue for me [13:18:54] (removing _just_ the idp.wm.o cookie did not) [13:19:20] ack thanks, i suspect just removing the icinga one would work but not 100% [13:19:27] cdanis: sorry, saw ping now. did that get resolved? [13:19:37] Nikerabbit: no, looks like it's still ongoing [13:19:39] when did you establish your session? are you using a long term session (aka Remember me)? 
[13:20:04] Nikerabbit: there's also a bunch of added syslog spam from php-fpm on jobrunners https://logstash.wikimedia.org/goto/7f1b8370cbf382c0f8cdb8c7ec316a1e [13:20:31] moritzm: you're asking me? i seem to get prompted once a week to authenticate (on mondays, as it happens). i'm not doing anything special that i'm aware of [13:20:39] moritzm: fyi i think the issue was caused by the fix to https://github.com/apereo/mod_auth_cas/issues/186, i also saw similar issues when i changed the Proy-as url to remove the '//' [13:20:50] cdanis: looking, though it seems impossible that the change itself would cause anything like this [13:21:02] yeah I haven't dug into it, just noticed it seemed to correlate [13:22:17] Nikerabbit: ah, I've found it: we're getting 15k messages/5 minutes of: PHP Notice: Undefined variable: wmgTranslationNotificationUserPassword in /srv/mediawiki/wmf-config/CommonSettings.php on line 2911 [13:22:53] cdanis: weird, why did that not appear in log-spam watch [13:23:12] Nikerabbit: for whatever reason it's coming from the 'syslog' input instead of the 'mediawiki' input [13:23:38] wait what... CommonSettings.php [13:23:42] !log Deploy schema change on s6 eqiad primary master - T238966 [13:23:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:46] T238966: Apply updates for MCR, actor migration, and content migration, to production wikis. - https://phabricator.wikimedia.org/T238966 [13:23:56] (03CR) 10Elukey: [C: 03+1] Remove support for Python 3.5 and 3.6 [software/spicerack] - 10https://gerrit.wikimedia.org/r/607496 (owner: 10Volans) [13:24:12] 10Operations, 10vm-requests: Site: 2 VM request for kubernetes sessionstore dedicated nodes - https://phabricator.wikimedia.org/T256254 (10akosiaris) [13:24:19] cdanis: damnit, I synced the wrong file!
[13:24:25] 10Operations, 10vm-requests: Site: 2 VM request for kubernetes sessionstore dedicated nodes - https://phabricator.wikimedia.org/T256254 (10akosiaris) p:05Triage→03High [13:24:29] Nikerabbit: a story as old as scap [13:24:31] :) [13:24:41] umm, okay to sync the right file now? [13:24:41] 10Operations, 10vm-requests: Site: 2 VM request for kubernetes sessionstore dedicated nodes - https://phabricator.wikimedia.org/T256254 (10akosiaris) [13:24:43] 10Operations, 10serviceops, 10Patch-For-Review, 10Sustainability (Incident Prevention): Increase capacity of the sessionstore dedicated kubernetes nodes - https://phabricator.wikimedia.org/T256236 (10akosiaris) [13:24:47] jouncebot: now [13:24:47] For the next 1 hour(s) and 35 minute(s): Mediawiki train - American+European Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200624T1300) [13:24:54] (03CR) 10Volans: [C: 03+2] Remove support for Python 3.5 and 3.6 [software/spicerack] - 10https://gerrit.wikimedia.org/r/607496 (owner: 10Volans) [13:25:05] brennen|afk: hashar: are you deploying now? [13:25:47] Nikerabbit: if you don't hear back in a couple minutes, I say go ahead and do it [13:25:50] cdanis: nop [13:26:00] please run whatever you need :] [13:26:01] Nikerabbit: ^ [13:26:03] 👍 [13:26:14] thanks, I'll do it [13:27:07] (03Merged) 10jenkins-bot: Remove support for Python 3.5 and 3.6 [software/spicerack] - 10https://gerrit.wikimedia.org/r/607496 (owner: 10Volans) [13:28:42] cdanis: perhaps php warnings generated in configuration files are handled differently, before proper logging is set up. 
I was not aware of that :/ [13:28:43] !log nikerabbit@deploy1001 Synchronized wmf-config/CommonSettings.php: [config] 603167 Remove TranslationNotifications user settings 1/2 (2nd attempt, now with correct file) (duration: 01m 06s) [13:28:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:52] yeah it would make some sense [13:29:25] jbond42: yeah, that was my thought as well [13:29:25] (03CR) 10Jbond: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/607497 (owner: 10Volans) [13:29:39] kormat: ack, thx, you're using a long term session, then :-) [13:29:54] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/compiler1003/23421/malmok.wikimedia.org/change.malmok.wikimedia.org.err" [puppet] - 10https://gerrit.wikimedia.org/r/607477 (owner: 10Ssingh) [13:30:05] (03PS1) 10Volans: Fix newly reported issues by prospector [cookbooks] - 10https://gerrit.wikimedia.org/r/607504 [13:30:07] (03PS1) 10Volans: Remove support for old Python versions add newer [cookbooks] - 10https://gerrit.wikimedia.org/r/607505 [13:31:35] Nikerabbit: looks fixed, thanks! [13:32:09] (03CR) 10Elukey: [C: 03+1] actions: refactor for more recent Python versions [software/spicerack] - 10https://gerrit.wikimedia.org/r/607497 (owner: 10Volans) [13:32:15] cdanis: good that you noticed and found the real cause, that was very helpful [13:32:38] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM overall, I'd suggest testing the change with curl first, from e.g. 
prometheus1003, and see if metrics can be scraped" [puppet] - 10https://gerrit.wikimedia.org/r/607301 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [13:32:47] (03CR) 10Elukey: "recheck" [cookbooks] - 10https://gerrit.wikimedia.org/r/606736 (https://phabricator.wikimedia.org/T244499) (owner: 10Elukey) [13:32:49] (03PS2) 10Alexandros Kosiaris: Introduce kubernetes[12]01[56] [dns] - 10https://gerrit.wikimedia.org/r/607495 (https://phabricator.wikimedia.org/T256236) [13:33:19] (03CR) 10jerkins-bot: [V: 04-1] Introduce kubernetes[12]01[56] [dns] - 10https://gerrit.wikimedia.org/r/607495 (https://phabricator.wikimedia.org/T256236) (owner: 10Alexandros Kosiaris) [13:35:13] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 31st May) rack/setup/install db114[1-9] - https://phabricator.wikimedia.org/T251614 (10Marostegui) [13:35:40] (03CR) 10jerkins-bot: [V: 04-1] hadoop - Add change-distro.py and stop-cluster.py [cookbooks] - 10https://gerrit.wikimedia.org/r/606736 (https://phabricator.wikimedia.org/T244499) (owner: 10Elukey) [13:37:44] (03CR) 10Elukey: [C: 03+1] "From my ignorant-codebase point of view it looks good, I don't have a lot of context when/why cast() is used but its usage looks sound." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/607498 (owner: 10Volans) [13:38:12] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/spicerack] - 10https://gerrit.wikimedia.org/r/607498 (owner: 10Volans) [13:38:22] (03CR) 10Volans: [C: 03+2] actions: refactor for more recent Python versions [software/spicerack] - 10https://gerrit.wikimedia.org/r/607497 (owner: 10Volans) [13:39:38] (03CR) 10Jcrespo: "see inline response" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/607258 (owner: 10Jcrespo) [13:40:11] (03CR) 10Elukey: [C: 03+1] Fix newly reported issues by prospector [cookbooks] - 10https://gerrit.wikimedia.org/r/607504 (owner: 10Volans) [13:40:26] (03CR) 10Elukey: [C: 03+1] Remove support for old Python versions add newer [cookbooks] - 10https://gerrit.wikimedia.org/r/607505 (owner: 10Volans) [13:41:08] (03Merged) 10jenkins-bot: actions: refactor for more recent Python versions [software/spicerack] - 10https://gerrit.wikimedia.org/r/607497 (owner: 10Volans) [13:43:23] (03CR) 10Volans: "> Patch Set 1: Code-Review+1" [software/spicerack] - 10https://gerrit.wikimedia.org/r/607498 (owner: 10Volans) [13:44:24] (03CR) 10Volans: [C: 03+2] Fix newly reported issues by prospector [cookbooks] - 10https://gerrit.wikimedia.org/r/607504 (owner: 10Volans) [13:44:30] (03CR) 10Volans: [C: 03+2] Remove support for old Python versions add newer [cookbooks] - 10https://gerrit.wikimedia.org/r/607505 (owner: 10Volans) [13:45:43] (03CR) 10Jbond: [C: 03+2] apereo_cas: add support to store u2f using JPA [puppet] - 10https://gerrit.wikimedia.org/r/607475 (https://phabricator.wikimedia.org/T256113) (owner: 10Jbond) [13:45:48] (03CR) 10Jbond: [C: 03+2] idp_test: enable u2f jpa [puppet] - 10https://gerrit.wikimedia.org/r/607476 (https://phabricator.wikimedia.org/T256120) (owner: 10Jbond) [13:46:51] (03PS3) 10Alexandros Kosiaris: Introduce kubernetes[12]01[56] [dns] - 10https://gerrit.wikimedia.org/r/607495 (https://phabricator.wikimedia.org/T256236) [13:47:10] (03Merged) 
10jenkins-bot: Fix newly reported issues by prospector [cookbooks] - 10https://gerrit.wikimedia.org/r/607504 (owner: 10Volans) [13:47:12] (03Merged) 10jenkins-bot: Remove support for old Python versions add newer [cookbooks] - 10https://gerrit.wikimedia.org/r/607505 (owner: 10Volans) [13:48:19] (03PS1) 10Muehlenhoff: Remove cas-logstash from caches [puppet] - 10https://gerrit.wikimedia.org/r/607508 (https://phabricator.wikimedia.org/T246998) [13:48:21] (03PS1) 10Muehlenhoff: Remove IDP defintions for logstash vhosts [puppet] - 10https://gerrit.wikimedia.org/r/607509 (https://phabricator.wikimedia.org/T246998) [13:48:41] (03CR) 10jerkins-bot: [V: 04-1] Remove cas-logstash from caches [puppet] - 10https://gerrit.wikimedia.org/r/607508 (https://phabricator.wikimedia.org/T246998) (owner: 10Muehlenhoff) [13:49:31] (03PS8) 10Elukey: hadoop - Add change-distro.py and stop-cluster.py [cookbooks] - 10https://gerrit.wikimedia.org/r/606736 (https://phabricator.wikimedia.org/T244499) [13:50:34] (03CR) 10jerkins-bot: [V: 04-1] hadoop - Add change-distro.py and stop-cluster.py [cookbooks] - 10https://gerrit.wikimedia.org/r/606736 (https://phabricator.wikimedia.org/T244499) (owner: 10Elukey) [13:51:06] CUSTOM - LVS thanos-query codfw port 80/tcp - Prometheus long-term storage- query service IPv4 #page on thanos-query.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 153 bytes in 0.073 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [13:52:15] 10Operations, 10Wikimedia-Mailing-lists, 10Accessibility: Pipermail uses background color without foreground colors - https://phabricator.wikimedia.org/T190061 (10ema) [13:53:29] PROBLEM - Check systemd state on icinga1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:53:39] PROBLEM - Check the last execution of sync_check_icinga_contacts on icinga1001 is CRITICAL: CRITICAL: Status of the systemd unit sync_check_icinga_contacts https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [13:54:04] 10Operations, 10Wikimedia-Mailing-lists: Request for new mailing list for ILAE English Wikipedia project - https://phabricator.wikimedia.org/T256193 (10ema) p:05Triage→03Medium [13:54:19] 10Operations, 10Wikimedia-Mailing-lists: Creation of mailinglist for Board of WUG Esperanto and Free Knowledge - https://phabricator.wikimedia.org/T255951 (10ema) p:05Triage→03Medium [13:54:20] (03PS1) 10Jcrespo: mariadb-backups: Move x1 backup source from db1095 to db1102 [puppet] - 10https://gerrit.wikimedia.org/r/607510 (https://phabricator.wikimedia.org/T254871) [13:55:29] (03CR) 10jerkins-bot: [V: 04-1] mariadb-backups: Move x1 backup source from db1095 to db1102 [puppet] - 10https://gerrit.wikimedia.org/r/607510 (https://phabricator.wikimedia.org/T254871) (owner: 10Jcrespo) [13:58:06] (03PS1) 10Andrew Bogott: dbproxy1021: define openstack_controllers for firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/607513 [13:58:38] (03PS2) 10Jcrespo: mariadb-backups: Move x1 backup source from db1095 to db1102 [puppet] - 10https://gerrit.wikimedia.org/r/607510 (https://phabricator.wikimedia.org/T254871) [13:59:25] (03PS7) 10JMeybohm: WIP: chartmuseum: Add initial module, profile and role [puppet] - 10https://gerrit.wikimedia.org/r/606956 (https://phabricator.wikimedia.org/T253843) [13:59:45] (03CR) 10jerkins-bot: [V: 04-1] mariadb-backups: Move x1 backup source from db1095 to db1102 [puppet] - 10https://gerrit.wikimedia.org/r/607510 (https://phabricator.wikimedia.org/T254871) (owner: 10Jcrespo) [14:02:32] (03PS1) 10Volans: metamonitoring: email is optional in Icinga contacts [puppet] - 10https://gerrit.wikimedia.org/r/607514 [14:02:33] godog: fix ^^^ 
[14:03:05] (03PS1) 10Jcrespo: mariadb-backups: Remove x1 from db1095 and enable db1102 notif. [puppet] - 10https://gerrit.wikimedia.org/r/607515 (https://phabricator.wikimedia.org/T254871) [14:03:30] (03CR) 10CDanis: [C: 03+1] metamonitoring: email is optional in Icinga contacts [puppet] - 10https://gerrit.wikimedia.org/r/607514 (owner: 10Volans) [14:03:34] (03CR) 10Filippo Giunchedi: [C: 03+1] metamonitoring: email is optional in Icinga contacts [puppet] - 10https://gerrit.wikimedia.org/r/607514 (owner: 10Volans) [14:03:41] (03CR) 10Volans: [C: 03+2] metamonitoring: email is optional in Icinga contacts [puppet] - 10https://gerrit.wikimedia.org/r/607514 (owner: 10Volans) [14:04:59] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=idp site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:06:21] (03PS3) 10Jcrespo: mariadb-backups: Move x1 backup source from db1095 to db1102 [puppet] - 10https://gerrit.wikimedia.org/r/607510 (https://phabricator.wikimedia.org/T254871) [14:07:43] (03CR) 10Volans: "Replied to question" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/606736 (https://phabricator.wikimedia.org/T244499) (owner: 10Elukey) [14:08:37] (03PS2) 10Jcrespo: mariadb-backups: Remove x1 from db1095 and enable db1102 notif. [puppet] - 10https://gerrit.wikimedia.org/r/607515 (https://phabricator.wikimedia.org/T254871) [14:09:13] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:10:28] (03CR) 10Volans: [C: 03+2] Add type hints for variables and attributes [software/spicerack] - 10https://gerrit.wikimedia.org/r/607498 (owner: 10Volans) [14:11:33] (03PS3) 10Jcrespo: mariadb-backups: Remove x1 from db1095 and enable db1102 notif. 
[puppet] - 10https://gerrit.wikimedia.org/r/607515 (https://phabricator.wikimedia.org/T254871) [14:12:49] (03Merged) 10jenkins-bot: Add type hints for variables and attributes [software/spicerack] - 10https://gerrit.wikimedia.org/r/607498 (owner: 10Volans) [14:13:45] (03PS9) 10Elukey: hadoop - Add change-distro.py and stop-cluster.py [cookbooks] - 10https://gerrit.wikimedia.org/r/606736 (https://phabricator.wikimedia.org/T244499) [14:15:00] (03CR) 10RLazarus: cumin: backup all of /srv where a lot of deployment state may live (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/607258 (owner: 10Jcrespo) [14:16:35] 10Operations, 10SRE-tools, 10Patch-For-Review: Audit all cumin queries in switchdc scripts - https://phabricator.wikimedia.org/T243935 (10Volans) a:05Volans→03Kormat The only out of date module is the mysql one (now mysql_legacy). Agreed with @Kormat that he will take care of it given the new Puppet sele... [14:16:53] (03PS1) 10Jbond: rake_utils: make the yaml_defaults check none voting [puppet] - 10https://gerrit.wikimedia.org/r/607517 [14:17:13] (03CR) 10jerkins-bot: [V: 04-1] rake_utils: make the yaml_defaults check none voting [puppet] - 10https://gerrit.wikimedia.org/r/607517 (owner: 10Jbond) [14:19:02] (03PS4) 10Hashar: Bump minimum Python to 3.5; also test with 3.7 [software/keyholder] - 10https://gerrit.wikimedia.org/r/485706 (owner: 10Faidon Liambotis) [14:19:40] (03CR) 10Muehlenhoff: [C: 03+1] cumin: backup all of /srv where a lot of deployment state may live (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/607258 (owner: 10Jcrespo) [14:20:53] *14 [14:20:53] hashar: please don't merge any keyholder changes [14:21:01] (03PS2) 10Jbond: rake_utils: make the yaml_defaults check none voting [puppet] - 10https://gerrit.wikimedia.org/r/607517 [14:21:03] (03CR) 10Filippo Giunchedi: "LGTM, just a nit re: account name" (031 comment) [labs/private] - 10https://gerrit.wikimedia.org/r/607468 (https://phabricator.wikimedia.org/T256020) (owner: 
10JMeybohm) [14:21:05] (03CR) 10Filippo Giunchedi: "LGTM, just a nit re: account name" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/607467 (https://phabricator.wikimedia.org/T256020) (owner: 10JMeybohm) [14:21:28] paravoid: ho no I am not going to merge any. I have the first two on my review chain somehow and I am reordering them to get them to pass the tests;) [14:21:40] ack, thanks :) [14:21:59] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/607492 (https://phabricator.wikimedia.org/T256113) (owner: 10Jbond) [14:22:22] paravoid: but I guess the first few in the chain are rather trivial ;] [14:22:28] (03CR) 10Jbond: [C: 03+2] idp: enable memcached on production idp servers [puppet] - 10https://gerrit.wikimedia.org/r/607492 (https://phabricator.wikimedia.org/T256113) (owner: 10Jbond) [14:22:39] (03PS4) 10Hashar: protocol.compat: disable a couple of pylint errors [software/keyholder] - 10https://gerrit.wikimedia.org/r/485705 (owner: 10Faidon Liambotis) [14:23:37] (03CR) 10Hashar: [C: 03+1] "I have made this the first in the chain and seems straightforward." 
[software/keyholder] - 10https://gerrit.wikimedia.org/r/485706 (owner: 10Faidon Liambotis) [14:24:05] (03PS1) 10Jbond: Revert "idp: enable memcached on production idp servers" [puppet] - 10https://gerrit.wikimedia.org/r/607518 [14:24:20] (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "idp: enable memcached on production idp servers" [puppet] - 10https://gerrit.wikimedia.org/r/607518 (owner: 10Jbond) [14:24:44] (03PS1) 10Jbond: idp: enable memcached on production idp servers [puppet] - 10https://gerrit.wikimedia.org/r/607519 (https://phabricator.wikimedia.org/T256113) [14:24:46] RECOVERY - Check the last execution of sync_check_icinga_contacts on icinga1001 is OK: OK: Status of the systemd unit sync_check_icinga_contacts https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [14:25:32] (03CR) 10Jcrespo: "> Patch Set 1:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/607258 (owner: 10Jcrespo) [14:26:09] (03CR) 10Jbond: [C: 03+2] rake_utils: make the yaml_defaults check none voting [puppet] - 10https://gerrit.wikimedia.org/r/607517 (owner: 10Jbond) [14:26:38] (03PS1) 10Ottomata: Migrate SearchSatisfaction from EventLogging to EventGate on group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607520 (https://phabricator.wikimedia.org/T238230) [14:26:43] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/607519 (https://phabricator.wikimedia.org/T256113) (owner: 10Jbond) [14:26:45] (03CR) 10Hashar: "Rebased on top of https://gerrit.wikimedia.org/r/#/c/operations/software/keyholder/+/485706/ which drops python 3.4" [software/keyholder] - 10https://gerrit.wikimedia.org/r/485705 (owner: 10Faidon Liambotis) [14:27:55] (03CR) 10Volans: "reply inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/607258 (owner: 10Jcrespo) [14:28:30] (03PS2) 10JMeybohm: profile: thanos::swift::frontend add account for chartmuseum [puppet] - 10https://gerrit.wikimedia.org/r/607467 
(https://phabricator.wikimedia.org/T256020) [14:32:22] (03CR) 10Filippo Giunchedi: [C: 03+1] profile: thanos::swift::frontend add account for chartmuseum [puppet] - 10https://gerrit.wikimedia.org/r/607467 (https://phabricator.wikimedia.org/T256020) (owner: 10JMeybohm) [14:33:50] (03CR) 10Elukey: hadoop - Add change-distro.py and stop-cluster.py (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/606736 (https://phabricator.wikimedia.org/T244499) (owner: 10Elukey) [14:34:58] (03PS10) 10Elukey: hadoop - Add change-distro.py and stop-cluster.py [cookbooks] - 10https://gerrit.wikimedia.org/r/606736 (https://phabricator.wikimedia.org/T244499) [14:36:40] RECOVERY - Check systemd state on icinga1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:37:55] hashar: trying to read backscroll but, is train deploy done? can I deploy a config change? [14:38:21] 10Operations, 10Wikimedia-Mailing-lists: Request for new mailing list for ILAE English Wikipedia project - https://phabricator.wikimedia.org/T256193 (10ema) @Diptanshu.D: list created, you should have received an email. 
[14:38:41] ottomata: yeah, no train needed this timeslot [14:39:07] k gr8 [14:39:08] danke [14:39:24] (03CR) 10Ottomata: [C: 03+2] Migrate SearchSatisfaction from EventLogging to EventGate on group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607520 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [14:42:03] (03PS2) 10JMeybohm: thanos::swift add chartmuseum account key [labs/private] - 10https://gerrit.wikimedia.org/r/607468 (https://phabricator.wikimedia.org/T256020) [14:42:25] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Migrate SearchSatisfaction from EventLogging to EventGate on group0 - T249261 (duration: 01m 06s) [14:42:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:33] T249261: Vertical: Migrate SearchSatisfaction EventLogging event stream to Event Platform - https://phabricator.wikimedia.org/T249261 [14:45:28] (03CR) 10Volans: "Looks good to me as a starting point to iterate on, minor nits inline." (034 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/603434 (https://phabricator.wikimedia.org/T255409) (owner: 10Kormat) [14:45:54] (03PS1) 10Kormat: pontoon: Fake confd, and protect against double-enroll. [puppet] - 10https://gerrit.wikimedia.org/r/607523 [14:47:11] (03PS1) 10Hashar: contint: move Apache config to flat file [puppet] - 10https://gerrit.wikimedia.org/r/607524 [14:47:13] (03PS1) 10Hashar: doc: move Apache config to flat file [puppet] - 10https://gerrit.wikimedia.org/r/607525 [14:49:13] 10Operations, 10DBA, 10CAS-SSO, 10Patch-For-Review, 10User-jbond: Request new database for idp-test.wikimedia.org - https://phabricator.wikimedia.org/T256120 (10jbond) >>! In T256120#6252052, @Marostegui wrote: > Once this is tested and ready to move to production m1, I will work on the .sql files to kee...
[14:50:06] 10Operations, 10DBA, 10CAS-SSO, 10Patch-For-Review, 10User-jbond: Request new database for idp-test.wikimedia.org - https://phabricator.wikimedia.org/T256120 (10Marostegui) Should be fixed now. [14:50:17] (03PS13) 10Kormat: Add native mysql spicerack module. [software/spicerack] - 10https://gerrit.wikimedia.org/r/603434 (https://phabricator.wikimedia.org/T255409) [14:50:55] (03PS14) 10Kormat: Add native mysql spicerack module. [software/spicerack] - 10https://gerrit.wikimedia.org/r/603434 (https://phabricator.wikimedia.org/T255409) [14:51:21] (03CR) 10JMeybohm: [C: 03+2] thanos::swift add chartmuseum account key [labs/private] - 10https://gerrit.wikimedia.org/r/607468 (https://phabricator.wikimedia.org/T256020) (owner: 10JMeybohm) [14:51:29] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] thanos::swift add chartmuseum account key [labs/private] - 10https://gerrit.wikimedia.org/r/607468 (https://phabricator.wikimedia.org/T256020) (owner: 10JMeybohm) [14:51:32] (03CR) 10Jbond: [C: 03+1] "LGTM thanks" [puppet] - 10https://gerrit.wikimedia.org/r/607477 (owner: 10Ssingh) [14:51:41] (03CR) 10Kormat: Add native mysql spicerack module. (034 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/603434 (https://phabricator.wikimedia.org/T255409) (owner: 10Kormat) [14:51:53] (03CR) 10JMeybohm: [C: 03+2] profile: thanos::swift::frontend add account for chartmuseum [puppet] - 10https://gerrit.wikimedia.org/r/607467 (https://phabricator.wikimedia.org/T256020) (owner: 10JMeybohm) [14:52:00] (03CR) 10Jbond: "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/607477 (owner: 10Ssingh) [14:53:46] (03CR) 10RLazarus: "Aaron, Krinkle: Question for you in particular." 
[puppet] - 10https://gerrit.wikimedia.org/r/594760 (https://phabricator.wikimedia.org/T244340) (owner: 10RLazarus) [14:54:54] RECOVERY - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is OK: (C)210 ge (W)150 ge 102.3 https://phabricator.wikimedia.org/T202307 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [14:55:07] 10Operations, 10Wikimedia-Mailing-lists: Close teampractices mailing list (as it has no active admins) - https://phabricator.wikimedia.org/T255525 (10ema) p:05Triage→03Medium [14:55:28] (03PS1) 10Jbond: wikidough: add secrets [labs/private] - 10https://gerrit.wikimedia.org/r/607528 [14:56:20] (03CR) 10Jbond: [V: 03+2 C: 03+2] wikidough: add secrets [labs/private] - 10https://gerrit.wikimedia.org/r/607528 (owner: 10Jbond) [14:57:00] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/607523 (owner: 10Kormat) [14:57:13] !log rmlist teampractices T255525 [14:57:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:17] T255525: Close teampractices mailing list (as it has no active admins) - https://phabricator.wikimedia.org/T255525 [14:57:29] (03CR) 10Kormat: [C: 03+2] pontoon: Fake confd, and protect against double-enroll. 
[puppet] - 10https://gerrit.wikimedia.org/r/607523 (owner: 10Kormat) [14:57:42] !log rebooting deneb for kernel update [14:57:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:46] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [14:57:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:14] 10Operations, 10Wikimedia-Mailing-lists: Close teampractices mailing list (as it has no active admins) - https://phabricator.wikimedia.org/T255525 (10ema) 05Open→03Resolved a:03ema [15:00:08] (03CR) 10MSantos: charts for push-notification service (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/602390 (https://phabricator.wikimedia.org/T250493) (owner: 10MSantos) [15:00:09] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [15:00:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:34] (03PS1) 10Alexandros Kosiaris: conftool: Add new kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/607530 [15:00:36] (03PS1) 10Alexandros Kosiaris: lvs: Add new proton TLS service [puppet] - 10https://gerrit.wikimedia.org/r/607531 (https://phabricator.wikimedia.org/T225680) [15:00:38] (03PS1) 10Alexandros Kosiaris: lvs: Switch proton to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/607532 (https://phabricator.wikimedia.org/T225680) [15:00:40] (03PS1) 10Alexandros Kosiaris: lvs: Switch proton to monitoring_setup [puppet] - 10https://gerrit.wikimedia.org/r/607533 (https://phabricator.wikimedia.org/T225680) [15:00:42] (03PS1) 10Alexandros Kosiaris: lvs: Switch proton to production [puppet] - 10https://gerrit.wikimedia.org/r/607534 (https://phabricator.wikimedia.org/T225680) [15:00:44] (03PS1) 10Alexandros Kosiaris: proton: Switch dev restbase to talk to TLS proton [puppet] - 10https://gerrit.wikimedia.org/r/607535 (https://phabricator.wikimedia.org/T225680) [15:00:46] (03PS1) 10Alexandros Kosiaris: proton: Switch restbase 
production to TLS [puppet] - 10https://gerrit.wikimedia.org/r/607536 (https://phabricator.wikimedia.org/T225680) [15:01:27] (03PS2) 10Jbond: wikidough: improve naming of hiera keys and class variables [puppet] - 10https://gerrit.wikimedia.org/r/607477 (owner: 10Ssingh) [15:01:41] 10Operations, 10SRE-Access-Requests: Requesting access to PROD for lmata (SRE) - https://phabricator.wikimedia.org/T254818 (10Dzahn) "Membership of ops group in LDAP and YAML are not identical: ['lmata']" [15:05:43] (03PS1) 10Elukey: profile::reportupdater::jobs: absent old RU job [puppet] - 10https://gerrit.wikimedia.org/r/607537 (https://phabricator.wikimedia.org/T234826) [15:06:09] !log merging backports and running a full scap sync for UBN at T256151 [15:06:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:13] T256151: Undefined constants NS_LQT_THREAD, NS_LQT_SUMMARY, NS_LQT_SUMMARY_TALK, NS_LQT_THREAD_TALK while rebuilding localization cache - https://phabricator.wikimedia.org/T256151 [15:07:50] (03PS3) 10Jbond: wikidough: improve naming of hiera keys and class variables [puppet] - 10https://gerrit.wikimedia.org/r/607477 (owner: 10Ssingh) [15:08:34] (03CR) 10Brennen Bearnes: [C: 03+2] Define NS_LQT in Lqt.namespaces.php [extensions/LiquidThreads] (wmf/1.35.0-wmf.37) - 10https://gerrit.wikimedia.org/r/607379 (https://phabricator.wikimedia.org/T256151) (owner: 10Krinkle) [15:08:56] (03CR) 10Brennen Bearnes: [C: 03+2] Define NS_LQT in Lqt.namespaces.php [extensions/LiquidThreads] (wmf/1.35.0-wmf.38) - 10https://gerrit.wikimedia.org/r/607380 (https://phabricator.wikimedia.org/T256151) (owner: 10Krinkle) [15:11:58] (03Merged) 10jenkins-bot: Define NS_LQT in Lqt.namespaces.php [extensions/LiquidThreads] (wmf/1.35.0-wmf.37) - 10https://gerrit.wikimedia.org/r/607379 (https://phabricator.wikimedia.org/T256151) (owner: 10Krinkle) [15:12:48] (03Merged) 10jenkins-bot: Define NS_LQT in Lqt.namespaces.php [extensions/LiquidThreads] (wmf/1.35.0-wmf.38) - 
10https://gerrit.wikimedia.org/r/607380 (https://phabricator.wikimedia.org/T256151) (owner: 10Krinkle) [15:13:08] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/605688 (https://phabricator.wikimedia.org/T251466) (owner: 10Cwhite) [15:14:57] (03CR) 10Jcrespo: "> Patch Set 1:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/607258 (owner: 10Jcrespo) [15:16:11] (03PS2) 10Elukey: profile::reportupdater::jobs: absent old RU job [puppet] - 10https://gerrit.wikimedia.org/r/607537 (https://phabricator.wikimedia.org/T234826) [15:16:40] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service: [EPIC] Deploy push-notifications service to production - https://phabricator.wikimedia.org/T256237 (10MSantos) p:05Triage→03High [15:19:18] !log ppchelko@deploy1001 Started deploy [restbase/deploy@9686627]: Release updates to PCS endpoints [15:19:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:40] !log rolling restart of swift-proxy on thanos-fe[2001-2003].codfw.wmnet,thanos-fe[1001-1003].eqiad.wmnet - T256020 [15:20:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:45] T256020: Access to the thanos-swift cluster for ChartMuseum - https://phabricator.wikimedia.org/T256020 [15:24:19] !log ppchelko@deploy1001 deploy aborted: Release updates to PCS endpoints (duration: 05m 04s) [15:24:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:22] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 returned the unexpe [15:24:22] expecting: 200) 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:25:02] (03PS1) 10Jbond: wikidough: test restore old params for CI production test [labs/private] - 10https://gerrit.wikimedia.org/r/607540 [15:25:16] (03CR) 10Jbond: [V: 03+2 C: 03+2] wikidough: test restore old params for CI production test [labs/private] - 10https://gerrit.wikimedia.org/r/607540 (owner: 10Jbond) [15:25:55] !log ppchelko@deploy1001 Started deploy [restbase/deploy@386b736]: Revert [15:25:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:16] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 returned the unexpe [15:26:16] expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:27:26] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 29.79 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [15:27:56] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:28:10] PROBLEM - restbase endpoints health on restbase2021 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 returned the unexpe [15:28:10] 
expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:28:18] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpe [15:28:18] expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:28:19] ^^ my bad [15:28:23] reverting now [15:28:44] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpe [15:28:44] expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:29:00] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpe [15:29:00] expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:29:07] Pchelolo: was about to run a scap sync; holding for the moment [15:29:14] PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from 
storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 returned the unexpe [15:29:14] expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:31:42] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:31:52] RECOVERY - restbase endpoints health on restbase2021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:32:00] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:32:22] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:32:38] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:32:52] RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:33:03] (03PS1) 10Jbond: test with no undersocre [labs/private] - 10https://gerrit.wikimedia.org/r/607541 [15:33:16] (03PS4) 10Jbond: wikidough: improve naming of hiera keys and class variables [puppet] - 10https://gerrit.wikimedia.org/r/607477 (owner: 10Ssingh) [15:33:18] (03CR) 10Jbond: [V: 03+2 C: 03+2] test with no undersocre [labs/private] - 10https://gerrit.wikimedia.org/r/607541 (owner: 10Jbond) [15:33:29] Pchelolo: all clear? [15:33:39] brennen: yeah. apologies for this.. [15:33:48] no worries - thx. going ahead with scap. 
[15:34:20] (03CR) 10jerkins-bot: [V: 04-1] wikidough: improve naming of hiera keys and class variables [puppet] - 10https://gerrit.wikimedia.org/r/607477 (owner: 10Ssingh) [15:34:25] !log brennen@deploy1001 Started scap: (no justification provided) [15:34:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:38] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)8 ge (W)1 ge 0.25 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [15:36:05] !log kormat@cumin1001 dbctl commit (dc=all): 'Pool db1088 @ 100% into s6 T255927', diff saved to https://phabricator.wikimedia.org/P11652 and previous config saved to /var/cache/conftool/dbconfig/20200624-153604-kormat.json [15:36:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:09] T255927: db1088 crashed - https://phabricator.wikimedia.org/T255927 [15:38:32] !log previous scap sync for T256151 - [[gerrit:607379]] and [[gerrit:607380]] [15:38:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:37] T256151: Undefined constants NS_LQT_THREAD, NS_LQT_SUMMARY, NS_LQT_SUMMARY_TALK, NS_LQT_THREAD_TALK while rebuilding localization cache - https://phabricator.wikimedia.org/T256151 [15:38:53] /query nuria [15:38:57] nope :) [15:39:15] (03PS1) 10Herron: logstash: use system openjdk 11 for logging ES7 instances [puppet] - 10https://gerrit.wikimedia.org/r/607542 (https://phabricator.wikimedia.org/T252913) [15:40:52] (03CR) 10Mforns: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/607537 (https://phabricator.wikimedia.org/T234826) (owner: 10Elukey) [15:41:52] (03PS1) 10Jbond: wikidough: testing merge [puppet] - 10https://gerrit.wikimedia.org/r/607543 [15:44:05] (03CR) 10Elukey: [C: 03+2] profile::reportupdater::jobs: absent old RU job [puppet] - 10https://gerrit.wikimedia.org/r/607537 (https://phabricator.wikimedia.org/T234826) (owner: 10Elukey) [15:44:49] jbond42: just merged your patch in labs-private [15:45:14] (03PS2) 10Jbond: wikidough: testing merge [puppet] - 10https://gerrit.wikimedia.org/r/607543 [15:48:02] (03PS3) 10Jbond: wikidough: testing merge [puppet] - 10https://gerrit.wikimedia.org/r/607543 [15:48:27] (03PS2) 10Herron: logstash: use system openjdk 11 for logging ES7 instances [puppet] - 10https://gerrit.wikimedia.org/r/607542 (https://phabricator.wikimedia.org/T252913) [15:50:08] (03PS4) 10Jbond: wikidough: testing merge [puppet] - 10https://gerrit.wikimedia.org/r/607543 [15:52:27] (03CR) 10Ssingh: "Thanks! I tried fetching the metrics via curl from prometheus2003 and that worked." 
[puppet] - 10https://gerrit.wikimedia.org/r/607301 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [15:52:36] (03PS2) 10Alexandros Kosiaris: lvs: Add new proton TLS service [puppet] - 10https://gerrit.wikimedia.org/r/607531 (https://phabricator.wikimedia.org/T225680) [15:52:38] (03PS2) 10Alexandros Kosiaris: lvs: Switch proton to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/607532 (https://phabricator.wikimedia.org/T225680) [15:52:40] (03PS2) 10Alexandros Kosiaris: lvs: Switch proton to monitoring_setup [puppet] - 10https://gerrit.wikimedia.org/r/607533 (https://phabricator.wikimedia.org/T225680) [15:52:42] (03PS2) 10Alexandros Kosiaris: lvs: Switch proton to production [puppet] - 10https://gerrit.wikimedia.org/r/607534 (https://phabricator.wikimedia.org/T225680) [15:52:44] (03PS2) 10Alexandros Kosiaris: proton: Switch dev restbase to talk to TLS proton [puppet] - 10https://gerrit.wikimedia.org/r/607535 (https://phabricator.wikimedia.org/T225680) [15:52:46] (03PS2) 10Alexandros Kosiaris: proton: Switch restbase production to TLS [puppet] - 10https://gerrit.wikimedia.org/r/607536 (https://phabricator.wikimedia.org/T225680) [15:52:50] (03PS5) 10Jbond: wikidough: testing merge [puppet] - 10https://gerrit.wikimedia.org/r/607543 [15:53:16] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@386b736]: Revert (duration: 27m 21s) [15:53:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:32] (03CR) 10Ssingh: [C: 03+2] prometheus: add wikidough statistics [puppet] - 10https://gerrit.wikimedia.org/r/607301 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [15:56:49] (03PS6) 10Jbond: wikidough: testing merge [puppet] - 10https://gerrit.wikimedia.org/r/607543 [15:59:50] 10Operations, 10CommRel-Specialists-Support (Jul-Sep-2020): CommRel support for FY2020-2021 Q1 DC switchover - https://phabricator.wikimedia.org/T244808 (10RLazarus) It looks like we'll try to do this: ideally we'll aim to do the 
switchover from eqiad to codfw in either mid-to-late August or early September, a... [16:00:19] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on pc2007 - https://phabricator.wikimedia.org/T255904 (10Papaul) a:05Papaul→03Kormat Disk replacement complete [16:01:02] (03PS7) 10Jbond: wikidough: testing merge [puppet] - 10https://gerrit.wikimedia.org/r/607543 [16:01:04] (03PS1) 10Elukey: Revert "Revert "Reimage db1108 to Debian Buster"" [puppet] - 10https://gerrit.wikimedia.org/r/607547 [16:01:07] (03PS2) 10Elukey: Revert "Revert "Reimage db1108 to Debian Buster"" [puppet] - 10https://gerrit.wikimedia.org/r/607547 [16:01:45] (03CR) 10jerkins-bot: [V: 04-1] wikidough: testing merge [puppet] - 10https://gerrit.wikimedia.org/r/607543 (owner: 10Jbond) [16:04:15] (03PS8) 10Jbond: wikidough: testing merge [puppet] - 10https://gerrit.wikimedia.org/r/607543 [16:05:53] 10Operations, 10ORES, 10Scoring-platform-team: [Epic] Deploy ORES in kubernetes cluster - https://phabricator.wikimedia.org/T182331 (10Halfak) Some notes from the deployment pipeline meeting: * ORES K8s and COW - https://phabricator.wikimedia.org/T182331 ** Takes advantage of CoW on a single machine ** Con... 
[16:05:55] (03CR) 10Elukey: [C: 03+2] Revert "Revert "Reimage db1108 to Debian Buster"" [puppet] - 10https://gerrit.wikimedia.org/r/607547 (owner: 10Elukey) [16:08:59] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/607525 (owner: 10Hashar) [16:09:01] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/607524 (owner: 10Hashar) [16:09:59] (03PS9) 10Jbond: wikidough: testing merge [puppet] - 10https://gerrit.wikimedia.org/r/607543 [16:11:07] (03CR) 10jerkins-bot: [V: 04-1] wikidough: testing merge [puppet] - 10https://gerrit.wikimedia.org/r/607543 (owner: 10Jbond) [16:12:30] (03PS10) 10Jbond: wikidough: testing merge [puppet] - 10https://gerrit.wikimedia.org/r/607543 [16:14:14] (03CR) 10Hashar: "https://puppet-compiler.wmflabs.org/compiler1002/450/contint2001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/607524 (owner: 10Hashar) [16:14:44] (03CR) 10Hashar: "https://puppet-compiler.wmflabs.org/compiler1003/449/doc1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/607525 (owner: 10Hashar) [16:15:20] 10Operations, 10ops-codfw, 10decommission-hardware: decommission ganeti200[1-6].codfw.wmnet - https://phabricator.wikimedia.org/T255554 (10Papaul) [16:15:22] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=wikidough site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:15:40] 10Operations, 10ops-codfw, 10decommission-hardware: decommission ganeti200[1-6].codfw.wmnet - https://phabricator.wikimedia.org/T255554 (10Papaul) 05Open→03Resolved Complete [16:17:28] !log reimage db1108 to debian Buster - T234826 [16:17:30] marostegui: --^ [16:17:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:32] T234826: Repurpose db1108 as generic Analytics db replica - https://phabricator.wikimedia.org/T234826 [16:21:06] 
(03CR) 10Andrew Bogott: [C: 03+2] dbproxy1021: define openstack_controllers for firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/607513 (owner: 10Andrew Bogott) [16:21:23] (03PS10) 10Dave Pifke: webperf: Remove XHGui dependency on MongoDB [puppet] - 10https://gerrit.wikimedia.org/r/603550 (https://phabricator.wikimedia.org/T180761) [16:22:32] (03CR) 10jerkins-bot: [V: 04-1] webperf: Remove XHGui dependency on MongoDB [puppet] - 10https://gerrit.wikimedia.org/r/603550 (https://phabricator.wikimedia.org/T180761) (owner: 10Dave Pifke) [16:25:36] (03PS11) 10Dave Pifke: webperf: Remove XHGui dependency on MongoDB [puppet] - 10https://gerrit.wikimedia.org/r/603550 (https://phabricator.wikimedia.org/T180761) [16:26:37] !log ppchelko@deploy1001 Started deploy [restbase/deploy@5f08f32]: Release PCS endpoints updates, take 2 [16:26:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:43] (03CR) 10jerkins-bot: [V: 04-1] webperf: Remove XHGui dependency on MongoDB [puppet] - 10https://gerrit.wikimedia.org/r/603550 (https://phabricator.wikimedia.org/T180761) (owner: 10Dave Pifke) [16:27:54] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime [16:27:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:48] (03PS12) 10Dave Pifke: webperf: Remove XHGui dependency on MongoDB [puppet] - 10https://gerrit.wikimedia.org/r/603550 (https://phabricator.wikimedia.org/T180761) [16:29:14] (03CR) 10Dzahn: wikidough: improve naming of hiera keys and class variables (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/607477 (owner: 10Ssingh) [16:30:26] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:30:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:47] !log brennen@deploy1001 Finished scap: (no justification provided) (duration: 60m 22s) [16:34:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:31] 
PROBLEM - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37 [16:38:15] (03CR) 10Krinkle: "As far as I know, the dc prefix does not need to be exposed in any way. I assume the only reason it was created is for internal mcrouter h" [puppet] - 10https://gerrit.wikimedia.org/r/594760 (https://phabricator.wikimedia.org/T244340) (owner: 10RLazarus) [16:39:08] (03CR) 10Dzahn: [C: 03+2] phabricator weekly changes email: Include URLs for listed projects [puppet] - 10https://gerrit.wikimedia.org/r/607218 (owner: 10Aklapper) [16:40:17] PROBLEM - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is CRITICAL: 101.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37 [16:40:18] (03CR) 10Dzahn: [C: 03+2] phabricator weekly changes email: List projects' color/icon violations [puppet] - 10https://gerrit.wikimedia.org/r/607222 (https://phabricator.wikimedia.org/T249806) (owner: 10Aklapper) [16:40:30] (03PS2) 10Dzahn: phabricator weekly changes email: List projects' color/icon violations [puppet] - 10https://gerrit.wikimedia.org/r/607222 (https://phabricator.wikimedia.org/T249806) (owner: 10Aklapper) [16:40:37] (03CR) 10Dzahn: [C: 03+2] "+--------------------------------------------------------+---------------+-----------+------+" [puppet] - 10https://gerrit.wikimedia.org/r/607222 (https://phabricator.wikimedia.org/T249806) (owner: 10Aklapper) [16:40:48] !log ppchelko@deploy1001 Finished deploy 
[restbase/deploy@5f08f32]: Release PCS endpoints updates, take 2 (duration: 14m 11s) [16:40:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:03] !log ppchelko@deploy1001 Started deploy [restbase/deploy@5f08f32]: Release PCS endpoints updates, feeds timed out, redo [16:41:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:22] (03CR) 10RLazarus: "Perfect, thanks. In that case we might unify them in future (per the existing TODO) but I won't worry about it for now. Everything else ab" [puppet] - 10https://gerrit.wikimedia.org/r/594760 (https://phabricator.wikimedia.org/T244340) (owner: 10RLazarus) [16:43:01] PROBLEM - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is CRITICAL: 101.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37 [16:46:14] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@5f08f32]: Release PCS endpoints updates, feeds timed out, redo (duration: 05m 11s) [16:46:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:54] 10Operations, 10SRE-Access-Requests: Requesting access to PROD for lmata (SRE) - https://phabricator.wikimedia.org/T254818 (10Dzahn) 05Resolved→03Open [16:51:40] PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 429 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: [16:51:40] most-read articles for January 1, 2016 (with aggregated=true) returned 
the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [16:53:25] (03CR) 10Herron: "PCC https://puppet-compiler.wmflabs.org/compiler1002/23445/" [puppet] - 10https://gerrit.wikimedia.org/r/607542 (https://phabricator.wikimedia.org/T252913) (owner: 10Herron) [16:55:12] (03PS3) 10Herron: logstash: use system openjdk 11 for logging ES7 instances [puppet] - 10https://gerrit.wikimedia.org/r/607542 (https://phabricator.wikimedia.org/T252913) [16:56:51] !log update archiva-deploy user's password in Jenkins credentials plugin [16:56:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:06] (03PS2) 10Dzahn: meet::accountmanager: add some fake private secrets (example) [labs/private] - 10https://gerrit.wikimedia.org/r/607153 [17:01:08] (03PS1) 10Dzahn: xhgui: add new fake password for new mysql db [labs/private] - 10https://gerrit.wikimedia.org/r/607567 (https://phabricator.wikimedia.org/T254795) [17:01:20] (03PS2) 10Dzahn: xhgui: add new fake password for new mysql db [labs/private] - 10https://gerrit.wikimedia.org/r/607567 (https://phabricator.wikimedia.org/T254795) [17:02:09] (03CR) 10Dzahn: [V: 03+2 C: 03+2] xhgui: add new fake password for new mysql db [labs/private] - 10https://gerrit.wikimedia.org/r/607567 (https://phabricator.wikimedia.org/T254795) (owner: 10Dzahn) [17:03:06] (03PS4) 10Herron: logstash: use system openjdk 11 for logging ES7 instances [puppet] - 10https://gerrit.wikimedia.org/r/607542 (https://phabricator.wikimedia.org/T252913) [17:04:05] (03CR) 10Jbond: [C: 04-1] "just gonna -1 this while i debug" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/607477 (owner: 10Ssingh) [17:06:25] (03PS11) 10Jbond: wikidough: testing merge [puppet] - 10https://gerrit.wikimedia.org/r/607543 [17:06:27] (03PS1) 10Elukey: Remove the analytics-slave CNAME [dns] - 10https://gerrit.wikimedia.org/r/607569 (https://phabricator.wikimedia.org/T234826) [17:06:42] (03CR) 10Herron: "updated PCC 
https://puppet-compiler.wmflabs.org/compiler1001/23447/" [puppet] - 10https://gerrit.wikimedia.org/r/607542 (https://phabricator.wikimedia.org/T252913) (owner: 10Herron) [17:08:09] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/607542 (https://phabricator.wikimedia.org/T252913) (owner: 10Herron) [17:08:50] (03PS1) 10Ssingh: prometheus: update scheme for wikidough (improves ab8a948a) [puppet] - 10https://gerrit.wikimedia.org/r/607570 [17:09:21] (03CR) 10Ottomata: [C: 03+1] Remove the analytics-slave CNAME [dns] - 10https://gerrit.wikimedia.org/r/607569 (https://phabricator.wikimedia.org/T234826) (owner: 10Elukey) [17:09:26] (03PS12) 10Jbond: wikidough: testing merge [puppet] - 10https://gerrit.wikimedia.org/r/607543 [17:10:39] (03CR) 10Elukey: [C: 03+2] Remove the analytics-slave CNAME [dns] - 10https://gerrit.wikimedia.org/r/607569 (https://phabricator.wikimedia.org/T234826) (owner: 10Elukey) [17:12:06] (03PS13) 10Jbond: wikidough: testing merge [puppet] - 10https://gerrit.wikimedia.org/r/607543 [17:12:26] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/compiler1001/23450/prometheus2003.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/607570 (owner: 10Ssingh) [17:12:30] (03PS13) 10Dzahn: webperf: Remove XHGui dependency on MongoDB [puppet] - 10https://gerrit.wikimedia.org/r/603550 (https://phabricator.wikimedia.org/T180761) (owner: 10Dave Pifke) [17:17:41] 10Operations, 10serviceops: SRE FY2019-20 Q3 goal: Increase reach of deployment pipeline - https://phabricator.wikimedia.org/T212935 (10Aklapper) [17:18:02] RECOVERY - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is OK: (C)100 gt (W)80 gt 78.31 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37 [17:18:11] 
(03PS1) 10Volans: scripts: unset rack/position in offline script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/607571 [17:23:29] (03CR) 10Ssingh: "> https://puppet-compiler.wmflabs.org/compiler1001/23450/prometheus2003.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/607570 (owner: 10Ssingh) [17:24:46] (03CR) 10Dzahn: "> Adding the DB is being tracked in T254795." [puppet] - 10https://gerrit.wikimedia.org/r/603550 (https://phabricator.wikimedia.org/T180761) (owner: 10Dave Pifke) [17:26:18] (03CR) 10Dzahn: "> Patch Set 13:" [puppet] - 10https://gerrit.wikimedia.org/r/603550 (https://phabricator.wikimedia.org/T180761) (owner: 10Dave Pifke) [17:27:21] (03CR) 10Dzahn: "PS13: i also moved the admin group to the new role name in Hiera, to ensure the same admins still have access as before when role name cha" [puppet] - 10https://gerrit.wikimedia.org/r/603550 (https://phabricator.wikimedia.org/T180761) (owner: 10Dave Pifke) [17:31:19] 10Operations, 10Release-Engineering-Team-TODO, 10Scap, 10Release-Engineering-Team (Deployment services), and 3 others: scap's logstash_checker.py is blissfully unaware of any logstash indexing latency - https://phabricator.wikimedia.org/T255197 (10thcipriani) [17:31:47] !log update archiva-ci user's password in Jenkins credentials plugin [17:31:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:22] (03PS4) 10Ppchelko: EventBus: Emit kafka purges for everything [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607298 (https://phabricator.wikimedia.org/T250781) [17:38:04] (03CR) 10CRusnov: [C: 03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/607571 (owner: 10Volans) [17:38:54] (03CR) 10Volans: [C: 03+2] scripts: unset rack/position in offline script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/607571 (owner: 10Volans) [17:43:04] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/607542 
(https://phabricator.wikimedia.org/T252913) (owner: 10Herron) [17:43:10] PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=- method=POST https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [17:44:56] RECOVERY - High average POST latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [17:49:34] 10Operations, 10serviceops: Clean up the /*/mw/ mcrouter routing prefix - https://phabricator.wikimedia.org/T256291 (10RLazarus) p:05Triage→03Low [17:53:31] (03CR) 10Dzahn: [C: 04-1] "@contint1001:~# cd /srv/deployment/integration/docroot/" [puppet] - 10https://gerrit.wikimedia.org/r/607076 (https://phabricator.wikimedia.org/T149924) (owner: 10Hashar) [17:53:46] (03PS11) 10Elukey: Introduce profile::mariadb::misc::analytics [puppet] - 10https://gerrit.wikimedia.org/r/553742 (https://phabricator.wikimedia.org/T234826) [17:54:20] (03CR) 10Elukey: "fixed a rebase conflict in site.pp, nothing else." [puppet] - 10https://gerrit.wikimedia.org/r/553742 (https://phabricator.wikimedia.org/T234826) (owner: 10Elukey) [18:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for Morning backport window(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200624T1800). [18:00:04] Pchelolo and Jdlrobson: A patch you scheduled for Morning backport window(Max 6 patches) is about to be deployed. 
Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:04] brennen and hashar: My dear minions, it's time we take the moon! Just kidding. Time for Train log triage with CPT deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200624T1800). [18:00:48] here o/. [18:00:52] Jdlrobson: I'll do it [18:00:56] thanks! [18:01:01] I'll begin with yours [18:01:05] mine doesnt touch production [18:01:07] so should be straightforward [18:01:20] \o [18:01:21] (the config to production is a noop) [18:01:42] (03PS3) 10Ppchelko: Enable click tracking in Vector on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607136 (https://phabricator.wikimedia.org/T250282) (owner: 10Jdlrobson) [18:01:47] (03CR) 10Ppchelko: [C: 03+2] Enable click tracking in Vector on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607136 (https://phabricator.wikimedia.org/T250282) (owner: 10Jdlrobson) [18:02:37] (03Merged) 10jenkins-bot: Enable click tracking in Vector on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607136 (https://phabricator.wikimedia.org/T250282) (owner: 10Jdlrobson) [18:03:31] (03PS1) 10Volans: scripts: allow to offline multiple devices at once [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/607578 [18:06:08] !log ppchelko@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: Enable click tracking in Vector on beta cluster gerrit:607136 IS-labs.php (duration: 01m 07s) [18:06:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:33] Jdlrobson: pulled on mwdebug1002, synced the labs config [18:06:47] I guess there's nothing really to test? [18:07:26] looks good to me Pchelolo [18:07:30] ok. 
[18:08:09] (03PS4) 10Ppchelko: Enable MediaModeration on group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607327 (https://phabricator.wikimedia.org/T247943) (owner: 10Cicalese) [18:08:22] CindyCicaleseWMF: your's next [18:08:40] !log ppchelko@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable click tracking in Vector on beta cluster gerrit:607136 IS.php (duration: 01m 05s) [18:08:42] thanks [18:08:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:57] Jdlrobson: yours done [18:09:07] (03CR) 10Ppchelko: [C: 03+2] Enable MediaModeration on group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607327 (https://phabricator.wikimedia.org/T247943) (owner: 10Cicalese) [18:09:16] Pchelolo: hmm that said it doesn't seem to be working on the beta cluster yet. im hoping caching [18:09:34] Jdlrobson: on beta it's pulled when it's merged [18:09:48] so the deploy of IS-labs.php is a noop really [18:10:01] (03Merged) 10jenkins-bot: Enable MediaModeration on group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607327 (https://phabricator.wikimedia.org/T247943) (owner: 10Cicalese) [18:10:03] so, you gotta wait a bit [18:10:16] Pchelolo: ok! will keep a look out [18:10:45] there we go [18:11:28] CindyCicaleseWMF: ok, so yours is on mwdebug1002 [18:11:36] is there a way to test it really? [18:11:56] check it's on special:version? ;P [18:11:56] (03CR) 10Volans: [C: 03+1] "LGTM, feel free to ignore the single comment." (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/603434 (https://phabricator.wikimedia.org/T255409) (owner: 10Kormat) [18:12:01] I guess at this point just see that it shows up on Special:Version on a group0 wiki? [18:12:07] ok [18:12:20] yeah, what Reedy said ;-) [18:12:58] ok, it shows [18:13:20] any other checks you can do? [18:14:04] uploading one of the test images and running the maintenance script from the command line [18:14:31] mmm okey.. 
[18:14:39] Pchelolo: please let me know once you're done :) [18:14:54] sure Urbanecm. testing this one and one more left [18:15:00] ack :) [18:16:22] But, we don't need to test that in the deploy window, do we? The key was seeing that the extension is enabled. [18:16:32] thanks Pchelolo all done here :) [18:17:37] CindyCicaleseWMF: Not really. As long as it's not causing errors, should be fine as is [18:18:12] (03CR) 10Volans: [C: 03+1] "Nice, looks pretty much good to me. I'll leave the specific hadoop logic to you and Andrew." (0316 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/606736 (https://phabricator.wikimedia.org/T244499) (owner: 10Elukey) [18:18:23] CindyCicaleseWMF: done. seems that it all works [18:18:30] syncing [18:18:34] excellent [18:19:02] like, you'd want to upload more of the test files and run the script afterwards, but it doesn't seem to break anything [18:19:23] cool [18:19:29] (03PS5) 10Ppchelko: EventBus: Emit kafka purges for everything [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607298 (https://phabricator.wikimedia.org/T250781) [18:19:38] !log ppchelko@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable MediaModeration on group0 gerrit:607327 (duration: 01m 04s) [18:19:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:51] (03CR) 10Ppchelko: [C: 03+2] EventBus: Emit kafka purges for everything [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607298 (https://phabricator.wikimedia.org/T250781) (owner: 10Ppchelko) [18:20:46] (03Merged) 10jenkins-bot: EventBus: Emit kafka purges for everything [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607298 (https://phabricator.wikimedia.org/T250781) (owner: 10Ppchelko) [18:25:14] !log ppchelko@deploy1001 Synchronized wmf-config/InitialiseSettings.php: EventBus: Emit kafka purges for everything gerrit:607298 (duration: 01m 05s) [18:25:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:29] 
(03PS1) 10Urbanecm: Revert "IS: Cleanup some redundant rows." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607586 (https://phabricator.wikimedia.org/T256279) [18:26:05] Urbanecm: all done here [18:26:13] thanks! [18:26:24] (03CR) 10Urbanecm: [C: 03+2] "B&C" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607586 (https://phabricator.wikimedia.org/T256279) (owner: 10Urbanecm) [18:27:16] (03Merged) 10jenkins-bot: Revert "IS: Cleanup some redundant rows." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607586 (https://phabricator.wikimedia.org/T256279) (owner: 10Urbanecm) [18:29:06] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: dea9214: Revert "IS: Cleanup some redundant rows." (T256279) (duration: 01m 05s) [18:29:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:11] T256279: Upload link changed from local Special:Upload to Special:UploadWizard on Commons - https://phabricator.wikimedia.org/T256279 [18:33:15] (03PS1) 10Urbanecm: Define Rekonstruktion NS for dewiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607588 (https://phabricator.wikimedia.org/T256242) [18:35:09] (03PS1) 10Urbanecm: Set WP as a NS_PROJECT alias for banwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607589 (https://phabricator.wikimedia.org/T255941) [18:35:11] (03PS2) 10Urbanecm: Define Rekonstruktion NS for dewiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607588 (https://phabricator.wikimedia.org/T256242) [18:35:16] (03CR) 10Urbanecm: [C: 03+2] Define Rekonstruktion NS for dewiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607588 (https://phabricator.wikimedia.org/T256242) (owner: 10Urbanecm) [18:36:14] (03Merged) 10jenkins-bot: Define Rekonstruktion NS for dewiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607588 (https://phabricator.wikimedia.org/T256242) (owner: 10Urbanecm) [18:36:16] (03PS1) 10Ppchelko: Enable kafka purges on wikitech 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/607590 (https://phabricator.wikimedia.org/T250781) [18:36:19] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on pc2007 - https://phabricator.wikimedia.org/T255904 (10Papaul) Return information {F31903968} [18:36:38] (03PS2) 10Ppchelko: Enable kafka purges on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607590 (https://phabricator.wikimedia.org/T250781) [18:36:51] (03PS2) 10Urbanecm: Set WP as a NS_PROJECT alias for banwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607589 (https://phabricator.wikimedia.org/T255941) [18:37:12] (03CR) 10Urbanecm: [C: 03+2] Set WP as a NS_PROJECT alias for banwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607589 (https://phabricator.wikimedia.org/T255941) (owner: 10Urbanecm) [18:38:04] (03Merged) 10jenkins-bot: Set WP as a NS_PROJECT alias for banwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607589 (https://phabricator.wikimedia.org/T255941) (owner: 10Urbanecm) [18:38:11] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 2b93e0f: Define Rekonstruktion NS for dewiktionary (T256242) (duration: 01m 05s) [18:38:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:17] T256242: New namespace for german wiktionary - https://phabricator.wikimedia.org/T256242 [18:38:50] (03CR) 10Krinkle: [C: 03+1] webperf: Remove XHGui dependency on MongoDB [puppet] - 10https://gerrit.wikimedia.org/r/603550 (https://phabricator.wikimedia.org/T180761) (owner: 10Dave Pifke) [18:38:56] !log Run mwscript namespaceDupes.php dewiktionary --fix (T256242) [18:38:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:14] (03PS3) 10Ppchelko: Enable kafka purges on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607590 (https://phabricator.wikimedia.org/T250781) [18:41:14] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: c6d6c85: Set WP as 
a NS_PROJECT alias for banwiki (T255941) (duration: 01m 06s) [18:41:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:18] T255941: Add WP alias on banwiki - https://phabricator.wikimedia.org/T255941 [18:41:54] !log Run mwscript namespaceDupes.php --wiki=banwiki --fix (T255941) [18:41:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:17] 10Operations, 10ops-ulsfo, 10DC-Ops: replace msw[12]-ulsfo with new switches - https://phabricator.wikimedia.org/T256300 (10RobH) [18:42:27] 10Operations, 10ops-ulsfo, 10DC-Ops: replace msw[12]-ulsfo with new switches - https://phabricator.wikimedia.org/T256300 (10RobH) [18:42:34] !log mwscript namespaceDupes.php --wiki=banwiki --add-prefix=T255941 --fix (T255941) [18:42:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:02] (03PS4) 10Urbanecm: Set namespace aliases for guwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605973 (https://phabricator.wikimedia.org/T255358) (owner: 10Jayprakash12345) [18:43:08] (03CR) 10Ppchelko: [C: 04-2] "Blocked by deployment of I948b35f262383ff5edd9633694811a5cd4596500" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607590 (https://phabricator.wikimedia.org/T250781) (owner: 10Ppchelko) [18:43:12] (03CR) 10Urbanecm: [C: 03+2] Set namespace aliases for guwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605973 (https://phabricator.wikimedia.org/T255358) (owner: 10Jayprakash12345) [18:43:29] (03PS1) 10Ppchelko: Disable HTCP purging everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607593 (https://phabricator.wikimedia.org/T250781) [18:44:28] (03Merged) 10jenkins-bot: Set namespace aliases for guwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/605973 (https://phabricator.wikimedia.org/T255358) (owner: 10Jayprakash12345) [18:44:48] (03CR) 10Ppchelko: "@Ema this is blocked until next week when some code changes get on the train, but in the meantime 
could you verify that 239.128.0.115 also" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607593 (https://phabricator.wikimedia.org/T250781) (owner: 10Ppchelko) [18:46:47] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 2a1dfc5: Set namespace aliases for guwiki (T255358) (duration: 01m 05s) [18:46:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:52] T255358: Set namespace aliases in guwikipedia - https://phabricator.wikimedia.org/T255358 [18:47:03] (03PS1) 10Ppchelko: Cleanup: remove temporary wmgDisableHTCP variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607596 (https://phabricator.wikimedia.org/T250781) [18:47:11] !log joal@deploy1001 Started deploy [analytics/refinery@1112749]: Regular analytics weekly train [analytics/refinery@1112749] [18:47:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:36] jouncebot: now [18:47:36] For the next 0 hour(s) and 12 minute(s): Morning backport window(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200624T1800) [18:47:36] For the next 0 hour(s) and 12 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200624T1800) [18:47:38] !log mwscript namespaceDupes.php --wiki=guwiki --fix (T255358) [18:47:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:12] * Urbanecm is done [18:49:24] !log Morning B&C deploy window is done [18:49:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:06] (03CR) 10CRusnov: [C: 03+1] "lgtm" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/607578 (owner: 10Volans) [18:52:58] (03Abandoned) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/604797 (owner: 10Herron) [18:53:02] !log joal@deploy1001 Finished deploy [analytics/refinery@1112749]: Regular analytics weekly train [analytics/refinery@1112749] (duration: 05m 50s) 
[18:53:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:33] (03CR) 10Dzahn: [C: 03+2] scap: Stop cloning over /p/ [puppet] - 10https://gerrit.wikimedia.org/r/507072 (https://phabricator.wikimedia.org/T218844) (owner: 10Paladox) [18:53:37] !log joal@deploy1001 Started deploy [analytics/refinery@1112749] (thin): Regular analytics weekly train THIN [analytics/refinery@1112749] [18:53:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:44] arr [18:53:46] !log joal@deploy1001 Finished deploy [analytics/refinery@1112749] (thin): Regular analytics weekly train THIN [analytics/refinery@1112749] (duration: 00m 09s) [18:53:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:51] jouncebot: now [18:53:51] For the next 0 hour(s) and 6 minute(s): Morning backport window(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200624T1800) [18:53:51] For the next 0 hour(s) and 6 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200624T1800) [18:55:43] (03CR) 10Dzahn: "Why? This seems like a downgrade." [puppet] - 10https://gerrit.wikimedia.org/r/607570 (owner: 10Ssingh) [18:56:54] (03CR) 10Dzahn: "you should probably ask godog about this" [puppet] - 10https://gerrit.wikimedia.org/r/607570 (owner: 10Ssingh) [18:58:12] 10Operations, 10Traffic: Certain links being rejected by caching if opened in Internet Explorer - https://phabricator.wikimedia.org/T256302 (10Urbanecm) [19:00:04] brennen and hashar: May I have your attention please! Mediawiki train - American+European Version. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200624T1900) [19:00:08] 10Operations, 10Traffic: Certain links being rejected by caching if opened in Internet Explorer - https://phabricator.wikimedia.org/T256302 (10Urbanecm) [19:01:38] !log train 1.35.0-wmf.38: finished triage meeting, clear to proceed to group 1 (T254175) [19:01:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:42] T254175: 1.35.0-wmf.38 deployment blockers - https://phabricator.wikimedia.org/T254175 [19:01:43] (03CR) 10Volans: [C: 03+2] scripts: allow to offline multiple devices at once [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/607578 (owner: 10Volans) [19:04:29] (03PS1) 10Brennen Bearnes: group1 wikis to 1.35.0-wmf.38 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607599 [19:04:31] (03CR) 10Brennen Bearnes: [C: 03+2] group1 wikis to 1.35.0-wmf.38 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607599 (owner: 10Brennen Bearnes) [19:05:25] (03CR) 10Dzahn: Add initial puppetization for libraryupgrader (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/607452 (https://phabricator.wikimedia.org/T173478) (owner: 10Legoktm) [19:05:27] (03Merged) 10jenkins-bot: group1 wikis to 1.35.0-wmf.38 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607599 (owner: 10Brennen Bearnes) [19:05:37] (just noticed there's a fresh blocker; holding while i catch up) [19:06:25] (gah, ok, for .39. false alarm.) [19:06:34] brennen: fyi, i just merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/507072 in the unlikely event there is some issue with scap [19:06:58] mutante: thanks for heads up, will keep an eye out. 
[19:07:00] the git clone URL changed (like it already did for most things) to keep up with new gerrit versions [19:07:05] cool, thx [19:10:22] !log brennen@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.35.0-wmf.38 [19:10:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:28] !log brennen@deploy1001 Synchronized php: group1 wikis to 1.35.0-wmf.38 (duration: 01m 04s) [19:11:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:45] (03CR) 10Dzahn: [C: 03+1] "+1 but pending db creation. I don't know how the password will get into PrivateSettings.ini though." [puppet] - 10https://gerrit.wikimedia.org/r/603550 (https://phabricator.wikimedia.org/T180761) (owner: 10Dave Pifke) [19:15:01] rolling back. [19:15:44] (03CR) 10Dzahn: [C: 04-1] "Systemd::Sysuser[planet]: parameter 'content' index 0 expects a value for key 'id'" [puppet] - 10https://gerrit.wikimedia.org/r/606287 (owner: 10Dzahn) [19:16:07] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:17:39] !log brennen@deploy1001 rebuilt and synchronized wikiversions files: Revert group1 wikis to 1.35.0-wmf.37 [19:17:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:00] (03PS1) 10Brennen Bearnes: Revert "group1 wikis to 1.35.0-wmf.38" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607602 [19:20:02] (03CR) 10Brennen Bearnes: [C: 03+2] Revert "group1 wikis to 1.35.0-wmf.38" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607602 (owner: 10Brennen Bearnes) [19:20:56] (03Merged) 10jenkins-bot: Revert "group1 wikis to 1.35.0-wmf.38" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607602 (owner: 10Brennen Bearnes) [19:22:19] (03PS1) 10Reedy: [DNM] Update 
size dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607603 [19:23:14] (03CR) 10jerkins-bot: [V: 04-1] [DNM] Update size dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607603 (owner: 10Reedy) [19:24:57] 10Operations, 10netops: Peer with SFMIX at ulsfo (May 2020) - https://phabricator.wikimedia.org/T251536 (10RobH) [19:27:56] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:29:36] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is WARNING: Test Retrieve aggregated feed content for April 29, 2016 responds with unexpected value at path = Missing keys: [mostread] https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:29:38] (03Abandoned) 10Reedy: [DNM] Update size dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607603 (owner: 10Reedy) [19:30:14] (03PS1) 10Reedy: Update size dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607605 [19:34:25] (03PS1) 10Ottomata: Revert "Migrate SearchSatisfaction from EventLogging to EventGate on group1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607606 [19:35:06] (03PS3) 10Dzahn: planet: replace system/user group with systemd-sysuser [puppet] - 10https://gerrit.wikimedia.org/r/606287 [19:35:08] (03PS2) 10Ottomata: Revert "Migrate SearchSatisfaction from EventLogging to EventGate on group1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607606 [19:35:13] (03PS3) 10Ottomata: Revert "Migrate SearchSatisfaction from EventLogging to EventGate on group1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607606 [19:36:58] (03CR) 10Ottomata: [C: 03+2] Revert "Migrate SearchSatisfaction from EventLogging to 
EventGate on group1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607606 (owner: 10Ottomata) [19:37:37] 10Operations, 10Wikimedia-Mailing-lists: Create secondary mailinglist for german arbcom - https://phabricator.wikimedia.org/T256306 (10Luke081515) [19:38:52] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Revert Migrate SearchSatisfaction from EventLogging to EventGate on group1 - T249261 (duration: 01m 06s) [19:38:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:57] T249261: Vertical: Migrate SearchSatisfaction EventLogging event stream to Event Platform - https://phabricator.wikimedia.org/T249261 [19:39:05] (03CR) 10Dzahn: [C: 04-1] "this was meant to be a test to use the new system-sysuser. fyi, it currently fails with: Duplicate declaration: Exec[Refresh sysusers] is" [puppet] - 10https://gerrit.wikimedia.org/r/606287 (owner: 10Dzahn) [19:44:04] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:47:53] (03CR) 10Volans: [C: 03+1] "LGTM, feel free to merge and test them." (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/604678 (https://phabricator.wikimedia.org/T246890) (owner: 10Jbond) [19:50:43] (03CR) 10Krinkle: [C: 03+1] "It will be added there manually via the deployment host in the private/.git repo." 
[puppet] - 10https://gerrit.wikimedia.org/r/603550 (https://phabricator.wikimedia.org/T180761) (owner: 10Dave Pifke) [19:51:26] (03CR) 10Krinkle: [C: 03+1] "The variables are already created in prod but set to null, which wmf-config uses to decide whether or not to send debug profiles, so this " [puppet] - 10https://gerrit.wikimedia.org/r/603550 (https://phabricator.wikimedia.org/T180761) (owner: 10Dave Pifke) [19:54:35] jouncebot: now [19:54:35] For the next 1 hour(s) and 5 minute(s): Mediawiki train - American+European Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200624T1900) [19:56:34] (03CR) 10Krinkle: "Config impact:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607605 (owner: 10Reedy) [19:59:46] (03PS1) 10CDanis: fix multiple invocations of systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/607609 [20:00:04] halfak and accraze: It is that lovely time of the day again! You are hereby commanded to deploy Services – Graphoid / Citoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200624T2000). [20:00:18] Deploying ores! 
[20:00:28] !log halfak@deploy1001 Started deploy [ores/deploy@1b87365]: T254505 [20:00:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:33] T254505: Rebuild all models with revscoring-2.8.2 - https://phabricator.wikimedia.org/T254505 [20:01:03] (03CR) 10jerkins-bot: [V: 04-1] fix multiple invocations of systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/607609 (owner: 10CDanis) [20:01:56] (03PS2) 10CDanis: fix multiple invocations of systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/607609 [20:03:07] (03CR) 10jerkins-bot: [V: 04-1] fix multiple invocations of systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/607609 (owner: 10CDanis) [20:03:44] 10Operations, 10Commons, 10MediaWiki-File-management, 10Traffic: Cached thumbnails and originals are sometimes not being purged correctly/quickly - https://phabricator.wikimedia.org/T256313 (10AntiCompositeNumber) [20:04:28] 10Operations, 10Commons, 10MediaWiki-File-management, 10Traffic: Cached thumbnails and originals are sometimes not being purged correctly/quickly - https://phabricator.wikimedia.org/T256313 (10AntiCompositeNumber) [20:06:05] !log bsitzmann@deploy1001 Started deploy [mobileapps/deploy@80c763d]: Update mobileapps to a413db4f [20:06:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:29] 10Operations, 10Commons, 10MediaWiki-File-management, 10Traffic: Cached thumbnails and originals are sometimes not being purged correctly/quickly - https://phabricator.wikimedia.org/T256313 (10Wilfredor) this takes a few days before the image is actually updated, thus making it impossible to correct image... [20:08:12] All looks good on the canary. Continuing. 
[20:09:43] !log bsitzmann@deploy1001 Finished deploy [mobileapps/deploy@80c763d]: Update mobileapps to a413db4f (duration: 03m 37s) [20:09:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:22] (03PS14) 10Jbond: wikidough: testing merge [puppet] - 10https://gerrit.wikimedia.org/r/607543 [20:14:36] !log halfak@deploy1001 Finished deploy [ores/deploy@1b87365]: T254505 (duration: 14m 08s) [20:14:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:40] T254505: Rebuild all models with revscoring-2.8.2 - https://phabricator.wikimedia.org/T254505 [20:15:51] 10Operations, 10MediaWiki-Vagrant, 10phan: It should be possible to install php-ast using apt-get on MediaWiki-Vagrant - https://phabricator.wikimedia.org/T234240 (10Lokal_Profil) >>! In T234240#6230719, @Lokal_Profil wrote: > @Mainframe98 Many thanks ! Now as a shell script in case that might come in handy... [20:17:38] (03PS15) 10Jbond: wikidough: testing merge [puppet] - 10https://gerrit.wikimedia.org/r/607543 [20:18:57] (03PS1) 10Herron: lists: copy incoming mail to standby server [puppet] - 10https://gerrit.wikimedia.org/r/607612 (https://phabricator.wikimedia.org/T224586) [20:20:10] (03CR) 10jerkins-bot: [V: 04-1] lists: copy incoming mail to standby server [puppet] - 10https://gerrit.wikimedia.org/r/607612 (https://phabricator.wikimedia.org/T224586) (owner: 10Herron) [20:22:19] (03PS16) 10Jbond: wikidough: testing merge [puppet] - 10https://gerrit.wikimedia.org/r/607543 [20:23:08] (03PS2) 10Herron: lists: copy incoming mail to standby server [puppet] - 10https://gerrit.wikimedia.org/r/607612 (https://phabricator.wikimedia.org/T224586) [20:24:33] Amir1: restarting fpm on that box seems reasonable, but i haven't actually done that before. is this what i want? 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#PHP7_opcache_health [20:25:00] (03PS17) 10Jbond: wikidough: testing merge [puppet] - 10https://gerrit.wikimedia.org/r/607543 [20:25:12] restart on app server in question, re-try deploy? [20:25:15] Looks like ORES is all good. Declaring victory [20:25:23] brennen: I have never done it before, elukey did it (but he's away) [20:25:45] (03PS1) 10BearND: mobileapps: deploy 2020-06-24-200029-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/607614 [20:25:52] brennen: I don't know if it restarts fpm [20:25:59] Amir1: ack. i'll ask around. [20:26:14] Thanks! [20:26:25] (03CR) 10BearND: [C: 03+2] mobileapps: deploy 2020-06-24-200029-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/607614 (owner: 10BearND) [20:26:46] (03PS3) 10CDanis: fix multiple invocations of systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/607609 [20:26:52] (03Merged) 10jenkins-bot: mobileapps: deploy 2020-06-24-200029-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/607614 (owner: 10BearND) [20:28:30] (03CR) 10CDanis: "PCC lgtm https://puppet-compiler.wmflabs.org/compiler1001/23461/" [puppet] - 10https://gerrit.wikimedia.org/r/607609 (owner: 10CDanis) [20:28:32] !log bsitzmann@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'mobileapps' for release 'staging' . [20:28:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:49] brennen: https://phabricator.wikimedia.org/T243009 This means deploying without force should restart fpm [20:29:30] by force I mean --force not Star Wars force [20:30:09] !log bsitzmann@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'mobileapps' for release 'production' . [20:30:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:19] sorry I'm just catching up -- is there a server with corrupted opcache? 
[20:30:24] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:30:34] mw1287 [20:30:45] Isn't it the same one that had the opcache issue before? [20:30:51] brennen: that's the command, and it is always safe [20:30:54] T256305 [20:30:54] T256305: Fatal Error: Class MediaWiki\HookContainer\HookRunner contains 1 abstract method and must therefore be declared abstract - https://phabricator.wikimedia.org/T256305 [20:30:58] Amir1: at this point i think half the fleet has had it 🙃 [20:31:17] :( [20:32:04] !log bsitzmann@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'mobileapps' for release 'production' . [20:32:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:18] !log restarting php-fpm on mw1287 T256305 [20:32:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:41] cdanis: cool, thanks. i'll give the deploy another shot. [20:35:21] hm [20:35:29] we should prrrrrobably make deployers be able to run that command [20:36:19] hrm, yeah, i didn't attempt it, so wasn't aware i _couldn't_. [20:37:39] (03PS1) 10Brennen Bearnes: group1 wikis to 1.35.0-wmf.38 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607616 [20:37:39] can you try it now just to make sure? :D [20:37:41] (03CR) 10Brennen Bearnes: [C: 03+2] group1 wikis to 1.35.0-wmf.38 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607616 (owner: 10Brennen Bearnes) [20:37:53] cdanis: yep, one sec [20:38:29] (03Merged) 10jenkins-bot: group1 wikis to 1.35.0-wmf.38 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607616 (owner: 10Brennen Bearnes) [20:38:46] cdanis: nope, password prompt. 
[20:38:50] ack, ty [20:38:58] np [20:39:18] brennen: cdanis T243009 [20:39:19] T243009: Make scap skip restarting php-fpm when using --force - https://phabricator.wikimedia.org/T243009 [20:39:23] ... huh. apparently I wrote https://gerrit.wikimedia.org/r/c/operations/puppet/+/531291 almost a year ago? I have no memory of this 😂 [20:39:32] Amir1: yeah I just want people to be able to invoke it by hand, when it is necessary [20:39:44] cool [20:40:05] Thank you! [20:41:22] !log train 1.35.0-wmf.38: attempting to roll forward to group1 after php-fpm restart on mw1287 (T256305, T254175) [20:41:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:27] T254175: 1.35.0-wmf.38 deployment blockers - https://phabricator.wikimedia.org/T254175 [20:41:27] T256305: Fatal Error: Class MediaWiki\HookContainer\HookRunner contains 1 abstract method and must therefore be declared abstract - https://phabricator.wikimedia.org/T256305 [20:41:36] !log brennen@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.35.0-wmf.38 [20:41:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:55] (03PS1) 10CDanis: admin: allow deployers to execute restart-php7.2-fpm [puppet] - 10https://gerrit.wikimedia.org/r/607617 [20:42:43] !log brennen@deploy1001 Synchronized php: group1 wikis to 1.35.0-wmf.38 (duration: 01m 06s) [20:42:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:32] so far so good. 
[20:44:01] (03CR) 10RLazarus: [C: 03+1] admin: allow deployers to execute restart-php7.2-fpm [puppet] - 10https://gerrit.wikimedia.org/r/607617 (owner: 10CDanis) [20:44:28] FWIW, every sync should restart php-fpm *if* it determines it's within some threshold of the cache limit [20:44:47] which is to say: deployers kinda/sorta can do this [20:45:02] I don't know *how* offhand [20:45:04] thcipriani: right, but, I don't think that perfectly tracks corruption occurring [20:45:15] oh, no [20:45:26] oh, you just mean, right okay [20:45:27] yeah :) [20:45:31] I get what you're saying now [20:46:20] there could probably be a scap command for doing it safely if that's something we want/need [20:46:35] this is a bit more direct in that it always does it; there's still a TODO at the top of the python file to add some concurrency-limiting via poolcounter; in theory you could cause an outage by invoking it en masse across the whole cluster simultaneously ... but, well, there's plenty of ways for deployers to already do that ;) [20:46:57] :D [20:47:05] scap cmd for it might not be the worst idea, but being able to invoke directly on a box seems fine. [20:47:37] scap is "smart" about restarts for some value of smart [20:47:46] that is, it does each group of servers in small batches [20:48:01] I think 10% per group at once, IIRC [20:48:32] (03CR) 10CDanis: [C: 03+2] "PCC LGTM https://puppet-compiler.wmflabs.org/compiler1003/23462/mw1281.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/607617 (owner: 10CDanis) [20:48:35] at any rate: someone smarter than I determined that this rate wouldn't cause an outage [20:50:09] brennen: if you have a minute, can you confirm it works for you on mw1287 now? 
just reran puppet there [20:51:42] there was a scap command for restarts that we ripped out a couple of years ago :) [20:51:58] it was sketchy then [20:51:59] mostly I wish we had a working opcache [20:53:43] the way scap does restarts now is mostly not in scap; shells out to some script that queries php that does restarts :| [20:53:54] cdanis: https://phabricator.wikimedia.org/P11654 [20:54:18] ehm [20:54:20] scap just handles the grouping, rate, and ssh stuffs [20:55:08] * cdanis facepalm [20:58:06] brennen: please try now? [20:59:53] that looks like it worked [21:00:10] cdanis: yep! https://phabricator.wikimedia.org/P11654 [21:00:25] although unsure about the warning [21:00:32] [WARNING] LB lvs1015:9090 reports pool api_80 as disabled/up/not pooled, should be enabled/up/pooled [21:00:36] I think that's fine [21:00:48] cool. and thanks for the assist! [21:04:57] (03PS1) 10CDanis: admin: actually allow deployers to execute restart-php7.2-fpm [puppet] - 10https://gerrit.wikimedia.org/r/607621 [21:07:26] (03CR) 10RLazarus: [C: 03+1] admin: actually allow deployers to execute restart-php7.2-fpm [puppet] - 10https://gerrit.wikimedia.org/r/607621 (owner: 10CDanis) [21:09:39] (03PS2) 10CDanis: admin: actually allow deployers to execute restart-php7.2-fpm [puppet] - 10https://gerrit.wikimedia.org/r/607621 [21:11:43] (03CR) 10CDanis: [C: 03+2] admin: actually allow deployers to execute restart-php7.2-fpm [puppet] - 10https://gerrit.wikimedia.org/r/607621 (owner: 10CDanis) [21:19:22] (03CR) 10Dzahn: [C: 04-1] "see: https://gerrit.wikimedia.org/r/c/operations/puppet/+/607609" [puppet] - 10https://gerrit.wikimedia.org/r/606287 (owner: 10Dzahn) [21:24:04] (03PS3) 10Ottomata: Add eventlogging_legacy job to camus improt and refine EventLogging events from EventGate [puppet] - 10https://gerrit.wikimedia.org/r/593610 (https://phabricator.wikimedia.org/T249261) [21:24:38] (03CR) 10Cwhite: [C: 03+2] set disable_fsnotify for all current mtail usage [puppet] - 
10https://gerrit.wikimedia.org/r/605688 (https://phabricator.wikimedia.org/T251466) (owner: 10Cwhite) [21:25:17] (03CR) 10jerkins-bot: [V: 04-1] Add eventlogging_legacy job to camus improt and refine EventLogging events from EventGate [puppet] - 10https://gerrit.wikimedia.org/r/593610 (https://phabricator.wikimedia.org/T249261) (owner: 10Ottomata) [21:25:55] (03PS4) 10Ottomata: Add eventlogging_legacy job to camus ingest and refine EventLogging events from EventGate [puppet] - 10https://gerrit.wikimedia.org/r/593610 (https://phabricator.wikimedia.org/T249261) [21:27:05] (03CR) 10jerkins-bot: [V: 04-1] Add eventlogging_legacy job to camus ingest and refine EventLogging events from EventGate [puppet] - 10https://gerrit.wikimedia.org/r/593610 (https://phabricator.wikimedia.org/T249261) (owner: 10Ottomata) [21:27:26] (03PS5) 10Ottomata: \Add eventlogging_legacy job to camus ingest and refine EventLogging events from EventGate [puppet] - 10https://gerrit.wikimedia.org/r/593610 (https://phabricator.wikimedia.org/T249261) [21:28:36] (03CR) 10jerkins-bot: [V: 04-1] \Add eventlogging_legacy job to camus ingest and refine EventLogging events from EventGate [puppet] - 10https://gerrit.wikimedia.org/r/593610 (https://phabricator.wikimedia.org/T249261) (owner: 10Ottomata) [21:29:01] (03PS6) 10Ottomata: \Add eventlogging_legacy job to camus ingest and refine EventLogging events from EventGate [puppet] - 10https://gerrit.wikimedia.org/r/593610 (https://phabricator.wikimedia.org/T249261) [21:29:05] (03CR) 10Cwhite: [C: 03+2] Filter the files watched for modifications to only those configured by `-logs` and `-progs`. 
[debs/mtail] (cross_dist_build) - 10https://gerrit.wikimedia.org/r/607144 (https://phabricator.wikimedia.org/T255776) (owner: 10Cwhite) [21:30:12] (03CR) 10jerkins-bot: [V: 04-1] \Add eventlogging_legacy job to camus ingest and refine EventLogging events from EventGate [puppet] - 10https://gerrit.wikimedia.org/r/593610 (https://phabricator.wikimedia.org/T249261) (owner: 10Ottomata) [21:31:24] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:32:30] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:38:20] (03PS2) 10Dzahn: icinga: move ferm rules from module to profile [puppet] - 10https://gerrit.wikimedia.org/r/606730 (https://phabricator.wikimedia.org/T114209) [21:42:23] 10Operations, 10Android-app-Bugs, 10Parsoid, 10Traffic, and 6 others: Right-to-Left directionality problem with refs - https://phabricator.wikimedia.org/T251983 (10bearND) @Traffic Is mwmaint1002 still the host to use when using `mwscript purgeList.php`? On that host I've tried `echo 'https://meta.wikimed... [21:44:29] 10Operations, 10Commons, 10MediaWiki-File-management, 10Traffic: Cached thumbnails and originals are sometimes not being purged correctly/quickly - https://phabricator.wikimedia.org/T256313 (10King_of_Hearts) I can report having seen this problem in several incidences as well, including https://upload.wiki... 
[21:45:28] !log install mtail 3.0.0~rc35+wmf2 on logstash1007 - T255776 [21:45:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:45:32] T255776: mtail "syscall spam" / high cpu usage on logstash1023 - https://phabricator.wikimedia.org/T255776 [21:46:17] (03PS7) 10Ottomata: \Add eventlogging_legacy job to camus ingest and refine EventLogging events from EventGate [puppet] - 10https://gerrit.wikimedia.org/r/593610 (https://phabricator.wikimedia.org/T249261) [21:47:29] (03CR) 10jerkins-bot: [V: 04-1] \Add eventlogging_legacy job to camus ingest and refine EventLogging events from EventGate [puppet] - 10https://gerrit.wikimedia.org/r/593610 (https://phabricator.wikimedia.org/T249261) (owner: 10Ottomata) [21:51:38] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/23463/icinga1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/606730 (https://phabricator.wikimedia.org/T114209) (owner: 10Dzahn) [21:59:23] (03CR) 10Dzahn: "noop on both prod icinga servers" [puppet] - 10https://gerrit.wikimedia.org/r/606730 (https://phabricator.wikimedia.org/T114209) (owner: 10Dzahn) [22:01:38] (03PS2) 10Dzahn: site: add releases role to releases1002/2002 [puppet] - 10https://gerrit.wikimedia.org/r/606022 (https://phabricator.wikimedia.org/T247652) [22:02:00] RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [22:02:02] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:02:04] 10Operations, 10Android-app-Bugs, 10Parsoid, 10Traffic, and 6 others: Right-to-Left directionality problem with refs - https://phabricator.wikimedia.org/T251983 (10bearND) a:05bearND→03None [22:02:18] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase 
[22:02:18] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:03:34] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:13:14] (03PS1) 10Cwhite: mtail: ensure package present and set logstash1007 mtail::from_component [puppet] - 10https://gerrit.wikimedia.org/r/607630 (https://phabricator.wikimedia.org/T255776) [22:15:26] PROBLEM - MariaDB Replica Lag: s4 on db1145 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1041.19 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [22:20:09] (03PS2) 10Cwhite: mtail: set logstash1007 to install mtail from_component [puppet] - 10https://gerrit.wikimedia.org/r/607630 (https://phabricator.wikimedia.org/T255776) [22:23:24] (03CR) 10Cwhite: [C: 03+2] "PCC checks out https://puppet-compiler.wmflabs.org/compiler1003/23466/" [puppet] - 10https://gerrit.wikimedia.org/r/607630 (https://phabricator.wikimedia.org/T255776) (owner: 10Cwhite) [22:27:00] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/23464/releases2002.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/606022 (https://phabricator.wikimedia.org/T247652) (owner: 10Dzahn) [22:40:37] PROBLEM - Check systemd state on releases1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:43:34] (03PS1) 10Dzahn: add logstash1030 and logstash1031 [dns] - 10https://gerrit.wikimedia.org/r/607634 (https://phabricator.wikimedia.org/T256139) [22:44:00] (03CR) 10jerkins-bot: [V: 04-1] add logstash1030 and logstash1031 [dns] - 10https://gerrit.wikimedia.org/r/607634 (https://phabricator.wikimedia.org/T256139) (owner: 10Dzahn) [22:45:39] (03PS2) 10Dzahn: add logstash1030 and logstash1031 [dns] - 10https://gerrit.wikimedia.org/r/607634 (https://phabricator.wikimedia.org/T256139) [22:47:04] (03CR) 10Legoktm: Add initial puppetization for libraryupgrader (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/607452 (https://phabricator.wikimedia.org/T173478) (owner: 10Legoktm) [22:47:21] (03PS3) 10Legoktm: Add initial puppetization for libraryupgrader [puppet] - 10https://gerrit.wikimedia.org/r/607452 (https://phabricator.wikimedia.org/T173478) [22:48:21] PROBLEM - Check systemd state on releases2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:54:05] PROBLEM - jenkins_service_running on releases1002 is CRITICAL: PROCS CRITICAL: 2 processes with regex args .*/bin/java .*-jar /usr/share/jenkins/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins [22:54:31] ^ me, not in prod, just got added [22:55:20] ACKNOWLEDGEMENT - Check systemd state on releases1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
daniel_zahn new install https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:55:20] ACKNOWLEDGEMENT - jenkins_service_running on releases1002 is CRITICAL: PROCS CRITICAL: 2 processes with regex args .*/bin/java .*-jar /usr/share/jenkins/jenkins.war daniel_zahn new install https://wikitech.wikimedia.org/wiki/Jenkins [22:55:25] ACKNOWLEDGEMENT - Check systemd state on releases2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn new install https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for Evening backport window(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200624T2300). [23:00:51] (03PS1) 10Dzahn: add logstash2030 and logstash2031 [dns] - 10https://gerrit.wikimedia.org/r/607637 (https://phabricator.wikimedia.org/T256139) [23:01:16] (03CR) 10jerkins-bot: [V: 04-1] add logstash2030 and logstash2031 [dns] - 10https://gerrit.wikimedia.org/r/607637 (https://phabricator.wikimedia.org/T256139) (owner: 10Dzahn) [23:02:49] !log releases1002/2002 - disabling puppet, removing failing cron job to pull deployment_charts (because /srv/deployment-charts does not exist yet) [23:02:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:03:03] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 51 probes of 568 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [23:04:13] (03PS2) 10Dzahn: add logstash2030 and logstash2031 [dns] - 10https://gerrit.wikimedia.org/r/607637 (https://phabricator.wikimedia.org/T256139) [23:04:36] (03CR) 10jerkins-bot: [V: 04-1] add logstash2030 and logstash2031 [dns] - 10https://gerrit.wikimedia.org/r/607637 
(https://phabricator.wikimedia.org/T256139) (owner: 10Dzahn) [23:05:28] (03CR) 10Dzahn: [C: 04-2] add logstash2030 and logstash2031 [dns] - 10https://gerrit.wikimedia.org/r/607637 (https://phabricator.wikimedia.org/T256139) (owner: 10Dzahn) [23:07:35] (03PS3) 10Dzahn: add logstash2030 and logstash2031 [dns] - 10https://gerrit.wikimedia.org/r/607637 (https://phabricator.wikimedia.org/T256139) [23:08:43] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 48 probes of 568 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [23:10:04] (03CR) 10Dzahn: [C: 04-2] "logstash1020-22 and logstash2020-22 have IPv6 records but all others do not. should they all have them?" [dns] - 10https://gerrit.wikimedia.org/r/607637 (https://phabricator.wikimedia.org/T256139) (owner: 10Dzahn) [23:10:15] (03CR) 10Dzahn: "logstash1020-22 and logstash2020-22 have IPv6 records but all others do not. should they all have them?" [dns] - 10https://gerrit.wikimedia.org/r/607634 (https://phabricator.wikimedia.org/T256139) (owner: 10Dzahn) [23:28:47] (03PS1) 10Dzahn: releases::mediawiki:: support buster / PHP 7.3 [puppet] - 10https://gerrit.wikimedia.org/r/607641 (https://phabricator.wikimedia.org/T247652) [23:30:41] 10Operations, 10Performance-Team, 10Traffic: Send peering requests to AS with the worst TTFB - https://phabricator.wikimedia.org/T219486 (10faidon) 05Open→03Resolved I took a look at that list above. It's really not very actionable -- most of these are very large networks that have a restrictive settleme... 
[23:37:23] (03CR) 10Volans: "> Patch Set 3:" [dns] - 10https://gerrit.wikimedia.org/r/607637 (https://phabricator.wikimedia.org/T256139) (owner: 10Dzahn) [23:41:45] 10Operations, 10Continuous-Integration-Infrastructure, 10serviceops, 10Patch-For-Review: replace backends for releases.wikimedia.org with buster VMs - https://phabricator.wikimedia.org/T247652 (10Dzahn) @hashar For some reason on releases1002/2002 (new VMs on buster), after applying the releases role, one... [23:42:35] RECOVERY - Check systemd state on releases1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:42:55] RECOVERY - jenkins_service_running on releases1002 is OK: PROCS OK: 1 process with regex args .*/bin/java .*-jar /usr/share/jenkins/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins [23:43:12] !log releases1002 - kill rogue jenkins process, start jenkins with systemctl start jenkins (T247652) [23:43:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:43:17] T247652: replace backends for releases.wikimedia.org with buster VMs - https://phabricator.wikimedia.org/T247652 [23:44:09] RECOVERY - Check systemd state on releases2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:44:46] !log releases2002 - systemctl stop jenkins, kill 15244 (rogue jenkins process), start jenkins with systemctl start jenkins (T247652) [23:44:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:49:51] (03CR) 10Dzahn: [C: 04-2] "> Patch Set 3:" [dns] - 10https://gerrit.wikimedia.org/r/607637 (https://phabricator.wikimedia.org/T256139) (owner: 10Dzahn) [23:51:53] RECOVERY - MariaDB Replica Lag: s4 on db1145 is OK: OK slave_sql_lag Replication lag: 0.16 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [23:57:04] (03PS1) 10Volans: scripts: unset the face too in the 
offline script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/607644 [23:57:50] (03PS1) 10Dzahn: admins: add system user for jenkins, reserve UID 903 [puppet] - 10https://gerrit.wikimedia.org/r/607645 (https://phabricator.wikimedia.org/T224591) [23:59:09] (03PS2) 10Dzahn: admins: add system user for jenkins, reserve UID 903 [puppet] - 10https://gerrit.wikimedia.org/r/607645 (https://phabricator.wikimedia.org/T224591)