[01:30:57] (PS1) Zoranzoki21: Add *.archives.go.jp to the wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/553875 (https://phabricator.wikimedia.org/T238476)
[01:34:58] Hi, I am sorry, but can anyone merge this https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/553787/ ASAP
[01:46:17] Operations, ops-codfw: Degraded RAID on ganeti2002 - https://phabricator.wikimedia.org/T239009 (Papaul) @akosiaris please see @wiki_willy comment above. Thanks.
[01:52:26] Operations, ops-codfw: Decommission old mw2231/WMF6435 replaced with WMF6403 - https://phabricator.wikimedia.org/T232126 (Papaul) Open→Resolved I update Netbox
[01:55:48] Reedy?
[01:59:35] Operations, ops-codfw, DC-Ops: Move kafka200[123] to logstash202[012] - https://phabricator.wikimedia.org/T235125 (Papaul) Open→Resolved Update Netxbox
[02:00:28] Operations, ops-codfw, ops-eqdfw: scs-[a1-c1]-codfw redundancy power test - https://phabricator.wikimedia.org/T239345 (Papaul) Open→Resolved Resolving this task since redundancy power is working in codfw
[02:09:01] PROBLEM - Memory correctable errors -EDAC- on mw1239 is CRITICAL: 4.001 ge 4 https://wikitech.wikimedia.org/wiki/Monitoring/Memory%23Memory_correctable_errors_-EDAC- https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=mw1239&var-datasource=eqiad+prometheus/ops
[02:51:39] (PS1) Andrew Bogott: codfw1dev: upgrade nova, keystone and glance to Ocata [puppet] - https://gerrit.wikimedia.org/r/553876
[03:04:59] (PS2) Andrew Bogott: codfw1dev: upgrade nova, keystone and glance to Ocata [puppet] - https://gerrit.wikimedia.org/r/553876
[03:05:01] (PS1) Andrew Bogott: Openstack keystone: add more missing manifests for Ocata [puppet] - https://gerrit.wikimedia.org/r/553877 (https://phabricator.wikimedia.org/T237749)
[03:05:03] (PS1) Andrew Bogott: Openstack Neutron: add missing files for Openstack Ocata [puppet] -
https://gerrit.wikimedia.org/r/553878 (https://phabricator.wikimedia.org/T237749)
[03:05:59] (CR) jerkins-bot: [V: -1] Openstack keystone: add more missing manifests for Ocata [puppet] - https://gerrit.wikimedia.org/r/553877 (https://phabricator.wikimedia.org/T237749) (owner: Andrew Bogott)
[03:08:17] (CR) Andrew Bogott: [V: +2 C: +2] "overriding lint check" [puppet] - https://gerrit.wikimedia.org/r/553877 (https://phabricator.wikimedia.org/T237749) (owner: Andrew Bogott)
[03:08:34] (CR) Andrew Bogott: [C: +2] Openstack Neutron: add missing files for Openstack Ocata [puppet] - https://gerrit.wikimedia.org/r/553878 (https://phabricator.wikimedia.org/T237749) (owner: Andrew Bogott)
[03:16:24] (PS3) Andrew Bogott: codfw1dev: upgrade nova, keystone and glance to Ocata [puppet] - https://gerrit.wikimedia.org/r/553876
[03:16:26] (PS1) Andrew Bogott: Openstack nova: remove unused config files [puppet] - https://gerrit.wikimedia.org/r/553879
[03:20:38] (CR) Andrew Bogott: [C: +2] Openstack nova: remove unused config files [puppet] - https://gerrit.wikimedia.org/r/553879 (owner: Andrew Bogott)
[04:13:22] (PS4) Andrew Bogott: codfw1dev: upgrade nova, keystone and glance to Ocata [puppet] - https://gerrit.wikimedia.org/r/553876
[04:13:24] (PS1) Andrew Bogott: keystone-paste: remove refs to simple_cert_extension [puppet] - https://gerrit.wikimedia.org/r/553880
[04:13:26] (PS1) Andrew Bogott: nova config: catch up with a few deprecated options [puppet] - https://gerrit.wikimedia.org/r/553881
[04:18:12] (PS2) Andrew Bogott: nova config: catch up with a few deprecated options [puppet] - https://gerrit.wikimedia.org/r/553881
[04:18:15] (PS2) Andrew Bogott: keystone-paste: remove refs to simple_cert_extension [puppet] - https://gerrit.wikimedia.org/r/553880
[04:18:17] (PS5) Andrew Bogott: codfw1dev: upgrade nova, keystone and glance to Ocata [puppet] - https://gerrit.wikimedia.org/r/553876
[04:32:39] (PS3) Andrew Bogott: keystone-paste: remove refs to simple_cert_extension [puppet] - https://gerrit.wikimedia.org/r/553880
[04:32:41] (PS6) Andrew Bogott: codfw1dev: upgrade nova, keystone and glance to Ocata [puppet] - https://gerrit.wikimedia.org/r/553876
[04:32:43] (PS1) Andrew Bogott: designate: move/clean up some deprecated config options [puppet] - https://gerrit.wikimedia.org/r/553882
[04:32:45] (PS1) Andrew Bogott: Neutron l3_agent: remove external_network_bridge config option [puppet] - https://gerrit.wikimedia.org/r/553883
[04:33:33] (CR) Andrew Bogott: [C: +2] nova config: catch up with a few deprecated options [puppet] - https://gerrit.wikimedia.org/r/553881 (owner: Andrew Bogott)
[04:35:08] (CR) Andrew Bogott: "Arturo -- since this is currently unset it seems like just removing it will be harmless, but I'd like your opinion before I merge." [puppet] - https://gerrit.wikimedia.org/r/553883 (owner: Andrew Bogott)
[04:41:24] hi! someone told me to report here to update the wikimedia blockage in prc
[04:41:33] is a staff available?
[04:42:26] emfipp_: Its sunday evening so not many folks are around, however, staff people do monitor this channel, and will most likely see whatever you say come monday
[04:43:43] I believe 198.35.26.0/23 is blackholed from this endpoint
[04:43:56] as nothing ever gets back, even icmp
[04:44:53] so domain fronting no longer works if we file nl.wikipedia.org as the Host header inside a TLS connection with nl.wikisource.org as SNI
[04:45:15] this is the only subnet tested
[04:45:39] and I suspect the blockage will extend to other subnets under as 14907
[04:47:08] and consider it a request: rent some cloudflare/azure/aws service for more domain-fronting
[05:52:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1099:3318 for compression', diff saved to https://phabricator.wikimedia.org/P9790 and previous config saved to /var/cache/conftool/dbconfig/20191202-055245-marostegui.json
[05:52:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:53:27] !log Compress db1099:3318 T235599
[05:53:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:53:32] T235599: Recompress special slaves across eqiad and codfw - https://phabricator.wikimedia.org/T235599
[05:55:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1075 for schema change', diff saved to https://phabricator.wikimedia.org/P9791 and previous config saved to /var/cache/conftool/dbconfig/20191202-055546-marostegui.json
[05:55:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:56:19] !log Deploy schema change on db1075
[05:56:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:23] !log Compress s4 codfw master (lag might appear on codfw s4)
[06:00:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:47] !log Compress s8 codfw master (lag might appear on codfw s8)
[06:00:50] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:05:14] (CR) Andrew Bogott: [C: +2] designate: move/clean up some deprecated config options [puppet] - https://gerrit.wikimedia.org/r/553882 (owner: Andrew Bogott)
[06:05:34] (PS1) Marostegui: mariadb: Reimage db1107 and place it as core_test [puppet] - https://gerrit.wikimedia.org/r/553884 (https://phabricator.wikimedia.org/T238113)
[06:08:45] (CR) Marostegui: "https://puppet-compiler.wmflabs.org/compiler1003/19695/" [puppet] - https://gerrit.wikimedia.org/r/553884 (https://phabricator.wikimedia.org/T238113) (owner: Marostegui)
[06:18:34] (PS1) Andrew Bogott: Revert "designate: move/clean up some deprecated config options" [puppet] - https://gerrit.wikimedia.org/r/553885
[06:19:26] (CR) Andrew Bogott: [C: +2] Revert "designate: move/clean up some deprecated config options" [puppet] - https://gerrit.wikimedia.org/r/553885 (owner: Andrew Bogott)
[06:24:06] (PS7) Andrew Bogott: codfw1dev: upgrade nova, keystone and glance to Ocata [puppet] - https://gerrit.wikimedia.org/r/553876
[06:24:08] (PS4) Andrew Bogott: keystone-paste: remove refs to simple_cert_extension [puppet] - https://gerrit.wikimedia.org/r/553880
[06:24:10] (PS2) Andrew Bogott: Neutron l3_agent: remove external_network_bridge config option [puppet] - https://gerrit.wikimedia.org/r/553883
[06:31:27] jouncebot: now
[06:31:28] No deployments scheduled for the next 4 hour(s) and 58 minute(s)
[06:31:40] jouncebot: next
[06:31:41] In 4 hour(s) and 58 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191202T1130)
[06:33:55] (CR) Urbanecm: [C: +2] "throttle rule needed ASAP" [mediawiki-config] - https://gerrit.wikimedia.org/r/553787 (https://phabricator.wikimedia.org/T239465) (owner: Zoranzoki21)
[06:34:45] (Merged) jenkins-bot: Add throttle rule for cawiki workshop on 2019-12-02 [mediawiki-config] -
https://gerrit.wikimedia.org/r/553787 (https://phabricator.wikimedia.org/T239465) (owner: Zoranzoki21)
[06:38:43] !log urbanecm@deploy1001 Synchronized wmf-config/throttle.php: New throttle rule for cawiki workshop (T239465) (duration: 01m 03s)
[06:38:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:38:48] T239465: Request: temporary lift of IP cap in cawiki on 2019-12-02 - https://phabricator.wikimedia.org/T239465
[06:43:32] !log Clear account creation throttle for several IPs (T239465)
[06:43:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:05] !log forcing a reboot of cloudstore1008 via mgmt console — it seems to have locked up
[07:18:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:34:52] (PS3) Ema: ATS: re-use origin server connections for matching IPs [puppet] - https://gerrit.wikimedia.org/r/553490 (https://phabricator.wikimedia.org/T238494)
[07:44:19] (PS1) Muehlenhoff: Remove access for dfoy [puppet] - https://gerrit.wikimedia.org/r/553893
[07:47:45] (PS2) Marostegui: mariadb: Reimage db1107 and place it as core_test [puppet] - https://gerrit.wikimedia.org/r/553884 (https://phabricator.wikimedia.org/T238113)
[07:48:43] (CR) Marostegui: [C: +2] mariadb: Reimage db1107 and place it as core_test [puppet] - https://gerrit.wikimedia.org/r/553884 (https://phabricator.wikimedia.org/T238113) (owner: Marostegui)
[07:50:18] (PS2) Muehlenhoff: Remove access for dfoy [puppet] - https://gerrit.wikimedia.org/r/553893
[07:53:14] (CR) Muehlenhoff: [C: +2] Remove access for dfoy [puppet] - https://gerrit.wikimedia.org/r/553893 (owner: Muehlenhoff)
[07:58:11] (PS1) Ema: ATS: add trafficserver_backend_connections_count [puppet] - https://gerrit.wikimedia.org/r/553894 (https://phabricator.wikimedia.org/T238494)
[08:00:09] (CR) jerkins-bot: [V: -1] ATS: add trafficserver_backend_connections_count [puppet] -
https://gerrit.wikimedia.org/r/553894 (https://phabricator.wikimedia.org/T238494) (owner: Ema)
[08:00:59] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime
[08:01:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:01:34] Operations, Traffic: Improve ATS prometheus metrics - https://phabricator.wikimedia.org/T231533 (ema) Open→Resolved a:ema The version of trafficserver-prometheus-exporter running on the cluster supports custom metrics, and we ship our own metric files. Closing!
[08:03:03] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:03:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:22] (PS2) Ema: ATS: add trafficserver_backend_connections_count [puppet] - https://gerrit.wikimedia.org/r/553894 (https://phabricator.wikimedia.org/T238494)
[08:06:28] (PS1) Marostegui: control-mariadb-10.3*: Upgrade version [software] - https://gerrit.wikimedia.org/r/553895
[08:07:31] (CR) Marostegui: [C: -2] "Not ready yet" [software] - https://gerrit.wikimedia.org/r/553895 (owner: Marostegui)
[08:08:37] !log reimage mw1301.eqiad.wmnet
[08:08:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:08] Operations, serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mw1301.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/201912020808_jiji_185958.log`.
[08:12:10] (PS8) Andrew Bogott: codfw1dev: upgrade nova, keystone and glance to Ocata [puppet] - https://gerrit.wikimedia.org/r/553876
[08:12:12] (PS5) Andrew Bogott: keystone-paste: remove refs to simple_cert_extension [puppet] - https://gerrit.wikimedia.org/r/553880
[08:12:14] (PS3) Andrew Bogott: Neutron l3_agent: remove external_network_bridge config option [puppet] - https://gerrit.wikimedia.org/r/553883
[08:12:16] (PS1) Andrew Bogott: nova: further reduce number of active workers [puppet] - https://gerrit.wikimedia.org/r/553896
[08:13:44] (CR) Andrew Bogott: [C: +2] nova: further reduce number of active workers [puppet] - https://gerrit.wikimedia.org/r/553896 (owner: Andrew Bogott)
[08:14:37] !log reimage mw1287.eqiad.wmnet mw1288.eqiad.wmnet mw1289.eqiad.wmnet
[08:14:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:15:20] Operations, serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mw1287.eqiad.wmnet', 'mw1288.eqiad.wmnet', 'mw1289.eqiad.wmnet'] ` The log can be found in `/var/log/...
[08:15:41] jouncebot: next
[08:15:41] In 3 hour(s) and 14 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191202T1130)
[08:18:09] ACKNOWLEDGEMENT - Check systemd state on wdqs1010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Gehel experimenting with data reload https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:21:09] (PS1) DannyS712: Create translation namespace on nap.wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/553929 (https://phabricator.wikimedia.org/T239547)
[08:22:14] Operations, Traffic: 404 loading images from Virgin Media - https://phabricator.wikimedia.org/T161360 (ema) Open→Invalid >>!
In T161360#5701799, @Aklapper wrote: > @ema: I don't understand how a task about an issue which happened 30 months ago and we're unsure if there is still a problem can have...
[08:23:38] (PS2) DannyS712: Create translation namespace on nap.wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/553929 (https://phabricator.wikimedia.org/T239547)
[08:28:17] (PS1) Marostegui: packages_wmf.pp: Install libjemalloc2 on buster [puppet] - https://gerrit.wikimedia.org/r/554014
[08:28:53] (CR) jerkins-bot: [V: -1] packages_wmf.pp: Install libjemalloc2 on buster [puppet] - https://gerrit.wikimedia.org/r/554014 (owner: Marostegui)
[08:30:17] (PS2) Marostegui: packages_wmf.pp: Install libjemalloc2 on buster [puppet] - https://gerrit.wikimedia.org/r/554014
[08:30:54] (CR) jerkins-bot: [V: -1] packages_wmf.pp: Install libjemalloc2 on buster [puppet] - https://gerrit.wikimedia.org/r/554014 (owner: Marostegui)
[08:31:15] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime
[08:31:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:00] (PS3) Marostegui: packages_wmf.pp: Install libjemalloc2 on buster [puppet] - https://gerrit.wikimedia.org/r/554014
[08:33:24] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:33:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:37:24] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime
[08:37:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:39:31] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:39:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:29] Operations, MediaWiki-API, Traffic, observability, and 2 others: Set up action API latency / error rate metrics & alerts - https://phabricator.wikimedia.org/T123854 (Joe) Open→Resolved a:Joe This has been resolved for some time:
https://grafana.wikimedia.org/d/RIA1lzDZk/application-s...
[08:47:50] (CR) Marostegui: [C: +2] packages_wmf.pp: Install libjemalloc2 on buster [puppet] - https://gerrit.wikimedia.org/r/554014 (owner: Marostegui)
[08:58:33] (CR) Filippo Giunchedi: "LGTM, modulo metric name" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/553894 (https://phabricator.wikimedia.org/T238494) (owner: Ema)
[09:08:57] (PS1) Elukey: Add a cookbook to roll restart the AQS nodejs service [cookbooks] - https://gerrit.wikimedia.org/r/554023
[09:09:30] (PS3) Ema: ATS: add trafficserver_backend_connections_total [puppet] - https://gerrit.wikimedia.org/r/553894 (https://phabricator.wikimedia.org/T238494)
[09:09:40] (PS2) Elukey: Add a cookbook to roll restart the AQS nodejs service [cookbooks] - https://gerrit.wikimedia.org/r/554023
[09:10:22] (CR) Filippo Giunchedi: [C: +1] ATS: add trafficserver_backend_connections_total [puppet] - https://gerrit.wikimedia.org/r/553894 (https://phabricator.wikimedia.org/T238494) (owner: Ema)
[09:10:50] the consideration was prc could have blackholed the subnets
[09:10:59] (PS3) Elukey: Add a cookbook to roll restart the AQS nodejs service [cookbooks] - https://gerrit.wikimedia.org/r/554023
[09:11:10] should the ssl gateway deploy tls 1.3, which encrypts sni
[09:11:31] now that these folks are blackholing subnets anyway
[09:11:41] the point is now moot
[09:12:04] so please get us tls 1.3 on on wikimedia foundation servers!
[09:13:02] (CR) Ema: [C: +2] ATS: add trafficserver_backend_connections_total [puppet] - https://gerrit.wikimedia.org/r/553894 (https://phabricator.wikimedia.org/T238494) (owner: Ema)
[09:14:08] !log extend graphite LVs on graphite1004 / graphite2003 by 200G
[09:14:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:46] Operations, serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1301.eqiad.wmnet'] ` and were **ALL** successful.
[09:16:07] !log installing libvpx security updates
[09:16:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:18:33] (CR) Jcrespo: "-1, the dependency should be removed, as it is already declared on the package. This was still here while there were jessie packages aroun" [puppet] - https://gerrit.wikimedia.org/r/554014 (owner: Marostegui)
[09:19:08] (CR) Marostegui: [C: +2] "> -1, the dependency should be removed, as it is already declared on" [puppet] - https://gerrit.wikimedia.org/r/554014 (owner: Marostegui)
[09:22:21] !log joal@deploy1001 Started deploy [analytics/refinery@8991301]: Regular analytics deploy - late from last week
[09:22:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:23:46] !log reimage mw1300.eqiad.wmne
[09:23:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:23:53] !log reimage mw1300.eqiad.wmnet
[09:23:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:20] (PS1) Marostegui: packages_wmf.pp: Remove libaio1, libjemalloc from requires [puppet] - https://gerrit.wikimedia.org/r/554027
[09:24:23] Operations, serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mw1300.eqiad.wmnet'] `
The log can be found in `/var/log/wmf-auto-reimage/201912020924_jiji_202065.log`.
[09:24:26] PROBLEM - Host mw2223 is DOWN: PING CRITICAL - Packet loss = 100%
[09:25:10] RECOVERY - Host mw2223 is UP: PING OK - Packet loss = 0%, RTA = 36.15 ms
[09:33:28] (CR) Marostegui: "https://puppet-compiler.wmflabs.org/compiler1002/19699/db1089.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/554027 (owner: Marostegui)
[09:40:44] !log joal@deploy1001 Finished deploy [analytics/refinery@8991301]: Regular analytics deploy - late from last week (duration: 18m 22s)
[09:40:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:41:11] !log joal@deploy1001 Started deploy [analytics/refinery@8991301] (thin): Regular analytics deploy - late from last week (thin)
[09:41:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:41:19] !log joal@deploy1001 Finished deploy [analytics/refinery@8991301] (thin): Regular analytics deploy - late from last week (thin) (duration: 00m 08s)
[09:41:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:41:36] (PS2) Mathew.onipe: query_service: use the correct script for autodeployment [puppet] - https://gerrit.wikimedia.org/r/553063
[09:42:07] (CR) Muehlenhoff: "Same probably also for libaio1?" [puppet] - https://gerrit.wikimedia.org/r/554014 (owner: Marostegui)
[09:42:33] (CR) Marostegui: [C: +2] "> Same probably also for libaio1?" [puppet] - https://gerrit.wikimedia.org/r/554014 (owner: Marostegui)
[09:45:20] Operations, ops-codfw: Decommission old mw2231/WMF6435 replaced with WMF6403 - https://phabricator.wikimedia.org/T232126 (Volans) @Papaul thanks, just a small detail, I've deleted also the 'mgmt' interface from 'mw2231 old' ( https://netbox.wikimedia.org/dcim/devices/1185/ ) given that it's offline (unra...
[09:45:43] (CR) Gehel: [C: +2] query_service: use the correct script for autodeployment [puppet] - https://gerrit.wikimedia.org/r/553063 (owner: Mathew.onipe)
[09:46:34] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime
[09:46:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:48:41] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/554027 (owner: Marostegui)
[09:49:05] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:49:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:49:24] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime
[09:49:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:49:57] (CR) Marostegui: [C: +2] packages_wmf.pp: Remove libaio1, libjemalloc from requires [puppet] - https://gerrit.wikimedia.org/r/554027 (owner: Marostegui)
[09:50:31] (CR) Jcrespo: [C: +1] packages_wmf.pp: Remove libaio1, libjemalloc from requires [puppet] - https://gerrit.wikimedia.org/r/554027 (owner: Marostegui)
[09:51:32] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:51:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:03] !log swift eqiad-prod: more weight to ms-be105[7-9] - T237438
[09:52:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:08] T237438: (Need By 8/15/19) rack/setup/install ms-be105[7-9].eqiad.wmnet - https://phabricator.wikimedia.org/T237438
[10:02:25] jouncebot: next
[10:02:26] In 1 hour(s) and 27 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191202T1130)
[10:02:46] PROBLEM - MegaRAID on analytics1057 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough,
WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[10:05:33] (CR) Ema: [C: -1] Public cache routing for eventgate-logging-external (1 comment) [puppet] - https://gerrit.wikimedia.org/r/551247 (https://phabricator.wikimedia.org/T236386) (owner: Ottomata)
[10:07:44] downtime expired --^
[10:08:04] RECOVERY - Check whether microcode mitigations for CPU vulnerabilities are applied on mw2223 is OK: OK - All expected CPU flags found https://wikitech.wikimedia.org/wiki/Microcode
[10:08:13] (CR) Volans: "LGTM, one possible improvement inline" (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/554023 (owner: Elukey)
[10:08:20] !log filippo@cumin1001 conftool action : set/pooled=no; selector: name=mw2229.codfw.wmnet
[10:08:20] elukey: <3 for the cookbook!
[10:08:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:12:13] Operations, ops-codfw: Decommission old mw2231/WMF6435 replaced with WMF6403 - https://phabricator.wikimedia.org/T232126 (Volans) Forgot to mention that https://netbox.wikimedia.org/ipam/ip-addresses/687/ had still the old name, I've updated it.
[10:15:36] !log installing file/libmagic regresssion update for jessie
[10:15:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:16:56] Operations, SRE-Access-Requests: Requesting access to LogStash for rxy - https://phabricator.wikimedia.org/T239494 (jbond) p:Triage→Normal
[10:18:12] !log reimage mw1275.eqiad.wmnet
[10:18:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:18:47] !log reimage mw1290.eqiad.wmnet
[10:18:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:19:32] Operations, serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mw1275.eqiad.wmnet', 'mw1290.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/20191...
[10:23:27] PROBLEM - Host cp3053 is DOWN: PING CRITICAL - Packet loss = 100%
[10:23:36] Operations, SRE-Access-Requests: Requesting access to LogStash for rxy - https://phabricator.wikimedia.org/T239494 (jbond) @Rxy Before enabling access we will first need to ensure we have a valid signed [[ https://wikitech.wikimedia.org/wiki/Volunteer_NDA | NDA ]] on record for you. @RStallman-legalte...
[10:24:28] Operations, Traffic: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (Marostegui) And `[10:23:27] <+icinga-wm> PROBLEM - Host cp3053 is DOWN: PING CRITICAL - Packet loss = 100%` which already failed: T239041
[10:25:06] Operations, ops-esams, Traffic: cp3053 is unreachable - https://phabricator.wikimedia.org/T239041 (Marostegui) Resolved→Open This host went down again: ` And [10:23:27] <+icinga-wm> PROBLEM - Host cp3053 is DOWN: PING CRITICAL - Packet loss = 100% `
[10:25:09] Operations, Traffic: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (Marostegui)
[10:26:18] (CR) Elukey: Add a cookbook to roll restart the AQS nodejs service (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/554023 (owner: Elukey)
[10:26:40] Operations, SRE-Access-Requests: Requesting access to LogStash for rxy - https://phabricator.wikimedia.org/T239494 (Urbanecm) Support.
[10:27:08] Operations, Release-Engineering-Team-TODO, Jenkins, Release-Engineering-Team (CI & Testing services): Add latest jenkins debian packages to apt.wikimedia.org and upgrade jenkins to latest LTS (2.190.3) - https://phabricator.wikimedia.org/T239586 (hashar)
[10:29:37] Operations, serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1300.eqiad.wmnet'] ` and were **ALL** successful.
[10:37:43] (PS1) Jdrewniak: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/554033 (https://phabricator.wikimedia.org/T128546)
[10:40:45] (PS1) Jbond: "package_builder: clean up build and results directory"" [puppet] - https://gerrit.wikimedia.org/r/554035
[10:42:38] !log reimage mw1299.eqiad.wmnet
[10:42:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:42:43] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime
[10:42:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:42:54] (PS4) Elukey: Add a cookbook to roll restart the AQS nodejs service [cookbooks] - https://gerrit.wikimedia.org/r/554023
[10:43:01] Operations, serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mw1299.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/201912021042_jiji_219639.log`.
[10:43:26] !log installing python-psutil security updates
[10:43:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:44:55] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[10:44:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:45:10] (PS2) Jbond: "package_builder: clean up build and results directory"" [puppet] - https://gerrit.wikimedia.org/r/554035
[10:46:34] (PS3) Jbond: package_builder: clean up build and results directory [puppet] - https://gerrit.wikimedia.org/r/554035 (https://phabricator.wikimedia.org/T237713)
[10:47:31] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/554035 (https://phabricator.wikimedia.org/T237713) (owner: Jbond)
[10:47:46] (CR) Jbond: [C: +2] package_builder: clean up build and results directory [puppet] - https://gerrit.wikimedia.org/r/554035 (https://phabricator.wikimedia.org/T237713) (owner: Jbond)
[10:52:38] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime
[10:52:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:54:48] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[10:54:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:57:19] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime
[10:57:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:58:08] (PS1) Alexandros Kosiaris: kube-proxy: Allow overriding metrics bind address [puppet] - https://gerrit.wikimedia.org/r/554036
[10:58:26] (PS1) Jbond: package_builder: add leading 0 to time spec [puppet] - https://gerrit.wikimedia.org/r/554037
[10:59:25] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[10:59:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:00:32] (CR) Jbond: [C: +2] apereo_cas: add localhost to list of allowed
prometheus scrappers [puppet] - https://gerrit.wikimedia.org/r/553750 (https://phabricator.wikimedia.org/T233934) (owner: Jbond)
[11:01:11] (CR) Jbond: [C: +2] package_builder: add leading 0 to time spec [puppet] - https://gerrit.wikimedia.org/r/554037 (owner: Jbond)
[11:03:27] !log installing ruby2.1 security updates
[11:03:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:03:41] (PS2) Alexandros Kosiaris: kube-proxy: Allow overriding metrics bind address [puppet] - https://gerrit.wikimedia.org/r/554036
[11:03:43] (PS1) Alexandros Kosiaris: prometheus: Scrape kube-proxy metrics [puppet] - https://gerrit.wikimedia.org/r/554038
[11:05:07] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime
[11:05:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:07:16] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[11:07:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:11:43] Operations, ops-eqiad, Analytics, Analytics-Kanban, netops: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (Volans) I've updated the mgmt interface's DNS names on Netbox that were still reporting the old names `cloud...
[11:12:52] PROBLEM - Check the last execution of package_builder: Clean up build directory on boron is CRITICAL: NRPE: Command check_check_package_builder: not defined https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [11:14:42] (03PS1) 10Jbond: package_builder: cant use spaces in job title [puppet] - 10https://gerrit.wikimedia.org/r/554040 [11:15:14] PROBLEM - Check the last execution of package_builder: Clean up result directory on boron is CRITICAL: NRPE: Command check_check_package_builder: not defined https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [11:15:46] PROBLEM - DPKG on kubestagetcd1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:15:51] (03CR) 10Jbond: [C: 03+2] package_builder: cant use spaces in job title [puppet] - 10https://gerrit.wikimedia.org/r/554040 (owner: 10Jbond) [11:20:57] 10Operations, 10observability, 10User-fgiunchedi: Reimage wezen to Stretch or Buster (and rename to centrallog2001) - https://phabricator.wikimedia.org/T224564 (10Volans) I've updated the mgmt DNS name record in Netbox that was still reporting wezen. I've also a patch to cleanup the wezen record from DNS, wi... [11:23:54] (03PS1) 10Arturo Borrero Gonzalez: dynamicproxy: add backend information to access log entries [puppet] - 10https://gerrit.wikimedia.org/r/554041 (https://phabricator.wikimedia.org/T238641) [11:30:04] jan_drewniak: How many deployers does it take to do Wikimedia Portals Update deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191202T1130). 
[11:31:06] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554033 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [11:31:56] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554033 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [11:32:47] (03CR) 10Arturo Borrero Gonzalez: "Please double check that I'm not breaking something by changing the log format." [puppet] - 10https://gerrit.wikimedia.org/r/554041 (https://phabricator.wikimedia.org/T238641) (owner: 10Arturo Borrero Gonzalez) [11:34:16] (03CR) 10Jbond: [C: 03+2] profile::prometheus::ops: add scraper for apero_cas idp service [puppet] - 10https://gerrit.wikimedia.org/r/553743 (https://phabricator.wikimedia.org/T233934) (owner: 10Jbond) [11:37:12] !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:554033| Bumping portals to master (T128546)]] (duration: 01m 04s) [11:37:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:17] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [11:38:13] !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:554033| Bumping portals to master (T128546)]] (duration: 01m 00s) [11:38:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:15] (03PS9) 10Andrew Bogott: codfw1dev: upgrade nova, keystone and glance to Ocata [puppet] - 10https://gerrit.wikimedia.org/r/553876 [11:41:44] (03CR) 10Andrew Bogott: [C: 03+2] codfw1dev: upgrade nova, keystone and glance to Ocata [puppet] - 10https://gerrit.wikimedia.org/r/553876 (owner: 10Andrew Bogott) [11:46:47] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` 
['mw1287.eqiad.wmnet', 'mw1288.eqiad.wmnet', 'mw1289.eqiad.wmnet'] ` and were **ALL** successful. [11:46:54] (03PS4) 10Filippo Giunchedi: install_server: apply standard partman recipes, take #1 [puppet] - 10https://gerrit.wikimedia.org/r/553483 (https://phabricator.wikimedia.org/T156955) [11:46:56] (03PS1) 10Filippo Giunchedi: role: extend centrallog's /srv if needed [puppet] - 10https://gerrit.wikimedia.org/r/554044 (https://phabricator.wikimedia.org/T156955) [11:47:47] (03PS1) 10Jbond: profile::prometheus::ops: add tls cname [puppet] - 10https://gerrit.wikimedia.org/r/554045 (https://phabricator.wikimedia.org/T233934) [11:48:34] Hallo [11:48:57] Is this a good place to ask about WMF-JobQueue? [11:49:16] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1299.eqiad.wmnet'] ` and were **ALL** successful. [11:49:16] There's this bug, and I'm not sure who to ping about it: https://phabricator.wikimedia.org/T239394 [11:50:56] !log filippo@cumin1001 conftool action : set/pooled=yes; selector: name=mw2229.codfw.wmnet [11:51:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:30] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10fgiunchedi) [11:51:34] (03PS6) 10Andrew Bogott: keystone-paste: remove refs to simple_cert_extension [puppet] - 10https://gerrit.wikimedia.org/r/553880 [11:51:36] (03PS4) 10Andrew Bogott: Neutron l3_agent: remove external_network_bridge config option [puppet] - 10https://gerrit.wikimedia.org/r/553883 [11:51:38] (03PS1) 10Andrew Bogott: openstack admin_scripts: stop installing wmcs-nova-quota-sync [puppet] - 10https://gerrit.wikimedia.org/r/554046 [11:53:13] (03CR) 10Andrew Bogott: [C: 03+2] openstack admin_scripts: stop installing wmcs-nova-quota-sync [puppet] - 10https://gerrit.wikimedia.org/r/554046 (owner: 10Andrew Bogott) [11:54:40] (03PS2) 
10Jbond: profile::prometheus::ops: add tls server name [puppet] - 10https://gerrit.wikimedia.org/r/554045 (https://phabricator.wikimedia.org/T233934) [11:56:33] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [11:56:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:30] 10Operations, 10observability, 10Patch-For-Review, 10User-fgiunchedi: Provide authenticated access to Prometheus native web interface - https://phabricator.wikimedia.org/T151009 (10jbond) Im tempted to add this directly to apereo cas (time permitting) however im curious what you had in mind for the service... [11:58:41] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:58:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191202T1200). [12:00:05] kostajh and Zoranzoki21: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[12:00:24] once you all are done, I have something to deploy [12:00:30] PROBLEM - mediawiki-installation DSH group on mw1288 is CRITICAL: Host mw1288 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [12:01:23] o/ [12:01:49] I’m only free for 45 minutes, but I can start the SWAT [12:01:59] * Lucas_WMDE looks at kostajh’s change [12:03:08] (03CR) 10Filippo Giunchedi: [C: 03+1] profile::prometheus::ops: add tls server name [puppet] - 10https://gerrit.wikimedia.org/r/554045 (https://phabricator.wikimedia.org/T233934) (owner: 10Jbond) [12:03:23] Lucas_WMDE: Hi, sorry, I'm here now [12:03:30] I don’t fully understand it, but I guess since it’s a backport I don’t have to :) [12:04:58] Lucas_WMDE: tl;dr, we load page view data on "suggested edits" cards, but we don't want the absence of that data to cause an error and result in the card not showing. The backport patch fixes that on wmf.8 [12:05:24] do you know how long gate-and-submit usually takes for that extension? [12:05:33] zuul says 6min, hm [12:06:03] Umm, more like 20 minutes IIRC [12:06:14] ok [12:06:37] I was thinking about running the maintenance script mentioned in the deploy calendar in the meantime [12:06:43] but it looks like Zoranzoki21 isn’t here yet [12:06:48] Oh, or ~9 minutes. https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/553412#message-6df678e2a0c97e08ad5626d0aed9a5fbc826b5d3 [12:06:52] so let’s just wait for now [12:07:14] Cool, no rush on my part [12:07:43] (03CR) 10Jbond: [C: 03+2] profile::prometheus::ops: add tls server name [puppet] - 10https://gerrit.wikimedia.org/r/554045 (https://phabricator.wikimedia.org/T233934) (owner: 10Jbond) [12:07:52] will you be able to test your backport on mwdebug? [12:07:58] or is it an error that only happens occasionally?
[12:08:40] Lucas_WMDE: yes I can test it [12:08:48] great :) [12:09:19] Lucas_WMDE: I'm not actively looking at this channel so just mention me when you're ready for me to test it. danke! [12:09:26] will do! [12:16:54] kostajh: your change should be on mwdebug1001, please test :) [12:17:01] Lucas_WMDE: cool, looking [12:17:28] Lucas_WMDE: I see you're here only for some time, feel free to ping me if you need to hand-over the window [12:17:41] well, unless Zoranzoki21 shows up there won’t be anything else to deploy [12:17:44] but thanks! [12:17:55] Lucas_WMDE: looks good to me [12:17:59] ok, syncing [12:19:50] !log lucaswerkmeister-wmde@deploy1001 Synchronized php-1.35.0-wmf.8/extensions/GrowthExperiments/: SWAT: [[gerrit:553402|Suggested edits: do not treat AQS lookup failure as error (T238178)]] (duration: 01m 02s) [12:19:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:55] T238178: Newcomer tasks: pageview count not appearing or inconsistent - https://phabricator.wikimedia.org/T238178 [12:20:14] (03CR) 10Muehlenhoff: install_server: apply standard partman recipes, take #1 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/553483 (https://phabricator.wikimedia.org/T156955) (owner: 10Filippo Giunchedi) [12:20:40] Amir1: feel free to go ahead with your deploy [12:20:42] 10Operations, 10SRE-swift-storage, 10observability: swift backend decomms / rebalances are noisy - https://phabricator.wikimedia.org/T221904 (10fgiunchedi) AFAICS through the latest rebalances we haven't observed any alerts, possibly also due to using multiple servers per port (T222366) [12:20:52] Thanks! [12:23:06] Lucas_WMDE: okay. 
[12:23:19] (03PS1) 10Ladsgroup: Set read new for term store for items of wikidata up to Q1000 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554049 (https://phabricator.wikimedia.org/T225057) [12:23:21] Amir1: once you're done, please let me know, I'll add some last-time things into the calendar [12:23:42] Sure [12:23:59] (03CR) 10Ladsgroup: [C: 03+2] Set read new for term store for items of wikidata up to Q1000 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554049 (https://phabricator.wikimedia.org/T225057) (owner: 10Ladsgroup) [12:24:47] (03Merged) 10jenkins-bot: Set read new for term store for items of wikidata up to Q1000 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554049 (https://phabricator.wikimedia.org/T225057) (owner: 10Ladsgroup) [12:26:12] 10Operations, 10Patch-For-Review: Standardizing our partman recipes - https://phabricator.wikimedia.org/T156955 (10fgiunchedi) [12:28:38] works fine at mwdebug1002. moving forward [12:29:18] mwdebug1002? [12:29:41] not 1001? [12:29:57] the issue has been resolved with eating errors [12:30:17] oh, ok [12:30:45] I thought it was still down because I couldn’t connect to it earlier, but I think that was just because it wasn’t in my known_hosts [12:31:09] yeah, it got reimaged to buster recently [12:31:37] marostegui: I just deployed something that might increase read on s8 for wb_terms stuff [12:31:46] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[gerrit:554049|Set read new for term store for items of wikidata up to Q1000 (T225057)]] (duration: 01m 00s) [12:31:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:52] T225057: Switch `tmpItemTermsMigrationStages` to MIGRATION_WRITE_NEW - https://phabricator.wikimedia.org/T225057 [12:32:56] Urbanecm: I'm done for now [12:33:16] still no Zoranzoki21 to be seen [12:33:36] should we keep the SWAT open or close it?
[12:33:38] Amir1: ack, thanks [12:33:58] Lucas_WMDE: I'm going to do some stuff [12:34:04] oh right [12:34:06] ok [12:34:16] (03CR) 10Urbanecm: [C: 03+2] Revert "Change bawiki logo to an anniversary one" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547875 (https://phabricator.wikimedia.org/T237070) (owner: 10Urbanecm) [12:35:43] (03PS4) 10Urbanecm: Enable partial blocks on eswiki and scowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/553431 (https://phabricator.wikimedia.org/T239370) (owner: 10DannyS712) [12:36:51] (03Merged) 10jenkins-bot: Revert "Change bawiki logo to an anniversary one" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547875 (https://phabricator.wikimedia.org/T237070) (owner: 10Urbanecm) [12:37:23] !log reimage mw1298.eqiad.wmnet [12:37:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:37] !log reimage mw1296.eqiad.wmnet [12:37:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:18] (03CR) 10Urbanecm: "> Patch Set 3: Code-Review+1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/553431 (https://phabricator.wikimedia.org/T239370) (owner: 10DannyS712) [12:38:41] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mw1296.eqiad.wmnet', 'mw1298.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/20191... 
[12:39:35] !log urbanecm@deploy1001 Synchronized static/images/project-logos/: SWAT: 61a9563: Revert "Change bawiki logo to an anniversary one" (T237070) (duration: 01m 06s) [12:39:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:41] T237070: Revert bawiki's anniversary logo to normal one - https://phabricator.wikimedia.org/T237070 [12:40:32] jouncebot: next [12:40:32] In 5 hour(s) and 19 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191202T1800) [12:40:36] (03PS6) 10Urbanecm: Remove `move-rootuserpages` from user on svwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552374 (https://phabricator.wikimedia.org/T238842) (owner: 10DannyS712) [12:40:42] (03CR) 10Urbanecm: [C: 03+2] Remove `move-rootuserpages` from user on svwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552374 (https://phabricator.wikimedia.org/T238842) (owner: 10DannyS712) [12:42:06] (03Merged) 10jenkins-bot: Remove `move-rootuserpages` from user on svwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552374 (https://phabricator.wikimedia.org/T238842) (owner: 10DannyS712) [12:43:16] !log Purge https://en.wikipedia.org/static/images/project-logos/bawiki*.png [12:43:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:40] (03PS5) 10Urbanecm: Enable partial blocks on eswiki and scowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/553431 (https://phabricator.wikimedia.org/T239370) (owner: 10DannyS712) [12:45:42] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 445bdc3: Remove `move-rootuserpages` from user on svwiki (T238842) (duration: 01m 04s) [12:45:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:47] T238842: Remove move-rootuserpages from user on svwiki - https://phabricator.wikimedia.org/T238842 [12:46:20] (03PS6) 10Urbanecm: Enable partial blocks on eswiki 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/553431 (https://phabricator.wikimedia.org/T239370) (owner: 10DannyS712) [12:46:31] (03PS7) 10Urbanecm: Enable partial blocks on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/553431 (https://phabricator.wikimedia.org/T239370) (owner: 10DannyS712) [12:46:43] (03CR) 10Urbanecm: [C: 03+2] Enable partial blocks on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/553431 (https://phabricator.wikimedia.org/T239370) (owner: 10DannyS712) [12:46:57] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1275.eqiad.wmnet', 'mw1290.eqiad.wmnet'] ` and were **ALL** successful. [12:47:40] (03Merged) 10jenkins-bot: Enable partial blocks on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/553431 (https://phabricator.wikimedia.org/T239370) (owner: 10DannyS712) [12:50:04] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: d27fe78: Enable partial blocks on eswiki (T239370) (duration: 01m 00s) [12:50:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:10] T239370: Enable Partial Blocks on Spanish Wikipedia - https://phabricator.wikimedia.org/T239370 [12:53:25] (03PS1) 10BBlack: dns3002: include authserver [puppet] - 10https://gerrit.wikimedia.org/r/554051 [12:53:27] (03PS1) 10BBlack: dns3002: include in authdns_servers [puppet] - 10https://gerrit.wikimedia.org/r/554052 [12:53:29] (03PS1) 10BBlack: dns2002: include authserver [puppet] - 10https://gerrit.wikimedia.org/r/554053 [12:53:31] (03PS1) 10BBlack: dns2002: include in authdns_servers [puppet] - 10https://gerrit.wikimedia.org/r/554054 [12:54:26] !log EU SWAT done [12:54:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:39] !log mobrovac@deploy1001 Started deploy [restbase/deploy@eedba38] (dev-cluster): Parsoid Proxy: Fixes [12:54:43] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:12] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10jijiki) [12:57:33] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@eedba38] (dev-cluster): Parsoid Proxy: Fixes (duration: 02m 54s) [12:57:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:31] !log mobrovac@deploy1001 Started deploy [restbase/deploy@eedba38]: Parsoid Proxy: Fixes - T229015 [12:59:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:36] T229015: Tracking: Direct live production traffic at Parsoid/PHP - https://phabricator.wikimedia.org/T229015 [13:00:46] RECOVERY - mediawiki-installation DSH group on mw1288 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [13:01:51] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [13:01:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:00] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:04:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:17] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/page/references/{title} (Get references of a test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [13:08:15] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [13:10:37] (03CR) 10BBlack: [C: 03+2] dns3002: include authserver [puppet] - 10https://gerrit.wikimedia.org/r/554051 (owner: 10BBlack) [13:13:29] PROBLEM - Recursive DNS on 2620:0:862:1:91:198:174:62 is CRITICAL: Return code of 255 is out of bounds https://wikitech.wikimedia.org/wiki/DNS [13:13:57] PROBLEM - Check if anycast-healthchecker and all configured threads are running on dns3002 
is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [13:14:20] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@eedba38]: Parsoid Proxy: Fixes - T229015 (duration: 14m 49s) [13:14:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:26] T229015: Tracking: Direct live production traffic at Parsoid/PHP - https://phabricator.wikimedia.org/T229015 [13:14:45] RECOVERY - Recursive DNS on 2620:0:862:1:91:198:174:62 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [13:15:19] RECOVERY - Check if anycast-healthchecker and all configured threads are running on dns3002 is OK: OK: UP (pid=119859) and all threads (1) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [13:17:11] (03CR) 10BBlack: [C: 03+2] dns3002: include in authdns_servers [puppet] - 10https://gerrit.wikimedia.org/r/554052 (owner: 10BBlack) [13:17:57] (03PS1) 10ArielGlenn: fix iteration in dumps rsync script [puppet] - 10https://gerrit.wikimedia.org/r/554058 (https://phabricator.wikimedia.org/T239590) [13:19:57] (03CR) 10ArielGlenn: [C: 03+2] fix iteration in dumps rsync script [puppet] - 10https://gerrit.wikimedia.org/r/554058 (https://phabricator.wikimedia.org/T239590) (owner: 10ArielGlenn) [13:20:41] (03CR) 10BBlack: [C: 03+2] dns2002: include authserver [puppet] - 10https://gerrit.wikimedia.org/r/554053 (owner: 10BBlack) [13:25:30] (03CR) 10BBlack: [C: 03+2] dns2002: include in authdns_servers [puppet] - 10https://gerrit.wikimedia.org/r/554054 (owner: 10BBlack) [13:26:19] 10Operations, 10Wikimedia-Mailing-lists: Create mailing list for project GLOW - https://phabricator.wikimedia.org/T238607 (10Moushira) Thank you. Would you kindly keep the list closed, so that it is not listed? It should be by invitation only. 
Thanks [13:27:45] (03PS3) 10Alexandros Kosiaris: kube-proxy: Allow overriding metrics bind address [puppet] - 10https://gerrit.wikimedia.org/r/554036 [13:27:47] (03PS2) 10Alexandros Kosiaris: prometheus: Scrape kube-proxy metrics [puppet] - 10https://gerrit.wikimedia.org/r/554038 [13:27:49] (03PS1) 10Alexandros Kosiaris: prometheus: Allow keeping FQDNs as targets [puppet] - 10https://gerrit.wikimedia.org/r/554059 [13:27:51] (03PS1) 10Alexandros Kosiaris: calico: Keep FQDNs for calico felix prometheus targets [puppet] - 10https://gerrit.wikimedia.org/r/554060 [13:28:42] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:30:19] (03CR) 10jerkins-bot: [V: 04-1] prometheus: Allow keeping FQDNs as targets [puppet] - 10https://gerrit.wikimedia.org/r/554059 (owner: 10Alexandros Kosiaris) [13:30:39] (03PS1) 10Ema: ATS: revise atsbackend.mtail ttfb buckets [puppet] - 10https://gerrit.wikimedia.org/r/554062 (https://phabricator.wikimedia.org/T238494) [13:30:52] !log Restarted CI Jenkins [13:30:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:28] (03PS1) 10BBlack: dns[15]002: include authserver [puppet] - 10https://gerrit.wikimedia.org/r/554064 (https://phabricator.wikimedia.org/T98006) [13:33:29] (03PS1) 10BBlack: dns[15]002: include in authdns_servers [puppet] - 10https://gerrit.wikimedia.org/r/554065 (https://phabricator.wikimedia.org/T98006) [13:35:36] (03CR) 10BBlack: [C: 03+2] dns[15]002: include authserver [puppet] - 10https://gerrit.wikimedia.org/r/554064 (https://phabricator.wikimedia.org/T98006) (owner: 10BBlack) [13:40:41] (03CR) 10BBlack: [C: 03+2] dns[15]002: include in authdns_servers [puppet] - 10https://gerrit.wikimedia.org/r/554065 
(https://phabricator.wikimedia.org/T98006) (owner: 10BBlack) [13:41:46] (03PS2) 10Ema: ATS: revise atsbackend.mtail ttfb buckets [puppet] - 10https://gerrit.wikimedia.org/r/554062 (https://phabricator.wikimedia.org/T238494) [13:42:29] (03CR) 10Filippo Giunchedi: install_server: apply standard partman recipes, take #1 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/553483 (https://phabricator.wikimedia.org/T156955) (owner: 10Filippo Giunchedi) [13:43:22] (03PS1) 10ArielGlenn: explicitly start rpc.statd service on dumpsdata servers [puppet] - 10https://gerrit.wikimedia.org/r/554066 (https://phabricator.wikimedia.org/T239401) [13:43:22] Amir1: thanks I'm going to keep an eye, have you noticed something on the graphs? [13:43:53] marostegui: not yet, but also it takes some time to take effect [13:44:13] !log Restarted CI Jenkins [13:44:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:10] anybody already checking mw fatals? [13:47:41] !log power-cycle cp3053 T239041 [13:47:42] (03PS5) 10Filippo Giunchedi: install_server: standard recipe and raid1/raid10 [puppet] - 10https://gerrit.wikimedia.org/r/553363 (https://phabricator.wikimedia.org/T156955) [13:47:44] (03PS5) 10Filippo Giunchedi: install_server: apply standard partman recipes, take #1 [puppet] - 10https://gerrit.wikimedia.org/r/553483 (https://phabricator.wikimedia.org/T156955) [13:47:46] (03PS2) 10Filippo Giunchedi: WIP role: extend centrallog's /srv if needed [puppet] - 10https://gerrit.wikimedia.org/r/554044 (https://phabricator.wikimedia.org/T156955) [13:47:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:47] T239041: cp3053 is unreachable - https://phabricator.wikimedia.org/T239041 [13:48:21] (03CR) 10Filippo Giunchedi: [C: 03+1] ATS: revise atsbackend.mtail ttfb buckets [puppet] - 10https://gerrit.wikimedia.org/r/554062 (https://phabricator.wikimedia.org/T238494) (owner: 10Ema) [13:48:37] (03PS1) 10BBlack: dns[35]001: include 
authserver [puppet] - 10https://gerrit.wikimedia.org/r/554067 [13:48:40] (03PS1) 10BBlack: dns[35]001: include in authdns_servers [puppet] - 10https://gerrit.wikimedia.org/r/554068 [13:51:07] RECOVERY - Host cp3053 is UP: PING OK - Packet loss = 0%, RTA = 83.47 ms [13:54:24] (03CR) 10Ema: [C: 03+2] ATS: revise atsbackend.mtail ttfb buckets [puppet] - 10https://gerrit.wikimedia.org/r/554062 (https://phabricator.wikimedia.org/T238494) (owner: 10Ema) [13:55:29] (03PS2) 10BBlack: dns[35]001: include authserver [puppet] - 10https://gerrit.wikimedia.org/r/554067 (https://phabricator.wikimedia.org/T98006) [13:55:31] (03PS2) 10BBlack: dns[35]001: include in authdns_servers [puppet] - 10https://gerrit.wikimedia.org/r/554068 (https://phabricator.wikimedia.org/T98006) [13:55:44] (03PS2) 10ArielGlenn: explicitly start rpc.statd service on dumpsdata servers [puppet] - 10https://gerrit.wikimedia.org/r/554066 (https://phabricator.wikimedia.org/T239401) [13:55:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1075 after schema change', diff saved to https://phabricator.wikimedia.org/P9793 and previous config saved to /var/cache/conftool/dbconfig/20191202-135543-marostegui.json [13:55:46] 10Operations, 10observability, 10Patch-For-Review, 10User-fgiunchedi: Provide authenticated access to Prometheus native web interface - https://phabricator.wikimedia.org/T151009 (10fgiunchedi) >>! In T151009#5704732, @jbond wrote: > Im tempted to add this directly to apereo cas (time permitting) however im... [13:55:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:48] (03PS1) 10Jbond: apere_cas: add local host and tag prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/554070 [13:55:50] 10Operations, 10ops-esams, 10Traffic: cp3053 is unreachable - https://phabricator.wikimedia.org/T239041 (10MoritzMuehlenhoff) As mentioned in last week's SRE meeting, let's upgrade the firmware to the latest revision cpn cp3053? 
[13:56:26] (03CR) 10jerkins-bot: [V: 04-1] explicitly start rpc.statd service on dumpsdata servers [puppet] - 10https://gerrit.wikimedia.org/r/554066 (https://phabricator.wikimedia.org/T239401) (owner: 10ArielGlenn) [13:56:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1078 for schema change', diff saved to https://phabricator.wikimedia.org/P9794 and previous config saved to /var/cache/conftool/dbconfig/20191202-135643-marostegui.json [13:56:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:28] (03CR) 10jerkins-bot: [V: 04-1] apere_cas: add local host and tag prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/554070 (owner: 10Jbond) [13:58:22] (03PS3) 10ArielGlenn: explicitly start rpc.statd service on dumpsdata servers [puppet] - 10https://gerrit.wikimedia.org/r/554066 (https://phabricator.wikimedia.org/T239401) [13:58:58] (03PS2) 10Jbond: apere_cas: add local host and tag prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/554070 [13:59:04] (03CR) 10BBlack: [C: 03+2] dns[35]001: include authserver [puppet] - 10https://gerrit.wikimedia.org/r/554067 (https://phabricator.wikimedia.org/T98006) (owner: 10BBlack) [14:01:21] (03CR) 10Marostegui: [C: 03+2] control-mariadb-10.3*: Upgrade version [software] - 10https://gerrit.wikimedia.org/r/553895 (owner: 10Marostegui) [14:01:47] (03Merged) 10jenkins-bot: control-mariadb-10.3*: Upgrade version [software] - 10https://gerrit.wikimedia.org/r/553895 (owner: 10Marostegui) [14:01:54] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, per https://mgysel.ch/monitor-spring-boot-metrics-using-prometheus-and-grafana/ that should do the right thing." 
[puppet] - 10https://gerrit.wikimedia.org/r/554070 (owner: 10Jbond) [14:02:33] PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [14:02:42] (03CR) 10Jbond: [C: 03+2] apere_cas: add local host and tag prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/554070 (owner: 10Jbond) [14:02:44] (03CR) 10RLazarus: [C: 03+2] "Oops, I had this all lined up to merge just before 551250, and then when the time came I forgot about it. Putting it in now." [puppet] - 10https://gerrit.wikimedia.org/r/551249 (https://phabricator.wikimedia.org/T236699) (owner: 10RLazarus) [14:03:45] jbond42: mind if I merge both? [14:03:51] yes please do [14:04:13] (03CR) 10BBlack: [C: 03+2] dns[35]001: include in authdns_servers [puppet] - 10https://gerrit.wikimedia.org/r/554068 (https://phabricator.wikimedia.org/T98006) (owner: 10BBlack) [14:04:43] jbond has lock? [14:04:59] yes there are 3 to merge now: bblack, rlazarus and mine. [14:05:04] doing all 3 :) [14:05:09] great thanks [14:05:15] haha I typoed "multiple" and then when I tried again jbond had it [14:05:25] RECOVERY - High average POST latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [14:05:57] thanks bblack [14:08:02] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [14:08:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:19] (03PS4) 10Alexandros Kosiaris: kube-proxy: Allow overriding metrics bind address [puppet] - 10https://gerrit.wikimedia.org/r/554036 [14:08:21] (03PS3) 10Alexandros Kosiaris: prometheus: Scrape kube-proxy metrics [puppet] - 10https://gerrit.wikimedia.org/r/554038 [14:08:23] (03PS2) 10Alexandros Kosiaris: prometheus: Allow keeping FQDNs as targets [puppet] - 10https://gerrit.wikimedia.org/r/554059 [14:08:25] (03PS2) 10Alexandros Kosiaris: calico: Keep FQDNs for calico felix prometheus targets [puppet] - 10https://gerrit.wikimedia.org/r/554060 [14:10:00] (03PS1) 10BBlack: test commit [dns] - 10https://gerrit.wikimedia.org/r/554072 [14:10:04] 10Operations, 10Patch-For-Review: Standardizing our partman recipes - https://phabricator.wikimedia.org/T156955 (10fgiunchedi) [14:10:11] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:10:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:39] (03CR) 10jerkins-bot: [V: 04-1] prometheus: Allow keeping FQDNs as targets [puppet] - 10https://gerrit.wikimedia.org/r/554059 (owner: 10Alexandros Kosiaris) [14:10:41] (03CR) 10BBlack: [C: 03+2] test commit [dns] - 10https://gerrit.wikimedia.org/r/554072 (owner: 10BBlack) [14:13:43] 10Operations, 10User-jbond: Collects metrics for CAS - https://phabricator.wikimedia.org/T233934 (10fgiunchedi) While talking metrics and such for java, please consider also adding `jmx_exporter` (in addition to the native metrics) to CAS' jvm as we are doing for other JVMs across the fleet in {T177197} 
[14:13:59] (03PS4) 10ArielGlenn: explicitly start rpc.statd service on dumpsdata servers [puppet] - 10https://gerrit.wikimedia.org/r/554066 (https://phabricator.wikimedia.org/T239401) [14:14:49] (03CR) 10ArielGlenn: [C: 03+2] explicitly start rpc.statd service on dumpsdata servers [puppet] - 10https://gerrit.wikimedia.org/r/554066 (https://phabricator.wikimedia.org/T239401) (owner: 10ArielGlenn) [14:15:03] PROBLEM - PHP opcache health on mw2229 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:15:55] (03PS1) 10BBlack: dns[12]001: include authserver [puppet] - 10https://gerrit.wikimedia.org/r/554073 [14:15:57] (03PS1) 10BBlack: dns[12]001: include in authdns_servers [puppet] - 10https://gerrit.wikimedia.org/r/554074 [14:16:38] 10Operations, 10ops-esams, 10Traffic: cp3053 is unreachable - https://phabricator.wikimedia.org/T239041 (10MoritzMuehlenhoff) @RobH As you offered help in the SRE meeting last Monday, can you upgrade the firmware on cp3053? [14:16:54] (03PS2) 10BBlack: dns[12]001: include authserver [puppet] - 10https://gerrit.wikimedia.org/r/554073 (https://phabricator.wikimedia.org/T98006) [14:16:56] (03PS2) 10BBlack: dns[12]001: include in authdns_servers [puppet] - 10https://gerrit.wikimedia.org/r/554074 (https://phabricator.wikimedia.org/T98006) [14:17:37] (03CR) 10Volans: [C: 04-1] "LGTM, one typo though." 
(031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/554023 (owner: 10Elukey) [14:18:08] !log set grafana theme back to light, was dark for some reason [14:18:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:46] (03CR) 10BBlack: [C: 03+2] dns[12]001: include authserver [puppet] - 10https://gerrit.wikimedia.org/r/554073 (https://phabricator.wikimedia.org/T98006) (owner: 10BBlack) [14:19:09] (03CR) 10Elukey: Add a cookbook to roll restart the AQS nodejs service (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/554023 (owner: 10Elukey) [14:20:19] (03PS5) 10Elukey: Add a cookbook to roll restart the AQS nodejs service [cookbooks] - 10https://gerrit.wikimedia.org/r/554023 [14:21:33] RECOVERY - PHP opcache health on mw2229 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:23:53] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:25:30] 10Operations, 10DBA, 10MediaWiki-General: Evaluate and decide the future of relational datastore at WMF after the upgrade of MariaDB 10.1 is finished - https://phabricator.wikimedia.org/T193224 (10Marostegui) db1107 is now running the latest 10.3.20 from MariaDB replicating from s1 master and db1114 (which r... 
[14:26:38] (03CR) 10BBlack: [C: 03+2] dns[12]001: include in authdns_servers [puppet] - 10https://gerrit.wikimedia.org/r/554074 (https://phabricator.wikimedia.org/T98006) (owner: 10BBlack) [14:28:58] (03CR) 10Ema: [C: 03+2] ATS: re-use origin server connections for matching IPs [puppet] - 10https://gerrit.wikimedia.org/r/553490 (https://phabricator.wikimedia.org/T238494) (owner: 10Ema) [14:30:28] (03CR) 10Volans: [C: 03+1] "LGTM, thanks a lot for the cookbook. It would be great if you would be able to test it after merging, to make sure it works as expected." [cookbooks] - 10https://gerrit.wikimedia.org/r/554023 (owner: 10Elukey) [14:31:11] (03CR) 10Elukey: "> LGTM, thanks a lot for the cookbook. It would be great if you would" [cookbooks] - 10https://gerrit.wikimedia.org/r/554023 (owner: 10Elukey) [14:31:12] !log cp-ats: merge server_session_sharing.match=2 (https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/553490/) with puppet disabled, test on cp3050 T238494 [14:31:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:18] T238494: 200ms / 50% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 [14:31:26] (03CR) 10Elukey: [C: 03+2] Add a cookbook to roll restart the AQS nodejs service [cookbooks] - 10https://gerrit.wikimedia.org/r/554023 (owner: 10Elukey) [14:32:45] (03PS1) 10Arturo Borrero Gonzalez: toolforge: new k8s: introduce apt pinning configurations [puppet] - 10https://gerrit.wikimedia.org/r/554076 (https://phabricator.wikimedia.org/T239409) [14:33:31] (03PS2) 10ArielGlenn: make cirrussearch dumps write output to a temp file, then move into place [puppet] - 10https://gerrit.wikimedia.org/r/553746 (https://phabricator.wikimedia.org/T238646) [14:33:32] (03CR) 10Jhedden: [C: 03+1] keystone-paste: remove refs to simple_cert_extension [puppet] - 10https://gerrit.wikimedia.org/r/553880 (owner: 10Andrew Bogott) [14:34:26] (03CR) 10ArielGlenn: [C: 03+2] make cirrussearch dumps write 
output to a temp file, then move into place [puppet] - 10https://gerrit.wikimedia.org/r/553746 (https://phabricator.wikimedia.org/T238646) (owner: 10ArielGlenn) [14:34:35] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:41:15] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [14:41:24] (03PS1) 10Volans: zone_validator: better detection of mgmt ORIGINs [dns] - 10https://gerrit.wikimedia.org/r/554078 [14:41:26] (03PS1) 10Volans: frack: add missing asset tag management records [dns] - 10https://gerrit.wikimedia.org/r/554079 (https://phabricator.wikimedia.org/T239597) [14:41:28] (03PS1) 10Volans: eqsin: add missing mgmt records [dns] - 10https://gerrit.wikimedia.org/r/554080 (https://phabricator.wikimedia.org/T239597) [14:41:29] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [14:41:30] (03PS1) 10Volans: eqiad: add missing mgmt records [dns] - 10https://gerrit.wikimedia.org/r/554081 (https://phabricator.wikimedia.org/T239597) [14:41:32] (03PS1) 10Volans: codfw: add missing mgmt records [dns] - 10https://gerrit.wikimedia.org/r/554082 (https://phabricator.wikimedia.org/T239597) [14:41:34] (03PS1) 10Volans: codfw: remove old wezen record [dns] - 10https://gerrit.wikimedia.org/r/554083 (https://phabricator.wikimedia.org/T224564) [14:41:37] 
PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [14:41:41] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [14:41:57] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [14:42:03] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [14:42:25] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [14:43:09] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:43:11] PROBLEM - recommendation_api endpoints health on 
scb2002 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [14:44:17] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [14:44:27] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [14:45:25] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [14:45:37] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [14:45:59] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [14:46:11] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [14:46:24] !log cp-ats: set server_session_sharing.match=2 everywhere (puppet re-enable and run) T238494 [14:46:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:29] T238494: 200ms / 50% response start 
regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 [14:46:46] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [14:47:10] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [14:49:27] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [14:49:38] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [14:49:57] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [14:50:22] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [14:50:26] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [14:50:36] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [14:51:38] 10Operations, 10serviceops: 
Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1296.eqiad.wmnet', 'mw1298.eqiad.wmnet'] ` and were **ALL** successful. [14:51:58] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [14:52:57] 10Operations, 10Traffic, 10netops, 10Patch-For-Review, 10Performance-Team (Radar): Anycast AuthDNS - https://phabricator.wikimedia.org/T98006 (10BBlack) Where we're at now: * There are 13x authdns servers participating in `authdns-update`: ** The 3 traditional ones (`authdns1001`, `authdns1002`, `ganeti3... [14:53:08] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [14:53:36] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [14:53:42] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [14:58:24] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [14:59:31] (03PS2) 10Jhedden: tools: add qdisc node collector to tools bastion [puppet] - 10https://gerrit.wikimedia.org/r/553815 [15:00:31] 10Operations, 10Wikimedia-Mailing-lists: Create mailing list for project GLOW - https://phabricator.wikimedia.org/T238607 (10jbond) @Moushira As the admin you are able to update any settings for the mailing list and should be the responsible party to ensure the settings are correct. However we are here to he... 
[15:00:38] (03CR) 10Jhedden: [C: 03+2] tools: add qdisc node collector to tools bastion [puppet] - 10https://gerrit.wikimedia.org/r/553815 (owner: 10Jhedden) [15:02:30] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [15:06:45] (03CR) 10Ottomata: Public cache routing for eventgate-logging-external (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/551247 (https://phabricator.wikimedia.org/T236386) (owner: 10Ottomata) [15:10:19] (03PS8) 10Herron: logstash: create elk7 ES role and assign to elk7 ES hw hosts [puppet] - 10https://gerrit.wikimedia.org/r/552837 (https://phabricator.wikimedia.org/T234854) [15:10:49] (03PS1) 10Elukey: spicerack.remote: fix docstring return type [software/spicerack] - 10https://gerrit.wikimedia.org/r/554087 [15:12:22] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [15:12:26] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [15:12:28] (03CR) 10Ottomata: Public cache routing for eventgate-logging-external (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/551247 (https://phabricator.wikimedia.org/T236386) (owner: 10Ottomata) [15:12:47] (03PS5) 10Alexandros Kosiaris: kube-proxy: Allow overriding metrics bind address [puppet] - 10https://gerrit.wikimedia.org/r/554036 [15:12:49] (03PS4) 10Alexandros Kosiaris: prometheus: Scrape kube-proxy metrics [puppet] - 10https://gerrit.wikimedia.org/r/554038 [15:12:51] (03PS3) 10Alexandros Kosiaris: prometheus: Allow 
keeping FQDNs as targets [puppet] - 10https://gerrit.wikimedia.org/r/554059 [15:12:53] (03PS3) 10Alexandros Kosiaris: calico: Keep FQDNs for calico felix prometheus targets [puppet] - 10https://gerrit.wikimedia.org/r/554060 [15:13:04] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [15:13:16] 10Operations, 10Parsoid-PHP, 10serviceops, 10Wikimedia-production-error: wt2html: Out of memory crashers - https://phabricator.wikimedia.org/T236833 (10ssastry) >>! In T236833#5692042, @jcrespo wrote: > This is ongoing, so adding production error tag: This will be more or less ongoing forever. At best, we... [15:13:18] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [15:13:34] 10Operations, 10User-jbond: Collects metrics for CAS - https://phabricator.wikimedia.org/T233934 (10jbond) Added [[https://grafana-next.wikimedia.org/d/spring_boot_21/spring-boot-statistics?orgId=1 | New Grafana dashboard ]] >>! In T233934#5705096, @fgiunchedi wrote: > While talking metrics and such for java,... 
[15:13:38] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [15:13:42] (03CR) 10Herron: [C: 03+2] logstash: create elk7 ES role and assign to elk7 ES hw hosts [puppet] - 10https://gerrit.wikimedia.org/r/552837 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [15:13:52] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [15:14:06] 10Operations, 10Parsoid-PHP, 10serviceops, 10Wikimedia-production-error: wt2html: Out of memory crashers - https://phabricator.wikimedia.org/T236833 (10ssastry) >>! In T236833#5693660, @Joe wrote: > Patches are merged and the memory limit is now at 760 MB, as confirmed by the current OOMs. Not sure how mu... [15:14:23] (03CR) 10Volans: [C: 03+1] "Good catch!" 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/554087 (owner: 10Elukey) [15:14:25] (03PS1) 10Elukey: sre.aqs.roll-restart.py: fix bug in icinga downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/554089 [15:14:50] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [15:14:56] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [15:15:02] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [15:15:25] !log mobrovac@deploy1001 Started deploy [restbase/deploy@d6d5a6e]: Parsoid Proxy: Do not use the fall-back for linting transforms - T239607 [15:15:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:31] T239607: MediaWiki::restInPeace: transaction round 'MediaWiki\Linter\RecordLintJob::run' still running - https://phabricator.wikimedia.org/T239607 [15:19:15] (03CR) 10Elukey: [C: 03+2] spicerack.remote: fix docstring return type [software/spicerack] - 10https://gerrit.wikimedia.org/r/554087 (owner: 10Elukey) [15:19:42] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are 
healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [15:20:55] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:21:52] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:22:08] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [15:22:30] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [15:22:56] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [15:22:57] (03PS1) 10Ssingh: Add scripts for fetching data from OONI [software/censorship-monitoring] - 10https://gerrit.wikimedia.org/r/554091 [15:23:04] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:23:26] (03Merged) 10jenkins-bot: spicerack.remote: fix docstring return type [software/spicerack] - 10https://gerrit.wikimedia.org/r/554087 (owner: 10Elukey) [15:23:42] 10Operations, 10SRE-Access-Requests, 10WMF-Legal: Requesting access to view EventLogging data 
for Co_WMDE - https://phabricator.wikimedia.org/T234429 (10CorinnaHillebrand_WMDE) Hi @colewhite, I encountered an issue indeed. I'm able to connect to the bastion host **bast3004.wikimedia.org** without a problem.... [15:24:04] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) timed out before a response was received: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [15:24:08] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:24:32] (03CR) 10Elukey: [C: 03+2] sre.aqs.roll-restart.py: fix bug in icinga downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/554089 (owner: 10Elukey) [15:25:10] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [15:25:36] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was rece [15:25:36] tech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [15:26:04] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [15:26:22] !log Rolling restart mw1345-1348 [15:26:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:50] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:27:00] PROBLEM - mediawiki-installation DSH group on mw1296 is CRITICAL: Host mw1296 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:27:32] PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a respo [15:27:32] /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:27:38] (03PS1) 10Elukey: sre.aqs.roll-restart.py: use cumin's alias in the remote query [cookbooks] - 10https://gerrit.wikimedia.org/r/554092 [15:27:40] PROBLEM - High average POST latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var- [15:27:56] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:28:00] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:28:49] (03CR) 10Elukey: [C: 03+2] sre.aqs.roll-restart.py: use cumin's alias in the remote query [cookbooks] - 10https://gerrit.wikimedia.org/r/554092 (owner: 10Elukey) [15:28:55] (03PS2) 10Mobrovac: Parsoid: Switch groups 0 and 1 to Parsoid/PHP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552972 (https://phabricator.wikimedia.org/T229015) [15:29:14] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:30:02] RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:30:15] PROBLEM - High average POST latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var- [15:30:16] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@d6d5a6e]: Parsoid Proxy: Do not use the fall-back for linting transforms - T239607 (duration: 14m 51s) [15:30:21] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:21] T239607: MediaWiki::restInPeace: transaction round 'MediaWiki\Linter\RecordLintJob::run' still running - https://phabricator.wikimedia.org/T239607 [15:31:06] PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:31:46] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:31:58] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [15:32:13] 10Operations, 10ops-esams, 10Traffic: cp3053 is unreachable - https://phabricator.wikimedia.org/T239041 (10RobH) It appears cp3053 is online at this time. Can I issue a depool command and shut it down for the firmware update at any time or is further scheduling needed? The process will take about 5-15 minu... 
[15:32:22] RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:32:28] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [15:32:56] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [15:33:02] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:33:04] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [15:33:18] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [15:33:18] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [15:33:18] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [15:33:22] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [15:33:32] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [15:33:52] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [15:34:12] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is 
OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:34:22] RECOVERY - High average POST latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POST [15:35:05] (CR) Mobrovac: [C: +2] Parsoid: Switch groups 0 and 1 to Parsoid/PHP [mediawiki-config] - https://gerrit.wikimedia.org/r/552972 (https://phabricator.wikimedia.org/T229015) (owner: Mobrovac) [15:35:35] !log elukey@cumin1001 START - Cookbook sre.aqs.roll-restart [15:35:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:58] (Merged) jenkins-bot: Parsoid: Switch groups 0 and 1 to Parsoid/PHP [mediawiki-config] - https://gerrit.wikimedia.org/r/552972 (https://phabricator.wikimedia.org/T229015) (owner: Mobrovac) [15:36:48] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [15:38:27] !log mobrovac@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Parsoid VRS: Switch groups 0 and 1 to Parsoid/PHP - T229015 (duration: 00m 59s) [15:38:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:32] T229015: Tracking: Direct live production traffic at Parsoid/PHP - https://phabricator.wikimedia.org/T229015 [15:38:59] !log elukey@cumin1001 END (PASS) - Cookbook sre.aqs.roll-restart (exit_code=0) [15:39:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:23] Operations, Parsoid-PHP, serviceops, Wikimedia-production-error: wt2html: Out of memory crashers - https://phabricator.wikimedia.org/T236833 (jcrespo) > if the expectation is that the production error tag will give this higher priority compared other Parsoid bugs, that is not going to be the case... 
[15:53:00] (CR) Papaul: [C: +1] codfw: add missing mgmt records [dns] - https://gerrit.wikimedia.org/r/554082 (https://phabricator.wikimedia.org/T239597) (owner: Volans) [15:53:18] (PS1) Herron: elasticsearch: add buster openjdk 8 repository [puppet] - https://gerrit.wikimedia.org/r/554095 (https://phabricator.wikimedia.org/T234854) [15:54:49] (CR) Papaul: [C: +1] codfw: remove old wezen record [dns] - https://gerrit.wikimedia.org/r/554083 (https://phabricator.wikimedia.org/T224564) (owner: Volans) [15:54:51] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/554095 (https://phabricator.wikimedia.org/T234854) (owner: Herron) [15:56:59] Operations, ops-eqiad, DC-Ops, Patch-For-Review: Hardware asset tag Netbox/DNS mgmt inconsistencies - https://phabricator.wikimedia.org/T239597 (faidon) [15:57:09] (PS2) Herron: elasticsearch: add buster openjdk 8 repository [puppet] - https://gerrit.wikimedia.org/r/554095 (https://phabricator.wikimedia.org/T234854) [15:58:53] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.5958 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [16:00:49] (PS1) Jbond: profile::idp: add jmx prometheus exporter [puppet] - https://gerrit.wikimedia.org/r/554097 (https://phabricator.wikimedia.org/T233934) [16:01:18] (CR) jerkins-bot: [V: -1] profile::idp: add jmx prometheus exporter [puppet] - https://gerrit.wikimedia.org/r/554097 (https://phabricator.wikimedia.org/T233934) (owner: Jbond) [16:02:08] Operations, ops-codfw, DC-Ops, decommission: Decommission db2062.codfw.wmnet - https://phabricator.wikimedia.org/T238726 (Papaul) [16:03:13] (CR) Jbond: "recheck" [puppet] - https://gerrit.wikimedia.org/r/554097 (https://phabricator.wikimedia.org/T233934) (owner: Jbond) [16:04:07] 
(PS2) Jbond: profile::idp: add jmx prometheus exporter [puppet] - https://gerrit.wikimedia.org/r/554097 (https://phabricator.wikimedia.org/T233934) [16:04:16] (PS3) Jbond: profile::idp: add jmx prometheus exporter [puppet] - https://gerrit.wikimedia.org/r/554097 (https://phabricator.wikimedia.org/T233934) [16:04:30] the logstash alert looks to be https://phabricator.wikimedia.org/T239090 happening [16:05:03] (CR) Muehlenhoff: install_server: apply standard partman recipes, take #1 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/553483 (https://phabricator.wikimedia.org/T156955) (owner: Filippo Giunchedi) [16:07:39] Operations, ops-codfw, DC-Ops, decommission: Decommission db2062.codfw.wmnet - https://phabricator.wikimedia.org/T238726 (Papaul) [16:07:41] (PS4) Jbond: profile::idp: add jmx prometheus exporter [puppet] - https://gerrit.wikimedia.org/r/554097 (https://phabricator.wikimedia.org/T233934) [16:07:57] (CR) Jforrester: [C: +1] Add sewikimedia to wikidataclient (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/553449 (https://phabricator.wikimedia.org/T239318) (owner: Urbanecm) [16:08:00] Operations, ops-codfw, DC-Ops, decommission: Decommission db2062.codfw.wmnet - https://phabricator.wikimedia.org/T238726 (Papaul) ` papaul@asw-b-codfw# show | compare [edit interfaces interface-range vlan-private1-b-codfw] - member ge-5/0/40; [edit interfaces interface-range disabled] me... 
[16:09:25] (CR) Filippo Giunchedi: [C: +1] codfw: remove old wezen record [dns] - https://gerrit.wikimedia.org/r/554083 (https://phabricator.wikimedia.org/T224564) (owner: Volans) [16:09:28] (CR) Arturo Borrero Gonzalez: [C: +2] toolforge: new k8s: introduce apt pinning configurations [puppet] - https://gerrit.wikimedia.org/r/554076 (https://phabricator.wikimedia.org/T239409) (owner: Arturo Borrero Gonzalez) [16:10:03] Operations, ops-codfw: Decommission old mw2231/WMF6435 replaced with WMF6403 - https://phabricator.wikimedia.org/T232126 (Papaul) @Volans Thanks [16:10:27] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.008333 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [16:10:36] Operations, ops-esams, Traffic: cp3053 is unreachable - https://phabricator.wikimedia.org/T239041 (RobH) confirmed with @ema that depool via command line and power off is fine, moving on to flashing firmware. 
[16:10:51] !log cp3035 depooling and rebooting for firmware update T239041 [16:10:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:56] T239041: cp3053 is unreachable - https://phabricator.wikimedia.org/T239041 [16:11:01] !log cp3053 depooling and rebooting for firmware update T239041 [16:11:05] there is no cp3035 heh [16:11:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:16] (CR) Herron: [C: +2] elasticsearch: add buster openjdk 8 repository [puppet] - https://gerrit.wikimedia.org/r/554095 (https://phabricator.wikimedia.org/T234854) (owner: Herron) [16:12:47] (CR) Filippo Giunchedi: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/554038 (owner: Alexandros Kosiaris) [16:13:45] (CR) Filippo Giunchedi: [C: +1] prometheus: Allow keeping FQDNs as targets [puppet] - https://gerrit.wikimedia.org/r/554059 (owner: Alexandros Kosiaris) [16:14:07] PROBLEM - Check systemd state on idp1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:14:10] (CR) Filippo Giunchedi: [C: +1] "Out of curiosity, which cases do you have in mind?" [puppet] - https://gerrit.wikimedia.org/r/554059 (owner: Alexandros Kosiaris) [16:14:23] PROBLEM - Host cp3053 is DOWN: PING CRITICAL - Packet loss = 100% [16:14:27] (PS5) Jbond: profile::idp: add jmx prometheus exporter [puppet] - https://gerrit.wikimedia.org/r/554097 (https://phabricator.wikimedia.org/T233934) [16:15:01] (CR) RLazarus: [C: +1] "Thanks! This is my first package release so I'm definitely feeling around in the dark a little." 
[debs/poolcounter-prometheus-exporter] - https://gerrit.wikimedia.org/r/553736 (owner: Hashar) [16:15:31] (CR) Filippo Giunchedi: [C: +1] calico: Keep FQDNs for calico felix prometheus targets [puppet] - https://gerrit.wikimedia.org/r/554060 (owner: Alexandros Kosiaris) [16:15:53] RECOVERY - Check systemd state on idp1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:17:06] (PS6) Jbond: profile::idp: add jmx prometheus exporter [puppet] - https://gerrit.wikimedia.org/r/554097 (https://phabricator.wikimedia.org/T233934) [16:19:23] !log reimage mw1295.eqiad.wmnet mw1294.eqiad.wmnet mw1293.eqiad.wmnet [16:19:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:56] Operations, serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mw1295.eqiad.wmnet', 'mw1294.eqiad.wmnet', 'mw1293.eqiad.wmnet'] ` The log can be found in `/var/log/... [16:22:41] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:23:52] (CR) Alexandros Kosiaris: "Thanks!" 
[puppet] - https://gerrit.wikimedia.org/r/554059 (owner: Alexandros Kosiaris) [16:24:01] (CR) Alexandros Kosiaris: [C: +2] kube-proxy: Allow overriding metrics bind address [puppet] - https://gerrit.wikimedia.org/r/554036 (owner: Alexandros Kosiaris) [16:27:33] RECOVERY - Host cp3053 is UP: PING OK - Packet loss = 0%, RTA = 83.36 ms [16:27:45] !log mobrovac@deploy1001 Started deploy [restbase/deploy@3516382]: Switch ru, sr and zh wikipediae to Parsoid/PHP - T229015 [16:27:49] RECOVERY - mediawiki-installation DSH group on mw1296 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:27:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:50] T229015: Tracking: Direct live production traffic at Parsoid/PHP - https://phabricator.wikimedia.org/T229015 [16:29:30] (PS1) Herron: logstash: set elk7 es config_version to 7 [puppet] - https://gerrit.wikimedia.org/r/554101 (https://phabricator.wikimedia.org/T234854) [16:31:46] Operations, ops-esams, Traffic: cp3053 is unreachable - https://phabricator.wikimedia.org/T239041 (RobH) a:Vgutierrez→ema All ilom and bios updated, irc update to @ema and handing this back to #traffic. 
[16:32:09] !log cp3053: repooling after firmware update T239041 [16:32:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:14] T239041: cp3053 is unreachable - https://phabricator.wikimedia.org/T239041 [16:33:07] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.5833 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [16:33:12] (CR) Herron: [C: +2] logstash: set elk7 es config_version to 7 [puppet] - https://gerrit.wikimedia.org/r/554101 (https://phabricator.wikimedia.org/T234854) (owner: Herron) [16:34:53] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [16:35:15] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:35:25] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:35:59] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received: /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) 
timed out before a response was received: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received: /{domain}/v1/transform/html/to/mobile-html/ [16:35:59] iew mobile HTML for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [16:36:21] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out befor [16:36:21] received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:36:21] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before [16:36:21] eceived: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was 
received: /{domain}/v1/article/cre [16:36:21] eed} (article.creation.morelike - bad article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:36:22] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before [16:36:22] eceived: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received: /{domain}/v1/article/cre [16:36:23] eed} (article.creation.morelike - bad article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:36:27] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response wa [16:36:27] ain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:36:39] PROBLEM - recommendation_api endpoints 
health on scb1002 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response wa [16:36:39] ain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - bad article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:36:43] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:36:50] ^ looking [16:37:13] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:37:39] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [16:38:03] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:38:09] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:38:11] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:38:23] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:38:27] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:38:43] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=idp site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:38:49] PROBLEM - Nginx local proxy to apache on mw1344 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:38:51] PROBLEM - PHP7 rendering on mw1316 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:38:51] PROBLEM - Nginx local proxy to apache on mw1339 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:38:51] PROBLEM - Nginx local proxy to apache on mw1347 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:38:51] PROBLEM - PHP7 rendering on mw1341 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:38:55] PROBLEM - PHP7 rendering on mw1344 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:38:57] PROBLEM - Nginx local proxy to apache on mw1346 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:38:57] PROBLEM - PHP7 rendering on mw1348 is CRITICAL: CRITICAL - Socket timeout after 10 
seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:38:59] PROBLEM - Apache HTTP on mw1314 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:39:01] PROBLEM - Apache HTTP on mw1227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:39:01] PROBLEM - PHP7 rendering on mw1228 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:39:01] PROBLEM - Apache HTTP on mw1345 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:39:01] PROBLEM - Apache HTTP on mw1344 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:39:03] uh [16:39:03] PROBLEM - Nginx local proxy to apache on mw1225 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:39:03] PROBLEM - Nginx local proxy to apache on mw1221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:39:03] PROBLEM - Nginx local proxy to apache on mw1228 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:39:03] PROBLEM - Apache HTTP on mw1343 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:39:05] PROBLEM - Apache HTTP on mw1341 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:39:05] PROBLEM - PHP7 rendering on mw1221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:39:05] PROBLEM - PHP7 rendering on mw1277 is CRITICAL: CRITICAL - Socket timeout 
after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:39:05] PROBLEM - Apache HTTP on mw1221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:39:07] PROBLEM - Nginx local proxy to apache on mw1222 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:39:07] PROBLEM - Apache HTTP on mw1290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:39:09] PROBLEM - Apache HTTP on mw1346 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:39:21] PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out be [16:39:21] as received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:39:50] that's a lot of apaches timing out [16:39:55] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:40:26] Operations, Continuous-Integration-Infrastructure, Release-Engineering-Team-TODO, Traffic: https://releases-jenkins.wikimedia.org yields a 502 unrecheable - https://phabricator.wikimedia.org/T239629 (hashar) [16:40:27] RECOVERY - Nginx local proxy to apache on mw1339 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.044 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:40:29] RECOVERY - PHP7 rendering on mw1316 is OK: HTTP OK: HTTP/1.1 200 
OK - 80020 bytes in 0.156 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:40:29] RECOVERY - Nginx local proxy to apache on mw1347 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:40:31] RECOVERY - Nginx local proxy to apache on mw1344 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 3.753 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:40:31] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:40:33] RECOVERY - PHP7 rendering on mw1341 is OK: HTTP OK: HTTP/1.1 200 OK - 80021 bytes in 4.275 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:40:35] RECOVERY - PHP7 rendering on mw1344 is OK: HTTP OK: HTTP/1.1 200 OK - 80021 bytes in 3.702 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:40:37] RECOVERY - Nginx local proxy to apache on mw1346 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 3.092 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:40:37] RECOVERY - Apache HTTP on mw1227 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.042 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:40:37] RECOVERY - PHP7 rendering on mw1228 is OK: HTTP OK: HTTP/1.1 200 OK - 80020 bytes in 0.130 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:40:39] RECOVERY - Apache HTTP on mw1314 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 2.610 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:40:39] 
RECOVERY - PHP7 rendering on mw1348 is OK: HTTP OK: HTTP/1.1 200 OK - 80021 bytes in 3.569 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:40:39] RECOVERY - Nginx local proxy to apache on mw1221 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:40:39] RECOVERY - Nginx local proxy to apache on mw1225 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.060 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:40:41] PROBLEM - Restrouter LVS codfw on restrouter.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [16:40:41] RECOVERY - Apache HTTP on mw1345 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 3.271 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:40:41] RECOVERY - Apache HTTP on mw1344 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 3.545 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:40:41] RECOVERY - Nginx local proxy to apache on mw1228 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:40:43] RECOVERY - Apache HTTP on mw1221 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.042 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:40:43] RECOVERY - PHP7 rendering on mw1277 is OK: HTTP OK: HTTP/1.1 200 OK - 80020 bytes in 0.139 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:40:43] RECOVERY - PHP7 rendering on mw1221 is OK: HTTP OK: HTTP/1.1 200 OK - 80020 bytes in 0.160 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:40:43] RECOVERY - Nginx local proxy to apache on mw1222 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.035 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:40:43] RECOVERY - Apache HTTP on mw1290 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.037 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:40:43] RECOVERY - Apache HTTP on mw1341 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.908 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:40:44] RECOVERY - Apache HTTP on mw1343 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 3.117 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:40:49] RECOVERY - Apache HTTP on mw1346 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 1.817 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:40:58] that’s a lot of apaches recovering [16:41:03] RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:41:38] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@3516382]: Switch ru, sr and zh wikipediae to Parsoid/PHP - T229015 (duration: 13m 53s) [16:41:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:43] T229015: Tracking: Direct live production traffic at Parsoid/PHP - https://phabricator.wikimedia.org/T229015 [16:42:02] !log Restarted CI Jenkins [16:42:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:21] PROBLEM - Restrouter LVS eqiad on restrouter.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk 
page for enwiki Salt article) timed out before a response was received: /en.wikipedia.org/v1/page/references/{title} (Get references from storage) timed out before a response was r [16:42:21] pedia.org/v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a [16:42:21] id) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /en.wikiped [16:42:21] rm/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [16:42:21] PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out be [16:42:22] as received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:42:31] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: 
/en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out be [16:42:31] as received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:42:31] PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out be [16:42:31] as received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:42:35] PROBLEM - termbox eqiad on termbox.svc.eqiad.wmnet is CRITICAL: /termbox (get rendered termbox) timed out before a response was received: /_info (retrieve service info) timed out before a response was received https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [16:42:37] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before [16:42:37] eceived: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received: 
/{domain}/v1/article/cre [16:42:37] eed} (article.creation.morelike - bad article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:42:39] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before [16:42:39] eceived: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received: /{domain}/v1/article/cre [16:42:39] eed} (article.creation.morelike - bad article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:43:06] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [16:43:07] !log restart all API cluster in eqiad [16:43:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:13] PROBLEM - High average POST latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var- [16:43:13] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:33] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:43:51] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:44:03] RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:44:07] RECOVERY - Restrouter LVS eqiad on restrouter.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [16:44:13] RECOVERY - termbox eqiad on termbox.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [16:44:17] RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:45:12] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:45:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:52] would anyone here be able to add me (https://github.com/bearloga/) to https://github.com/wikimedia/? please and thank you :) [16:45:57] PROBLEM - proton endpoints health on proton2002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [16:46:19] RECOVERY - High average POST latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POST [16:46:37] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:46:57] RECOVERY - proton endpoints health on proton2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [16:47:46] (03PS1) 10Herron: logstash: set elk7 es heap_memory to 24G [puppet] - 10https://gerrit.wikimedia.org/r/554103 (https://phabricator.wikimedia.org/T234854) [16:48:38] (03CR) 10Herron: [C: 03+2] logstash: set elk7 es heap_memory to 24G [puppet] - 10https://gerrit.wikimedia.org/r/554103 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [16:49:41] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:49:45] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:50:17] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (bad URL) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [16:51:17] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [16:51:53] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:52:31] RECOVERY - recommendation_api endpoints health on scb2005 is OK: 
All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:52:35] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:53:05] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:54:13] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:54:39] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:54:55] RECOVERY - Restrouter LVS codfw on restrouter.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [16:54:55] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:54:55] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:56:03] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:56:03] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api 
[16:57:23] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:57:23] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:58:41] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:58:59] mobrovac: looks like we're having issues that correlate time-wise with your deploy https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&from=now-6h&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cluster=api_appserver&var-method=GET&var-code=200&fullscreen&panelId=20 [16:59:44] mobrovac: discussion ongoing on #wikimedia-sre, please join! [17:00:05] ema: yes, Parsoid/PHP related, there should be a deploy now [17:00:53] mobrovac: ack thanks [17:00:55] jouncebot: next [17:00:55] In 0 hour(s) and 59 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191202T1800) [17:01:38] 10Operations, 10ops-codfw: Degraded RAID on ganeti2002 - https://phabricator.wikimedia.org/T239009 (10akosiaris) @wiki_willy We haven't even started the process of refreshing those nodes and they host important services. I'd rather we just replaced the disk. [17:02:51] 10Operations, 10ops-eqiad, 10Analytics: Degraded RAID on dbstore1003 - https://phabricator.wikimedia.org/T239217 (10Marostegui) Any update on this? Thanks! [17:05:37] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [17:11:14] 10Operations, 10Puppet, 10netops, 10User-jbond: Investigate improvements to how puppet manages network interfaces - https://phabricator.wikimedia.org/T234207 (10akosiaris) [17:14:06] !log ssastry@deploy1001 Started deploy [parsoid/deploy@743efb0]: Updating Parsoid to ca588b25 + fix broken langconv library / deploy [17:14:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:31] 10Operations: Add Daimona to #mediawiki_security - https://phabricator.wikimedia.org/T239093 (10jbond) a:03RobH [17:21:54] !log ssastry@deploy1001 Finished deploy [parsoid/deploy@743efb0]: Updating Parsoid to ca588b25 + fix broken langconv library / deploy (duration: 07m 48s) [17:21:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:33] !log ppchelko@deploy1001 Started deploy [cpjobqueue/deploy@deafe56]: Followup on cirrusSearchElasticWrite partitioning T230495 [17:28:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:38] T230495: Partition CirrusSearch mediawiki jobs by cluster - https://phabricator.wikimedia.org/T230495 [17:29:47] !log ppchelko@deploy1001 Finished deploy [cpjobqueue/deploy@deafe56]: Followup on cirrusSearchElasticWrite partitioning T230495 (duration: 01m 14s) [17:29:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:21] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group={logstash,logstash-codfw} instance=kafkamon1001:9501 job=burrow partition={0,1,2,3,4,5} site=eqiad topic=udp_localhost-err https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag 
https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometh [17:32:21] er=logging-eqiad&var-topic=All&var-consumer_group=All [17:36:44] (03CR) 10Bstorm: "LGTM...there was some log parsing things Bryan was doing at some point. I don't know if it matters here." [puppet] - 10https://gerrit.wikimedia.org/r/554041 (https://phabricator.wikimedia.org/T238641) (owner: 10Arturo Borrero Gonzalez) [17:41:24] 10Operations: Add Daimona to #mediawiki_security - https://phabricator.wikimedia.org/T239093 (10Dzahn) Approved in SRE meeting. [17:42:43] !log mobrovac@deploy1001 Started deploy [restbase/deploy@ff7862f]: Switch sr and zh wikipediae back to Parsoid/JS - T229015 [17:42:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:48] T229015: Tracking: Direct live production traffic at Parsoid/PHP - https://phabricator.wikimedia.org/T229015 [17:43:34] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` mw2240.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201912021743_dzahn_48992_mw224... [17:45:25] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` mw2239.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201912021745_dzahn_49266_mw223... 
[17:49:08] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [17:49:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:37] (03PS1) 10Phamhi: labmon: update graphite-web to be compatible with buster/stretch [puppet] - 10https://gerrit.wikimedia.org/r/554114 (https://phabricator.wikimedia.org/T224585) [17:50:20] (03CR) 10jerkins-bot: [V: 04-1] labmon: update graphite-web to be compatible with buster/stretch [puppet] - 10https://gerrit.wikimedia.org/r/554114 (https://phabricator.wikimedia.org/T224585) (owner: 10Phamhi) [17:51:17] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:51:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:49] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@ff7862f]: Switch sr and zh wikipediae back to Parsoid/JS - T229015 (duration: 14m 06s) [17:56:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:54] T229015: Tracking: Direct live production traffic at Parsoid/PHP - https://phabricator.wikimedia.org/T229015 [17:56:55] 10Operations, 10SRE-Access-Requests: Requesting access to LogStash for rxy - https://phabricator.wikimedia.org/T239494 (10RStallman-legalteam) Hi @rxy, To create an NDA for you I will need your full name, and both mailing and email address. You can send these details to rstallman[at]wikimedia[dot]org. Best, R... [17:57:39] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` mw2238.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201912021757_dzahn_50879_mw223... 
[17:59:26] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` mw2237.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201912021759_dzahn_51262_mw223... [18:00:04] gehel and onimisionipe: Time to snap out of that daydream and deploy Wikidata Query Service weekly deploy. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191202T1800). [18:00:13] jouncebot: ack [18:00:43] !log onimisionipe@deploy1001 Started deploy [wdqs/wdqs@97d17f6]: New blazegraph and WDQS build plus GUI changes [18:00:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:43] (03PS1) 10Phamhi: labmon: update graphite-web to be compatible with buster/stretch [puppet] - 10https://gerrit.wikimedia.org/r/554115 (https://phabricator.wikimedia.org/T224585) [18:05:15] (03Abandoned) 10Phamhi: labmon: update graphite-web to be compatible with buster/stretch [puppet] - 10https://gerrit.wikimedia.org/r/554114 (https://phabricator.wikimedia.org/T224585) (owner: 10Phamhi) [18:05:51] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [18:05:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:38] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [18:06:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:00] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [18:08:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:00] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [18:10:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:08] 10Operations, 10Cloud-VPS (Debian Jessie Deprecation), 10Patch-For-Review, 10cloud-services-team (Kanban): Migrate labmon* to Stretch (or Buster, better 
yet!) - https://phabricator.wikimedia.org/T224585 (10Phamhi) For new gerrit patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/554115, I have upda... [18:12:20] (03CR) 10Dzahn: "@Paladox I think the solution is to add fake passwords into labs/private, not using "if $realm"." [puppet] - 10https://gerrit.wikimedia.org/r/551268 (https://phabricator.wikimedia.org/T238425) (owner: 10Dzahn) [18:15:17] (03CR) 10Dzahn: "ah, it's the absence of admin groups data in labs, not the passwords. I don't know then. The proper fix seems to be to have fake admin dat" [puppet] - 10https://gerrit.wikimedia.org/r/551268 (https://phabricator.wikimedia.org/T238425) (owner: 10Dzahn) [18:16:25] !log onimisionipe@deploy1001 Finished deploy [wdqs/wdqs@97d17f6]: New blazegraph and WDQS build plus GUI changes (duration: 15m 42s) [18:16:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:45] (03CR) 10Dzahn: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/553200 (owner: 10Jforrester) [18:18:52] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [18:18:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:39] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [18:19:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:00] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [18:21:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:59] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [18:23:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:14] (03CR) 10Jforrester: Turn off redirect on exact search match for commons (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/553510 (https://phabricator.wikimedia.org/T235263) (owner: 10Cparle) [18:27:03] (03CR) 10Jforrester: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/553200 
(owner: 10Jforrester) [18:30:29] !log joal@deploy1001 Started deploy [analytics/refinery@980298b]: Analytics deploy - Fixes for today deploy [18:30:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:20] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'logging-external' . [18:32:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:50] !log joal@deploy1001 Finished deploy [analytics/refinery@980298b]: Analytics deploy - Fixes for today deploy (duration: 08m 21s) [18:38:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:08] !log joal@deploy1001 Started deploy [analytics/refinery@980298b] (thin): Analytics deploy - Fixes for today deploy [18:39:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:14] !log joal@deploy1001 Finished deploy [analytics/refinery@980298b] (thin): Analytics deploy - Fixes for today deploy (duration: 00m 06s) [18:39:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:15] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.525 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [18:44:01] (03CR) 10Dzahn: varnish/ATS: rename director for OTRS from mendelevium to otrs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/553423 (owner: 10Dzahn) [18:49:57] (03PS1) 10Dzahn: add ticket.discovery.wmnet, point to mendelevium [dns] - 10https://gerrit.wikimedia.org/r/554125 (https://phabricator.wikimedia.org/T210411) [18:50:04] !log mobrovac@deploy1001 Started deploy [restbase/deploy@6a24685]: Parsoid Proxy: Direct html2html traffic to JS; Stop honouring the variant header; Switch sr and zh wikis to PHP - T229015 [18:50:09] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:10] T229015: Tracking: Direct live production traffic at Parsoid/PHP - https://phabricator.wikimedia.org/T229015 [18:54:10] (03CR) 10Dzahn: "> Patch Set 1: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/553424 (https://phabricator.wikimedia.org/T210411) (owner: 10Dzahn) [18:54:50] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2240.codfw.wmnet'] ` and were **ALL** successful. [18:55:49] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2239.codfw.wmnet'] ` and were **ALL** successful. [18:56:37] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [18:56:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:46] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [18:58:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:04] RoanKattouw, Niharika, and Urbanecm: (Dis)respected human, time to deploy Morning SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191202T1900). Please do the needful. [19:00:04] Krenair: A patch you scheduled for Morning SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:02:14] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` mw2245.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201912021901_dzahn_65268_mw224... 
[19:03:03] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` mw2234.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201912021902_dzahn_65614_mw223... [19:04:15] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@6a24685]: Parsoid Proxy: Direct html2html traffic to JS; Stop honouring the variant header; Switch sr and zh wikis to PHP - T229015 (duration: 14m 11s) [19:04:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:21] T229015: Tracking: Direct live production traffic at Parsoid/PHP - https://phabricator.wikimedia.org/T229015 [19:04:25] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.525 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [19:09:30] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2238.codfw.wmnet'] ` and were **ALL** successful. [19:11:00] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2237.codfw.wmnet'] ` and were **ALL** successful. 
[19:16:18] !log mobrovac@deploy1001 Started deploy [restbase/deploy@e69e2e5] (dev-cluster): Switch everything but enwiki to Parsoid/PHP [19:16:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:53] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.5125 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [19:18:26] (03CR) 10Dzahn: "We are doing it this way for all the misc things that should have a discovery record but don't have geoDNS" [dns] - 10https://gerrit.wikimedia.org/r/554125 (https://phabricator.wikimedia.org/T210411) (owner: 10Dzahn) [19:22:56] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@e69e2e5] (dev-cluster): Switch everything but enwiki to Parsoid/PHP (duration: 06m 38s) [19:22:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:29] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [19:23:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:39] !log mobrovac@deploy1001 Started deploy [restbase/deploy@e69e2e5]: Switch everything but enwiki to Parsoid/PHP - T229015 [19:23:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:44] T229015: Tracking: Direct live production traffic at Parsoid/PHP - https://phabricator.wikimedia.org/T229015 [19:24:20] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [19:24:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:10] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` mw2246.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201912021923_dzahn_71993_mw224... 
[19:25:37] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ariel on cumin1001.eqiad.wmnet for hosts: ` mw2233.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201912021925_ariel_72277_mw223... [19:25:37] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [19:25:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:46] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ariel on cumin1001.eqiad.wmnet for hosts: ` mw2232.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201912021925_ariel_72297_mw223... [19:26:02] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` mw2247.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201912021925_dzahn_72347_mw224... [19:26:05] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ariel on cumin1001.eqiad.wmnet for hosts: ` mw2231.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201912021925_ariel_72344_mw223... [19:26:18] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ariel on cumin1001.eqiad.wmnet for hosts: ` mw2234.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201912021926_ariel_72413_mw223... 
[19:26:21] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2234.codfw.wmnet'] ` Of which those **FAILED**: ` ['mw2234.codfw.wmnet'] ` [19:27:14] (03PS7) 10Ottomata: Kafka producer TLS support for eventgate charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/551610 (https://phabricator.wikimedia.org/T236386) [19:27:16] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ariel on cumin1001.eqiad.wmnet for hosts: ` mw2234.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201912021927_ariel_72531_mw223... [19:27:21] (03CR) 10Ottomata: Kafka producer TLS support for eventgate charts (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/551610 (https://phabricator.wikimedia.org/T236386) (owner: 10Ottomata) [19:27:47] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [19:27:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:16] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.5125 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [19:28:20] 10Operations, 10Traffic, 10fixcopyright.wikimedia.org, 10Core Platform Team Workboards (Clinic Duty Team), and 3 others: Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (10Jdforrester-WMF) [19:29:08] (03PS8) 10Ottomata: Kafka producer TLS support for eventgate charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/551610 (https://phabricator.wikimedia.org/T236386) [19:30:28] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed 
auto-reimage of hosts: ` ['mw2234.codfw.wmnet'] ` Of which those **FAILED**: ` ['mw2234.codfw.wmnet'] ` [19:33:58] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` mw2230.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201912021933_dzahn_74631_mw223... [19:37:27] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@e69e2e5]: Switch everything but enwiki to Parsoid/PHP - T229015 (duration: 13m 48s) [19:37:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:32] T229015: Tracking: Direct live production traffic at Parsoid/PHP - https://phabricator.wikimedia.org/T229015 [19:39:04] RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [19:40:33] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1295.eqiad.wmnet', 'mw1294.eqiad.wmnet', 'mw1293.eqiad.wmnet'] ` and were **ALL** successful. 
[19:41:30] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.0875 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [19:48:50] !log ariel@cumin1001 START - Cookbook sre.hosts.downtime [19:48:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:38] 10Operations, 10SRE-Access-Requests: Requesting access to production shell for Maryum Styles - https://phabricator.wikimedia.org/T239654 (10Mstyles) [19:50:01] !log ariel@cumin1001 START - Cookbook sre.hosts.downtime [19:50:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:18] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [19:50:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:27] !log ariel@cumin1001 START - Cookbook sre.hosts.downtime [19:50:28] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [19:50:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:32] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'logging-external' . 
[19:50:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:37] (03CR) 10Mstyles: "I created the shell account access request here: https://phabricator.wikimedia.org/T239654" [puppet] - 10https://gerrit.wikimedia.org/r/553219 (https://phabricator.wikimedia.org/T239300) (owner: 10Mstyles) [19:50:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:59] !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [19:51:01] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [19:51:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:15] (03CR) 10Mstyles: ">" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/553219 (https://phabricator.wikimedia.org/T239300) (owner: 10Mstyles) [19:51:36] !log ariel@cumin1001 START - Cookbook sre.hosts.downtime [19:51:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:01] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [19:53:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:01] !log ariel@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [19:55:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:09] !log ariel@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [19:55:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:36] !log ariel@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [19:56:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:18] PROBLEM - nutcracker process on mw2231 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.57: Connection reset by peer https://wikitech.wikimedia.org/wiki/Nutcracker [19:57:18] PROBLEM - mcrouter process on 
mw2232 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.58: Connection reset by peer https://wikitech.wikimedia.org/wiki/Mcrouter [19:57:28] PROBLEM - Nginx local proxy to videoscaler on mw2247 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [19:57:35] (03PS7) 10Herron: logstash: create elk7 logstash role and assign to elk7 collectors [puppet] - 10https://gerrit.wikimedia.org/r/552881 (https://phabricator.wikimedia.org/T234854) [19:59:19] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [19:59:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:36] PROBLEM - mediawiki-installation DSH group on mw2232 is CRITICAL: Host mw2232 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [19:59:36] PROBLEM - nutcracker socket on mw2231 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.57: Connection reset by peer https://wikitech.wikimedia.org/wiki/Nutcracker [19:59:57] !log joal@deploy1001 Started deploy [analytics/refinery@9cd234a]: Analytics deploy - Fixes for today deploy (2) [20:00:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:43] :/ [20:00:53] (03PS2) 10Mstyles: add dcausse user back [puppet] - 10https://gerrit.wikimedia.org/r/553219 (https://phabricator.wikimedia.org/T239300) [20:00:58] guess my thing can wait until the next swat window [20:00:58] PROBLEM - PHP7 jobrunner on mw2247 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [20:01:25] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [20:01:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:01:52] PROBLEM - Apache HTTP on mw2231 is CRITICAL: connect to address 10.192.0.57 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers [20:02:02] PROBLEM - MD RAID on 
mw2231 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.57: Connection reset by peer https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [20:02:02] PROBLEM - php7.2-fpm service on mw2231 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.57: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:02:04] PROBLEM - nutcracker process on mw2232 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.58: Connection reset by peer https://wikitech.wikimedia.org/wiki/Nutcracker [20:02:22] !log mobrovac@deploy1001 Started deploy [restbase/deploy@92acf1e] (dev-cluster): Switch everything to Parsoid/PHP [20:02:23] (03PS2) 10Reedy: deployment-prep: Replace stretch poolcounter with a buster one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/553806 (owner: 10Alex Monk) [20:02:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:26] (03CR) 10Reedy: [C: 03+2] deployment-prep: Replace stretch poolcounter with a buster one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/553806 (owner: 10Alex Monk) [20:02:29] Krenair: That's an easy one [20:02:37] (03CR) 10jerkins-bot: [V: 04-1] add dcausse user back [puppet] - 10https://gerrit.wikimedia.org/r/553219 (https://phabricator.wikimedia.org/T239300) (owner: 10Mstyles) [20:03:05] Reedy: <3 [20:03:07] Reedy: Assuming it works. ;-) [20:03:20] if it doesn't it's equally easy to just roll back [20:03:24] (03Merged) 10jenkins-bot: deployment-prep: Replace stretch poolcounter with a buster one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/553806 (owner: 10Alex Monk) [20:03:35] and all that would happen is you can't edit beta for a bit [20:03:36] so shrug [20:03:43] * James_F nods. 
[20:03:56] those criticals are because the reimage script failed to downtime in icinga >_< [20:04:06] PROBLEM - nutcracker socket on mw2232 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.58: Connection reset by peer https://wikitech.wikimedia.org/wiki/Nutcracker [20:04:06] PROBLEM - Check size of conntrack table on mw2231 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.57: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [20:04:06] PROBLEM - puppet last run on mw2231 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.57: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [20:04:14] PROBLEM - PHP7 rendering on mw2247 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:05:09] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@92acf1e] (dev-cluster): Switch everything to Parsoid/PHP (duration: 02m 48s) [20:05:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:20] !log reedy@deploy1001 Synchronized wmf-config/LabsServices.php: labslabslabs (duration: 01m 08s) [20:05:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:39] 20:05:14 /usr/bin/sudo -u root -- /usr/local/sbin/check-and-restart-php php7.2-fpm 100 on mw2245.codfw.wmnet returned [1]: [20:05:39] We trust you have received the usual lecture from the local System [20:05:39] Administrator. It usually boils down to these three things: [20:05:40] lol [20:05:48] Ha. 
[20:06:06] PROBLEM - Apache HTTP on mw2232 is CRITICAL: connect to address 10.192.0.58 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers [20:06:12] PROBLEM - Check systemd state on mw2231 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.57: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:06:12] PROBLEM - php7.2-fpm service on mw2232 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.58: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:06:12] PROBLEM - MD RAID on mw2232 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.58: Connection reset by peer https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [20:06:58] !log mobrovac@deploy1001 Started deploy [restbase/deploy@92acf1e]: Switch everything to Parsoid/PHP - T229015 [20:07:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:03] T229015: Tracking: Direct live production traffic at Parsoid/PHP - https://phabricator.wikimedia.org/T229015 [20:08:06] !log joal@deploy1001 Finished deploy [analytics/refinery@9cd234a]: Analytics deploy - Fixes for today deploy (2) (duration: 08m 08s) [20:08:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:38] PROBLEM - Nginx local proxy to apache on mw2231 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:09:15] Reedy, stopped poolcounter on the old instance, looks like stuff still works, thank you! [20:09:46] (03CR) 10Ottomata: [C: 03+2] Kafka producer TLS support for eventgate charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/551610 (https://phabricator.wikimedia.org/T236386) (owner: 10Ottomata) [20:10:05] Krenair: Thanks for your work making the new stuff and killing off the old stuff. 
[20:10:24] mobrovac is switching enwiki to use parsoid/php instead of parsoid/js now. [20:10:45] :) [20:12:25] !log joal@deploy1001 Started deploy [analytics/refinery@9cd234a] (thin): Analytics deploy - Fixes for today deploy (2) [20:12:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:31] !log joal@deploy1001 Finished deploy [analytics/refinery@9cd234a] (thin): Analytics deploy - Fixes for today deploy (2) (duration: 00m 05s) [20:12:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:00] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2245.codfw.wmnet'] ` and were **ALL** successful. [20:13:09] (03PS2) 10Phamhi: labmon: update graphite-web to be compatible with buster/stretch [puppet] - 10https://gerrit.wikimedia.org/r/554115 (https://phabricator.wikimedia.org/T224585) [20:13:12] PROBLEM - PHP7 rendering on mw2231 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:13:12] PROBLEM - Nginx local proxy to apache on mw2232 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:14:46] (03PS3) 10Phamhi: labmon: update graphite-web to be compatible with buster/stretch [puppet] - 10https://gerrit.wikimedia.org/r/554115 (https://phabricator.wikimedia.org/T224585) [20:17:53] PROBLEM - mediawiki-installation DSH group on mw2247 is CRITICAL: Host mw2247 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [20:21:57] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@92acf1e]: Switch everything to Parsoid/PHP - T229015 (duration: 14m 59s) [20:22:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:02] T229015: Tracking: Direct live production traffic at 
Parsoid/PHP - https://phabricator.wikimedia.org/T229015 [20:22:05] PROBLEM - PHP7 rendering on mw2232 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:22:57] (PS1) Ottomata: Enable Kafka Producer TLS for eventgate-logging-external [deployment-charts] - https://gerrit.wikimedia.org/r/554144 (https://phabricator.wikimedia.org/T236386) [20:24:00] (CR) Ottomata: [C: +2] Enable Kafka Producer TLS for eventgate-logging-external [deployment-charts] - https://gerrit.wikimedia.org/r/554144 (https://phabricator.wikimedia.org/T236386) (owner: Ottomata) [20:25:47] PROBLEM - Nginx local proxy to jobrunner on mw2247 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [20:25:51] PROBLEM - mediawiki-installation DSH group on mw2231 is CRITICAL: Host mw2231 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [20:26:20] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'logging-external' . 
[20:26:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:05] (03PS1) 10Jhedden: tools: add mountstats to bastion node exporter [puppet] - 10https://gerrit.wikimedia.org/r/554145 [20:27:09] PROBLEM - Check systemd state on mw2247 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.73: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:27:13] PROBLEM - nutcracker socket on mw2247 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.73: Connection reset by peer https://wikitech.wikimedia.org/wiki/Nutcracker [20:27:13] PROBLEM - MD RAID on mw2247 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.73: Connection reset by peer https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [20:27:13] PROBLEM - Check size of conntrack table on mw2247 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.73: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [20:27:13] PROBLEM - php7.2-fpm service on mw2247 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.73: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:27:35] PROBLEM - Check the NTP synchronisation status of timesyncd on mw2247 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.73: Connection reset by peer https://wikitech.wikimedia.org/wiki/NTP [20:27:51] RECOVERY - PHP7 jobrunner on mw2247 is OK: HTTP OK: HTTP/1.1 200 OK - 321 bytes in 0.079 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [20:28:11] RECOVERY - Nginx local proxy to jobrunner on mw2247 is OK: HTTP OK: HTTP/1.1 200 OK - 339 bytes in 0.158 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [20:28:25] RECOVERY - Check systemd state on mw2247 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:28:26] (CR) Jhedden: [C: +2] tools: add mountstats to bastion node exporter [puppet] - https://gerrit.wikimedia.org/r/554145 (owner: Jhedden) [20:28:31] RECOVERY - nutcracker socket on mw2247 is OK: TCP OK - 0.000 second response time on socket /var/run/nutcracker/redis_codfw.sock https://wikitech.wikimedia.org/wiki/Nutcracker [20:28:31] RECOVERY - Check size of conntrack table on mw2247 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [20:28:31] RECOVERY - php7.2-fpm service on mw2247 is OK: OK - php7.2-fpm is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:28:31] RECOVERY - MD RAID on mw2247 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [20:28:49] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 1.113 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [20:29:00] (PS1) Mobrovac: Parsoid VRS: Switch all clients to Parsoid/PHP [mediawiki-config] - https://gerrit.wikimedia.org/r/554146 (https://phabricator.wikimedia.org/T229015) [20:30:19] (CR) Subramanya Sastry: "all clients = Flow right?" [mediawiki-config] - https://gerrit.wikimedia.org/r/554146 (https://phabricator.wikimedia.org/T229015) (owner: Mobrovac) [20:32:03] (CR) Mobrovac: "> all clients = Flow right?" 
[mediawiki-config] - https://gerrit.wikimedia.org/r/554146 (https://phabricator.wikimedia.org/T229015) (owner: Mobrovac) [20:32:19] (PS1) Ottomata: Use Kafka TLS port for eventgate-logging-external [deployment-charts] - https://gerrit.wikimedia.org/r/554148 (https://phabricator.wikimedia.org/T236386) [20:32:52] (CR) Subramanya Sastry: "let's do it." [mediawiki-config] - https://gerrit.wikimedia.org/r/554146 (https://phabricator.wikimedia.org/T229015) (owner: Mobrovac) [20:32:59] RECOVERY - Nginx local proxy to videoscaler on mw2247 is OK: HTTP OK: HTTP/1.1 200 OK - 339 bytes in 0.155 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [20:33:15] (PS2) Ottomata: Use Kafka TLS port for eventgate-logging-external [deployment-charts] - https://gerrit.wikimedia.org/r/554148 (https://phabricator.wikimedia.org/T236386) [20:33:21] PROBLEM - MD RAID on mw2231 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.57: Connection reset by peer https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [20:33:21] PROBLEM - php7.2-fpm service on mw2231 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.57: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:33:39] PROBLEM - nutcracker process on mw2231 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.57: Connection reset by peer https://wikitech.wikimedia.org/wiki/Nutcracker [20:33:53] PROBLEM - PHP opcache health on mw2231 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.57: Connection reset by peer https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [20:33:57] Operations, serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2247.codfw.wmnet'] ` and were **ALL** successful. 
[20:34:02] (CR) Herron: [C: +2] logstash: create elk7 logstash role and assign to elk7 collectors [puppet] - https://gerrit.wikimedia.org/r/552881 (https://phabricator.wikimedia.org/T234854) (owner: Herron) [20:34:21] (CR) Mobrovac: [V: +2 C: +2] Parsoid VRS: Switch all clients to Parsoid/PHP [mediawiki-config] - https://gerrit.wikimedia.org/r/554146 (https://phabricator.wikimedia.org/T229015) (owner: Mobrovac) [20:34:25] RECOVERY - nutcracker socket on mw2231 is OK: TCP OK - 0.000 second response time on socket /var/run/nutcracker/redis_codfw.sock https://wikitech.wikimedia.org/wiki/Nutcracker [20:34:27] (CR) Ottomata: [C: +2] Use Kafka TLS port for eventgate-logging-external [deployment-charts] - https://gerrit.wikimedia.org/r/554148 (https://phabricator.wikimedia.org/T236386) (owner: Ottomata) [20:34:29] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.6875 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [20:35:13] RECOVERY - PHP7 rendering on mw2247 is OK: HTTP OK: HTTP/1.1 200 OK - 321 bytes in 0.079 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:35:15] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'logging-external' . [20:35:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:29] Operations, serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2246.codfw.wmnet'] ` and were **ALL** successful. 
[20:36:14] !log mobrovac@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Switch Flow on all wikis to Parsoid/PHP - T229015 (duration: 00m 59s) [20:36:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:19] T229015: Tracking: Direct live production traffic at Parsoid/PHP - https://phabricator.wikimedia.org/T229015 [20:36:25] RECOVERY - MD RAID on mw2231 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [20:36:25] RECOVERY - php7.2-fpm service on mw2231 is OK: OK - php7.2-fpm is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:36:43] RECOVERY - nutcracker process on mw2231 is OK: PROCS OK: 1 process with UID = 112 (nutcracker), command name nutcracker https://wikitech.wikimedia.org/wiki/Nutcracker [20:36:57] RECOVERY - PHP opcache health on mw2231 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [20:37:08] (03CR) 10Andrew Bogott: [C: 03+2] keystone-paste: remove refs to simple_cert_extension [puppet] - 10https://gerrit.wikimedia.org/r/553880 (owner: 10Andrew Bogott) [20:37:23] RECOVERY - nutcracker socket on mw2232 is OK: TCP OK - 0.000 second response time on socket /var/run/nutcracker/redis_codfw.sock https://wikitech.wikimedia.org/wiki/Nutcracker [20:37:51] RECOVERY - mcrouter process on mw2232 is OK: PROCS OK: 1 process with UID = 113 (mcrouter), command name mcrouter https://wikitech.wikimedia.org/wiki/Mcrouter [20:37:57] RECOVERY - nutcracker process on mw2232 is OK: PROCS OK: 1 process with UID = 112 (nutcracker), command name nutcracker https://wikitech.wikimedia.org/wiki/Nutcracker [20:38:19] RECOVERY - Apache HTTP on mw2232 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 1.502 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:38:39] RECOVERY - PHP7 
rendering on mw2232 is OK: HTTP OK: HTTP/1.1 200 OK - 80036 bytes in 9.859 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:38:49] yeah yeah thanks icinga :-/ [20:39:11] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2231.codfw.wmnet'] ` and were **ALL** successful. [20:39:15] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2233.codfw.wmnet'] ` and were **ALL** successful. [20:39:17] RECOVERY - PHP7 rendering on mw2231 is OK: HTTP OK: HTTP/1.1 200 OK - 80034 bytes in 0.368 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:39:22] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: Hardware asset tag Netbox/DNS mgmt inconsistencies - https://phabricator.wikimedia.org/T239597 (10wiki_willy) a:03Jclark-ctr [20:39:25] RECOVERY - Nginx local proxy to apache on mw2231 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.226 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:39:27] 10Operations, 10SRE-Access-Requests, 10WMF-Legal: Requesting access to view EventLogging data for Co_WMDE - https://phabricator.wikimedia.org/T234429 (10colewhite) Hi Corinna! stat1007 is a private data statistics node and I believe access is controlled by statistics-privatedata-users. @Nuria, would it be... 
[20:39:31] RECOVERY - Apache HTTP on mw2231 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.171 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:39:40] 10Operations, 10SRE-Access-Requests, 10WMF-Legal: Requesting access to view EventLogging data for Co_WMDE - https://phabricator.wikimedia.org/T234429 (10colewhite) 05Resolved→03Open [20:41:29] PROBLEM - php7.2-fpm service on mw2232 is CRITICAL: NRPE: Command check_php7.2-fpm-state not defined https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:41:35] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2234.codfw.wmnet'] ` and were **ALL** successful. [20:41:39] PROBLEM - PHP opcache health on mw2232 is CRITICAL: NRPE: Command check_opcache not defined https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [20:41:43] RECOVERY - Check size of conntrack table on mw2231 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [20:41:43] RECOVERY - puppet last run on mw2231 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [20:42:06] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2232.codfw.wmnet'] ` and were **ALL** successful. 
[20:43:07] RECOVERY - php7.2-fpm service on mw2232 is OK: OK - php7.2-fpm is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:43:17] RECOVERY - PHP opcache health on mw2232 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [20:43:34] 10Operations, 10Analytics-Kanban, 10Better Use Of Data, 10Event-Platform, and 8 others: Set up eventgate-logging-external in production - https://phabricator.wikimedia.org/T236386 (10Ottomata) @akosiaris I merged and applied https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/551610 in staging.... [20:43:55] RECOVERY - Check systemd state on mw2231 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:43:55] RECOVERY - MD RAID on mw2232 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [20:45:43] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.04583 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [20:46:15] !log ariel@cumin1001 conftool action : set/pooled=yes; selector: cluster=appserver,name=mw2231.codfw.wmnet,service=apache2,dc=codfw [20:46:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:26] !log ariel@cumin1001 conftool action : set/pooled=yes; selector: cluster=appserver,name=mw2231.codfw.wmnet,service=nginx,dc=codfw [20:46:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:40] !log ariel@cumin1001 conftool action : set/pooled=yes; selector: name=mw2234.codfw.wmnet,service=apache2,cluster=appserver,dc=codfw [20:46:43] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:46] !log ariel@cumin1001 conftool action : set/pooled=yes; selector: name=mw2234.codfw.wmnet,service=nginx,cluster=appserver,dc=codfw [20:46:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:56] !log ariel@cumin1001 conftool action : set/pooled=yes; selector: name=mw2233.codfw.wmnet,dc=codfw,service=apache2,cluster=appserver [20:47:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:03] !log ariel@cumin1001 conftool action : set/pooled=yes; selector: name=mw2233.codfw.wmnet,dc=codfw,service=nginx,cluster=appserver [20:47:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:16] !log ariel@cumin1001 conftool action : set/pooled=yes; selector: cluster=appserver,name=mw2232.codfw.wmnet,service=apache2,dc=codfw [20:47:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:22] !log ariel@cumin1001 conftool action : set/pooled=yes; selector: cluster=appserver,name=mw2232.codfw.wmnet,service=nginx,dc=codfw [20:47:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:53] RECOVERY - Nginx local proxy to apache on mw2232 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.203 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:47:54] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2230.codfw.wmnet'] ` and were **ALL** successful. 
[20:53:29] (PS5) Andrew Bogott: Neutron l3_agent: remove external_network_bridge config option [puppet] - https://gerrit.wikimedia.org/r/553883 [20:53:31] (PS1) Andrew Bogott: nova.conf ocata: remove [spice] config section [puppet] - https://gerrit.wikimedia.org/r/554151 [20:54:35] (CR) Andrew Bogott: [C: +2] nova.conf ocata: remove [spice] config section [puppet] - https://gerrit.wikimedia.org/r/554151 (owner: Andrew Bogott) [20:57:47] RECOVERY - Check the NTP synchronisation status of timesyncd on mw2247 is OK: OK: synced at Mon 2019-12-02 20:57:46 UTC. https://wikitech.wikimedia.org/wiki/NTP [20:59:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1078 after schema change', diff saved to https://phabricator.wikimedia.org/P9796 and previous config saved to /var/cache/conftool/dbconfig/20191202-205904-marostegui.json [20:59:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:53] RECOVERY - mediawiki-installation DSH group on mw2232 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [21:00:05] cscott, arlolra, subbu, halfak, and accraze: #bothumor My software never has bugs. It just develops random features. Rise for Services – Graphoid / Parsoid / Citoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191202T2100). [21:03:31] Operations, Cloud-VPS (Debian Jessie Deprecation), Patch-For-Review, cloud-services-team (Kanban): Migrate labmon* to Stretch (or Buster, better yet!) 
- https://phabricator.wikimedia.org/T224585 (Phamhi) [21:04:15] (PS1) Herron: logstash: add logstash_package param and set elk7 to -oss variant [puppet] - https://gerrit.wikimedia.org/r/554152 (https://phabricator.wikimedia.org/T234854) [21:05:08] (CR) jerkins-bot: [V: -1] logstash: add logstash_package param and set elk7 to -oss variant [puppet] - https://gerrit.wikimedia.org/r/554152 (https://phabricator.wikimedia.org/T234854) (owner: Herron) [21:06:30] (PS2) Herron: logstash: add logstash_package param and set elk7 to -oss variant [puppet] - https://gerrit.wikimedia.org/r/554152 (https://phabricator.wikimedia.org/T234854) [21:08:09] RECOVERY - mediawiki-installation DSH group on mw2231 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [21:10:50] (CR) Herron: "https://puppet-compiler.wmflabs.org/compiler1001/19721/" [puppet] - https://gerrit.wikimedia.org/r/554152 (https://phabricator.wikimedia.org/T234854) (owner: Herron) [21:11:18] (CR) Herron: [C: +2] logstash: add logstash_package param and set elk7 to -oss variant [puppet] - https://gerrit.wikimedia.org/r/554152 (https://phabricator.wikimedia.org/T234854) (owner: Herron) [21:14:51] (PS1) Mholloway: Bump wikifeeds to 2019-11-27-175327-production [deployment-charts] - https://gerrit.wikimedia.org/r/554154 [21:15:00] (PS5) Ottomata: Set up cache routing for schema.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/549177 (https://phabricator.wikimedia.org/T233630) [21:15:51] PROBLEM - PHP opcache health on mw2240 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [21:15:52] (CR) Mholloway: [C: +2] Bump wikifeeds to 2019-11-27-175327-production [deployment-charts] - https://gerrit.wikimedia.org/r/554154 (owner: Mholloway) [21:16:05] PROBLEM - PHP opcache health on mw2239 is CRITICAL: CRITICAL: opcache cache-hit ratio is 
below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [21:16:07] (03Merged) 10jenkins-bot: Bump wikifeeds to 2019-11-27-175327-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/554154 (owner: 10Mholloway) [21:20:42] (03PS1) 10BBlack: Add new metrics up through pdns 4.1 for buster [debs/prometheus-pdns-rec-exporter] - 10https://gerrit.wikimedia.org/r/554155 [21:20:44] (03PS1) 10BBlack: Bump d/changelog to 0.8 for buster-wikimedia [debs/prometheus-pdns-rec-exporter] - 10https://gerrit.wikimedia.org/r/554156 [21:21:14] (03PS1) 10Herron: kibana: add kibana_package param and set elk7 hosts to -oss variant [puppet] - 10https://gerrit.wikimedia.org/r/554157 (https://phabricator.wikimedia.org/T234854) [21:21:49] (03CR) 10jerkins-bot: [V: 04-1] kibana: add kibana_package param and set elk7 hosts to -oss variant [puppet] - 10https://gerrit.wikimedia.org/r/554157 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [21:22:01] !log mholloway-shell@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'wikifeeds' for release 'staging' . [21:22:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:48] (03PS2) 10Herron: kibana: add kibana_package param and set elk7 hosts to -oss variant [puppet] - 10https://gerrit.wikimedia.org/r/554157 (https://phabricator.wikimedia.org/T234854) [21:23:23] (03CR) 10jerkins-bot: [V: 04-1] kibana: add kibana_package param and set elk7 hosts to -oss variant [puppet] - 10https://gerrit.wikimedia.org/r/554157 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [21:23:46] !log mholloway-shell@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'wikifeeds' for release 'production' . 
[21:23:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:40] (03PS3) 10Herron: kibana: add kibana_package param and set elk7 hosts to -oss variant [puppet] - 10https://gerrit.wikimedia.org/r/554157 (https://phabricator.wikimedia.org/T234854) [21:25:42] !log mholloway-shell@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'wikifeeds' for release 'production' . [21:25:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:59] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2230.codfw.wmnet [21:26:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:32] (03PS1) 10Phamhi: wmcs: don't process lines starting with a comment [puppet] - 10https://gerrit.wikimedia.org/r/554159 (https://phabricator.wikimedia.org/T235743) [21:28:19] (03CR) 10BBlack: [V: 03+2 C: 03+2] Add new metrics up through pdns 4.1 for buster [debs/prometheus-pdns-rec-exporter] - 10https://gerrit.wikimedia.org/r/554155 (owner: 10BBlack) [21:28:22] (03CR) 10BBlack: [C: 03+2] Bump d/changelog to 0.8 for buster-wikimedia [debs/prometheus-pdns-rec-exporter] - 10https://gerrit.wikimedia.org/r/554156 (owner: 10BBlack) [21:28:32] (03CR) 10Bstorm: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/554159 (https://phabricator.wikimedia.org/T235743) (owner: 10Phamhi) [21:28:57] (03CR) 10Phamhi: [C: 03+2] wmcs: don't process lines starting with a comment [puppet] - 10https://gerrit.wikimedia.org/r/554159 (https://phabricator.wikimedia.org/T235743) (owner: 10Phamhi) [21:29:37] 10Operations, 10ops-eqiad, 10Analytics: Degraded RAID on dbstore1003 - https://phabricator.wikimedia.org/T239217 (10Jclark-ctr) just received drive from warehouse [21:31:26] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/compiler1003/19722/" [puppet] - 10https://gerrit.wikimedia.org/r/554157 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [21:31:38] !log dzahn@cumin1001 conftool 
action : set/pooled=yes; selector: name=mw2246.codfw.wmnet [21:31:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:31:46] (03CR) 10Herron: [C: 03+2] kibana: add kibana_package param and set elk7 hosts to -oss variant [puppet] - 10https://gerrit.wikimedia.org/r/554157 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [21:32:03] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2247.codfw.wmnet [21:32:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:32] 10Operations, 10SRE-Access-Requests, 10WMF-Legal: Requesting access to view EventLogging data for Co_WMDE - https://phabricator.wikimedia.org/T234429 (10Nuria) @colewhite group is analytics-private-data-users, as long as NDA is in place this sounds fine. [21:33:45] RECOVERY - PHP opcache health on mw2239 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [21:34:41] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` mw2248.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201912022134_dzahn_104033_mw22... [21:35:15] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` mw2249.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201912022135_dzahn_105363_mw22... [21:36:33] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` mw2251.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201912022136_dzahn_105527_mw22... 
[21:37:38] (03PS1) 10Herron: logstash: set es config_version to 7 on elk7 hosts [puppet] - 10https://gerrit.wikimedia.org/r/554160 (https://phabricator.wikimedia.org/T234854) [21:39:09] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/compiler1001/19723/" [puppet] - 10https://gerrit.wikimedia.org/r/554160 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [21:39:12] (03CR) 10Herron: [C: 03+2] logstash: set es config_version to 7 on elk7 hosts [puppet] - 10https://gerrit.wikimedia.org/r/554160 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [21:39:37] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10Dzahn) [21:42:21] RECOVERY - PHP opcache health on mw2240 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [21:44:49] (03CR) 10Dzahn: [C: 03+2] add ticket.discovery.wmnet, point to mendelevium [dns] - 10https://gerrit.wikimedia.org/r/554125 (https://phabricator.wikimedia.org/T210411) (owner: 10Dzahn) [21:45:24] (03CR) 10Dzahn: "added to DNS now" [puppet] - 10https://gerrit.wikimedia.org/r/553424 (https://phabricator.wikimedia.org/T210411) (owner: 10Dzahn) [21:48:29] (03PS6) 10Bstorm: toolforge-k8s: simplify calico upgrades and distribute calicoctl [puppet] - 10https://gerrit.wikimedia.org/r/553418 [21:51:09] (03CR) 10Bstorm: [C: 03+2] toolforge-k8s: simplify calico upgrades and distribute calicoctl [puppet] - 10https://gerrit.wikimedia.org/r/553418 (owner: 10Bstorm) [21:56:08] (03PS1) 10Cwhite: admin: add cohi to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/554163 (https://phabricator.wikimedia.org/T234429) [21:57:55] (03PS1) 10Ladsgroup: Set item term store for read new up to Q1000 for clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554164 (https://phabricator.wikimedia.org/T225057) [21:59:21] (03CR) 10Cwhite: [C: 03+2] admin: add cohi to analytics-privatedata-users [puppet] - 
10https://gerrit.wikimedia.org/r/554163 (https://phabricator.wikimedia.org/T234429) (owner: 10Cwhite) [21:59:32] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [21:59:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:01] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [22:00:04] Reedy and sbassett: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Weekly Security deployment window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191202T2200). [22:00:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:36] (03CR) 10Dzahn: airflow: move parameters, use lookup, style changes (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/553384 (https://phabricator.wikimedia.org/T236180) (owner: 10Dzahn) [22:00:49] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [22:00:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:01:00] (03PS5) 10Dzahn: airflow: move parameters, use lookup, style changes [puppet] - 10https://gerrit.wikimedia.org/r/553384 (https://phabricator.wikimedia.org/T236180) [22:01:09] bstorm_: toolforge-k8s: simplify calico upgrades and distribute calicoctl (5f468dd164) good to deploy? [22:01:25] Yup! 
sorry, got distrated before running the merge [22:01:31] *distracted [22:01:33] 👍 [22:01:35] (03CR) 10jerkins-bot: [V: 04-1] airflow: move parameters, use lookup, style changes [puppet] - 10https://gerrit.wikimedia.org/r/553384 (https://phabricator.wikimedia.org/T236180) (owner: 10Dzahn) [22:01:42] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [22:01:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:03:50] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [22:03:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:05:49] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [22:05:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:07:52] !log mholloway-shell@deploy1001 Synchronized php-1.35.0-wmf.8/extensions/MachineVision: Update text for no personal uploads message (T238873) (duration: 01m 03s) [22:07:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:07:57] T238873: Add a specific "you haven't uploaded any files yet" view to the CAT tool Uploads tab - https://phabricator.wikimedia.org/T238873 [22:08:07] PROBLEM - mediawiki-installation DSH group on mw2251 is CRITICAL: Host mw2251 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [22:10:29] PROBLEM - nutcracker process on mw2251 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.80: Connection reset by peer https://wikitech.wikimedia.org/wiki/Nutcracker [22:11:14] mw2251 - something went wrong with the reimaging (script) [22:11:47] (03PS6) 10Dzahn: airflow: move parameters, use lookup, style changes [puppet] - 10https://gerrit.wikimedia.org/r/553384 (https://phabricator.wikimedia.org/T236180) [22:12:57] 10Operations, 10ops-eqiad, 10Cloud-Services, 10cloud-services-team (Kanban): cloudstore1008 crash - Memory error - 
https://phabricator.wikimedia.org/T239569 (10Bstorm) [22:13:09] 10Operations, 10ops-eqiad, 10Cloud-Services, 10cloud-services-team (Kanban): cloudstore1008 crash - Memory error - https://phabricator.wikimedia.org/T239569 (10Bstorm) a:03wiki_willy [22:13:39] 10Operations, 10ops-eqiad, 10Cloud-Services, 10cloud-services-team (Kanban): cloudstore1008 crash - Memory error - https://phabricator.wikimedia.org/T239569 (10Bstorm) Please coordinate with me for restarts and downtime. [22:17:32] PROBLEM - PHP opcache health on mw2245 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [22:18:38] RECOVERY - mediawiki-installation DSH group on mw2247 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [22:20:45] 10Operations, 10Traffic, 10netops: Convert DNS servers to Buster - https://phabricator.wikimedia.org/T239667 (10BBlack) [22:23:21] (03PS2) 10Dzahn: ATS: use TLS to noc.wikimedia.org backend [puppet] - 10https://gerrit.wikimedia.org/r/553199 (https://phabricator.wikimedia.org/T210411) [22:23:48] 10Operations, 10Traffic, 10netops: Convert DNS servers to Buster - https://phabricator.wikimedia.org/T239667 (10BBlack) pdns-rec-exporter fixups in: https://gerrit.wikimedia.org/r/#/c/operations/debs/prometheus-pdns-rec-exporter/+/554155/ + https://gerrit.wikimedia.org/r/#/c/operations/debs/prometheus-pdns-r... 
[22:28:04] (03PS1) 10Bstorm: Revert "toolforge-k8s: simplify calico upgrades and distribute calicoctl" [puppet] - 10https://gerrit.wikimedia.org/r/554174 [22:28:17] (03Abandoned) 10Bstorm: Revert "toolforge-k8s: simplify calico upgrades and distribute calicoctl" [puppet] - 10https://gerrit.wikimedia.org/r/554174 (owner: 10Bstorm) [22:28:47] PROBLEM - Wikitech and wt-static content in sync on labweb1001 is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (201591s 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static [22:28:47] PROBLEM - Wikitech and wt-static content in sync on labweb1002 is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (201591s 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static [22:32:34] (03PS1) 10BBlack: dns4002: set PXE to buster install [puppet] - 10https://gerrit.wikimedia.org/r/554175 (https://phabricator.wikimedia.org/T239667) [22:34:34] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2248.codfw.wmnet'] ` and were **ALL** successful. 
[22:34:54] (03CR) 10BBlack: [C: 03+2] dns4002: set PXE to buster install [puppet] - 10https://gerrit.wikimedia.org/r/554175 (https://phabricator.wikimedia.org/T239667) (owner: 10BBlack) [22:37:40] RECOVERY - PHP opcache health on mw2245 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [22:38:55] (03PS1) 10Dzahn: ssl: add noc.wikimedia.org to mwmaint puppet TLS cert [puppet] - 10https://gerrit.wikimedia.org/r/554177 (https://phabricator.wikimedia.org/T210411) [22:39:55] (03PS2) 10Dzahn: ssl: add noc.wikimedia.org to mwmaint puppet TLS cert [puppet] - 10https://gerrit.wikimedia.org/r/554177 (https://phabricator.wikimedia.org/T210411) [22:39:57] (03CR) 10Dzahn: [C: 03+2] ssl: add noc.wikimedia.org to mwmaint puppet TLS cert [puppet] - 10https://gerrit.wikimedia.org/r/554177 (https://phabricator.wikimedia.org/T210411) (owner: 10Dzahn) [22:41:52] (03CR) 10Bstorm: "This has broken toolforge kubernetes (the old, existing cluster) because this arg (--metrics-bind-address) is not valid in that version of" [puppet] - 10https://gerrit.wikimedia.org/r/554036 (owner: 10Alexandros Kosiaris) [22:42:30] PROBLEM - Host mw2251 is DOWN: PING CRITICAL - Packet loss = 100% [22:42:52] RECOVERY - Host mw2251 is UP: PING OK - Packet loss = 0%, RTA = 36.17 ms [22:43:04] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2249.codfw.wmnet'] ` and were **ALL** successful. 
[22:43:54] RECOVERY - nutcracker process on mw2251 is OK: PROCS OK: 1 process with UID = 112 (nutcracker), command name nutcracker https://wikitech.wikimedia.org/wiki/Nutcracker [22:44:37] !log reimaging dns4002 to buster - T239667 [22:44:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:44:43] T239667: Convert DNS servers to Buster - https://phabricator.wikimedia.org/T239667 [22:45:07] (03CR) 10Dzahn: "this works now:" [puppet] - 10https://gerrit.wikimedia.org/r/554177 (https://phabricator.wikimedia.org/T210411) (owner: 10Dzahn) [22:45:12] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: Convert DNS servers to Buster - https://phabricator.wikimedia.org/T239667 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts: ` ['dns4002.wikimedia.org'] ` The log can be found in `/var/log/wmf-auto... [22:45:58] (03PS3) 10Dzahn: ATS: use TLS to noc.wikimedia.org backend [puppet] - 10https://gerrit.wikimedia.org/r/553199 (https://phabricator.wikimedia.org/T210411) [22:46:49] (03CR) 10Dzahn: "This works now:" [puppet] - 10https://gerrit.wikimedia.org/r/553199 (https://phabricator.wikimedia.org/T210411) (owner: 10Dzahn) [22:47:12] (03CR) 10Dzahn: [C: 03+2] ATS: use TLS to noc.wikimedia.org backend [puppet] - 10https://gerrit.wikimedia.org/r/553199 (https://phabricator.wikimedia.org/T210411) (owner: 10Dzahn) [22:47:48] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:47:54] PROBLEM - BFD status on cr3-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:48:16] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job={bird,pdnsrec} site=ulsfo https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:48:57] 10Operations, 10serviceops: Reimage all mediawiki 
servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2251.codfw.wmnet'] ` and were **ALL** successful. [22:51:00] ACKNOWLEDGEMENT - BFD status on cr3-ulsfo is CRITICAL: CRIT: Down: 1 daniel_zahn BFD neighbor IP is the DNS server currently being reimaged https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:51:21] ah yes [22:51:25] ACKNOWLEDGEMENT - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 daniel_zahn BFD neighbor IP is the DNS server currently being reimaged https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:51:28] I hadn't thought about the BFD, thanks! [22:51:37] np, yw [22:51:43] (03PS1) 10Bstorm: kube-proxy: Fix toolforge kube-proxy [puppet] - 10https://gerrit.wikimedia.org/r/554178 (https://phabricator.wikimedia.org/T239670) [22:54:55] i switched noc.wikimedia.org to use TLS behind ATS (envoy on mwmaint1002) [22:55:34] (03CR) 10Alex Monk: [C: 03+1] kube-proxy: Fix toolforge kube-proxy [puppet] - 10https://gerrit.wikimedia.org/r/554178 (https://phabricator.wikimedia.org/T239670) (owner: 10Bstorm) [22:57:12] 10Operations, 10Traffic, 10serviceops, 10Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (10Dzahn) [22:57:59] 10Operations, 10Traffic, 10serviceops, 10Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (10Dzahn) https://noc.wikimedia.org has been switched to use https://mwmaint.discovery.wmnet (envoy on mwmaint1002). [22:59:45] (03PS2) 10Bstorm: kube-proxy: Fix toolforge kube-proxy [puppet] - 10https://gerrit.wikimedia.org/r/554178 (https://phabricator.wikimedia.org/T239670) [22:59:47] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:02:51] !log bblack@cumin1001 START - Cookbook sre.hosts.downtime [23:02:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:05:02] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [23:05:04] !log mw2248 - restart nginx (for some reason unit was running but not listening on 443 after reimage..now it does) [23:05:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:05:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:06:18] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2248.codfw.wmnet [23:06:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:06:42] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2249.codfw.wmnet [23:06:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:08] 10Operations, 10SRE-Access-Requests: Requesting access to LogStash for rxy - https://phabricator.wikimedia.org/T239494 (10Rxy) >>! In T239494#5704445, @jbond wrote: > @Rxy Before enabling access we will first need to ensure we have a valid signed [[ https://wikitech.wikimedia.org/wiki/Volunteer_NDA | NDA ]] on... [23:11:45] 10Operations, 10SRE-Access-Requests: Requesting access to LogStash for rxy - https://phabricator.wikimedia.org/T239494 (10Rxy) >>! In T239494#5704445, @jbond wrote: > @Rxy Before enabling access we will first need to ensure we have a valid signed [[ https://wikitech.wikimedia.org/wiki/Volunteer_NDA | NDA ]] on... 
[23:13:04] (03PS3) 10Bstorm: kube-proxy: Fix toolforge kube-proxy [puppet] - 10https://gerrit.wikimedia.org/r/554178 (https://phabricator.wikimedia.org/T239670) [23:14:07] RECOVERY - BFD status on cr3-ulsfo is OK: OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:15:09] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:15:23] (03CR) 10Bstorm: "https://puppet-compiler.wmflabs.org/compiler1003/19728/tools-proxy-05.tools.eqiad.wmflabs/ Fixed in toolforge." [puppet] - 10https://gerrit.wikimedia.org/r/554178 (https://phabricator.wikimedia.org/T239670) (owner: 10Bstorm) [23:15:45] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.5042 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [23:16:41] PROBLEM - Recursive DNS on 198.35.26.8 is CRITICAL: Return code of 255 is out of bounds https://wikitech.wikimedia.org/wiki/DNS [23:17:37] PROBLEM - Wikitech and wt-static content in sync on cloudweb2001-dev is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (204019s 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static [23:17:46] (03CR) 10Bstorm: "No change on prod https://puppet-compiler.wmflabs.org/compiler1003/19729/" [puppet] - 10https://gerrit.wikimedia.org/r/554178 (https://phabricator.wikimedia.org/T239670) (owner: 10Bstorm) [23:18:01] that downtime doesn't last long I guess [23:18:07] (03CR) 10Alex Monk: [C: 03+1] kube-proxy: Fix toolforge kube-proxy [puppet] - 10https://gerrit.wikimedia.org/r/554178 (https://phabricator.wikimedia.org/T239670) (owner: 10Bstorm) [23:18:21] well, or it just doesn't associate with the IP-based check of course [23:18:27] either way... 
[23:19:21] we could run puppet on icinga and reschedule the check, but probably not worth it [23:20:39] yeah give me a few and it'll all clear up I think [23:20:56] (03PS4) 10Bstorm: kube-proxy: Fix toolforge kube-proxy [puppet] - 10https://gerrit.wikimedia.org/r/554178 (https://phabricator.wikimedia.org/T239670) [23:20:58] (03PS1) 10BBlack: dns: clamp SOA TTLs to ncache TTLs [dns] - 10https://gerrit.wikimedia.org/r/554185 [23:21:30] (03CR) 10BBlack: [C: 03+2] dns: clamp SOA TTLs to ncache TTLs [dns] - 10https://gerrit.wikimedia.org/r/554185 (owner: 10BBlack) [23:24:47] (03PS1) 10BBlack: wikimedia.org: has 600 for the ncache... [dns] - 10https://gerrit.wikimedia.org/r/554186 [23:25:22] (03CR) 10BBlack: [C: 03+2] wikimedia.org: has 600 for the ncache... [dns] - 10https://gerrit.wikimedia.org/r/554186 (owner: 10BBlack) [23:26:15] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=pdnsrec site=ulsfo https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:27:12] (03PS5) 10Bstorm: kube-proxy: Fix toolforge kube-proxy [puppet] - 10https://gerrit.wikimedia.org/r/554178 (https://phabricator.wikimedia.org/T239670) [23:31:21] RECOVERY - Recursive DNS on 198.35.26.8 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [23:31:25] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: Convert DNS servers to Buster - https://phabricator.wikimedia.org/T239667 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['dns4002.wikimedia.org'] ` and were **ALL** successful. [23:33:13] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:34:58] 10Operations, 10ops-eqiad, 10Cloud-Services, 10cloud-services-team (Kanban): cloudstore1008 crash - Memory error - https://phabricator.wikimedia.org/T239569 (10wiki_willy) a:05wiki_willy→03Jclark-ctr Re-assigning to @Jclark-ctr , who can coordinate with @JHedden during their weekly sync up on Tuesdays.... [23:36:14] (03PS6) 10Bstorm: kube-proxy: Fix toolforge kube-proxy [puppet] - 10https://gerrit.wikimedia.org/r/554178 (https://phabricator.wikimedia.org/T239670) [23:37:22] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2251.codfw.wmnet [23:37:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:37:31] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.03333 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [23:38:03] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` mw2250.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201912022337_dzahn_132438_mw22... [23:38:41] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` mw2252.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201912022338_dzahn_132541_mw22... 
[23:39:02] (03CR) 10Alex Monk: [C: 03+1] kube-proxy: Fix toolforge kube-proxy [puppet] - 10https://gerrit.wikimedia.org/r/554178 (https://phabricator.wikimedia.org/T239670) (owner: 10Bstorm) [23:42:36] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` mw2253.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201912022342_dzahn_132970_mw22... [23:43:08] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` mw2254.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201912022343_dzahn_133083_mw22... [23:44:23] PROBLEM - Wikitech-static main page has content on labweb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [23:46:03] RECOVERY - Wikitech-static main page has content on labweb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 28544 bytes in 3.085 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [23:49:52] (03CR) 10Dzahn: [C: 03+2] installserver: add gerrit1002 with flat/VM partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/553438 (https://phabricator.wikimedia.org/T239151) (owner: 10Dzahn) [23:50:31] (03CR) 10Bstorm: [C: 03+2] kube-proxy: Fix toolforge kube-proxy [puppet] - 10https://gerrit.wikimedia.org/r/554178 (https://phabricator.wikimedia.org/T239670) (owner: 10Bstorm) [23:53:56] (03PS2) 10Dzahn: installserver: add gerrit1002 with flat/VM partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/553438 (https://phabricator.wikimedia.org/T239151) [23:55:16] (03CR) 10Dzahn: [C: 03+2] installserver: add gerrit1002 with flat/VM partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/553438 (https://phabricator.wikimedia.org/T239151) (owner: 
10Dzahn) [23:58:14] (03CR) 10Dzahn: [C: 03+2] assign IPs for gerrit1002 and gerrit-test [dns] - 10https://gerrit.wikimedia.org/r/553437 (https://phabricator.wikimedia.org/T239151) (owner: 10Dzahn) [23:58:19] (03PS5) 10Dzahn: assign IPs for gerrit1002 and gerrit-test [dns] - 10https://gerrit.wikimedia.org/r/553437 (https://phabricator.wikimedia.org/T239151)