[00:00:04] RoanKattouw, Niharika, and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Evening SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200204T0000). [00:00:04] urandom and James_F: A patch you scheduled for Evening SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [00:00:15] I'll SWAT. [00:00:34] o/ [00:00:42] (CR) Jforrester: [C: +2] Configure remainder of testwikis group for kask-session [mediawiki-config] - https://gerrit.wikimedia.org/r/569609 (https://phabricator.wikimedia.org/T243106) (owner: Eevans) [00:00:52] urandom: Is this testable or just a deploy-and-see? [00:01:20] I can sanity test it, yeah [00:01:28] Cool. [00:01:35] make sure I don't get logged off or something [00:01:45] (Merged) jenkins-bot: Configure remainder of testwikis group for kask-session [mediawiki-config] - https://gerrit.wikimedia.org/r/569609 (https://phabricator.wikimedia.org/T243106) (owner: Eevans) [00:01:56] urandom: Live on mwdebug1001. [00:02:37] (PS2) Jforrester: [nlwiki] Enable VisualEditor by default for all users [mediawiki-config] - https://gerrit.wikimedia.org/r/566866 (https://phabricator.wikimedia.org/T161365) [00:02:41] James_F: yup, sanity tested; LGTM [00:02:46] !log depool, varnish-frontend-restart, pool on cp4029 (~242k fds) - T243634 [00:02:47] Kk.
[00:02:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:02:50] T243634: ulsfo varinsh-fe vcache processes overflow on FDs - https://phabricator.wikimedia.org/T243634 [00:03:04] (CR) Jforrester: [C: +2] [nlwiki] Enable VisualEditor by default for all users [mediawiki-config] - https://gerrit.wikimedia.org/r/566866 (https://phabricator.wikimedia.org/T161365) (owner: Jforrester) [00:03:48] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Configure remainder of testwikis group for kask-session T243106 (duration: 00m 58s) [00:03:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:03:51] T243106: Phased rollout of sessionstore to production fleet - https://phabricator.wikimedia.org/T243106 [00:03:59] (Merged) jenkins-bot: [nlwiki] Enable VisualEditor by default for all users [mediawiki-config] - https://gerrit.wikimedia.org/r/566866 (https://phabricator.wikimedia.org/T161365) (owner: Jforrester) [00:05:16] !log gerrit1002 - attempt to manually fix /etc/network interfaces , add IP on interface, reboot [00:05:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:06:15] !log jforrester@deploy1001 Synchronized dblists/visualeditor-nondefault.dblist: [nlwiki] Enable VisualEditor by default for all users T161365 (duration: 00m 58s) [00:06:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:06:18] T161365: Enable VisualEditor by default for all users of the Dutch Wikipedia - https://phabricator.wikimedia.org/T161365 [00:09:22] !log gerrit1002 - replaced ens5 with ens6 in /etc/network/interfaces (IP and row had changed in the past, needed manual fix after reboot and now came back) ; mkfs.ext4 /dev/vdb on new additional 10GB disk.
(T239151 T243983) [00:09:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:09:27] T239151: Gerrit VM to test data migration - https://phabricator.wikimedia.org/T239151 [00:09:28] T243983: Add second virtual hard disk to ganeti gerrit test instance - https://phabricator.wikimedia.org/T243983 [00:10:09] (PS2) Jforrester: scap: Restructure tox.ini so it's easier to test against Python 3 in the future [mediawiki-config] - https://gerrit.wikimedia.org/r/566409 (owner: Legoktm) [00:10:15] (CR) Jforrester: [C: +2] scap: Restructure tox.ini so it's easier to test against Python 3 in the future [mediawiki-config] - https://gerrit.wikimedia.org/r/566409 (owner: Legoktm) [00:10:42] (PS1) Eevans: Configure group0 for kask-transition (multi-write kask/redis) [mediawiki-config] - https://gerrit.wikimedia.org/r/569678 (https://phabricator.wikimedia.org/T243106) [00:11:09] (Merged) jenkins-bot: scap: Restructure tox.ini so it's easier to test against Python 3 in the future [mediawiki-config] - https://gerrit.wikimedia.org/r/566409 (owner: Legoktm) [00:11:44] (PS2) Jforrester: Add Commons to enwiki importsources [mediawiki-config] - https://gerrit.wikimedia.org/r/567305 (https://phabricator.wikimedia.org/T242884) (owner: Ammarpad) [00:11:53] (CR) Jforrester: [C: +2] Add Commons to enwiki importsources [mediawiki-config] - https://gerrit.wikimedia.org/r/567305 (https://phabricator.wikimedia.org/T242884) (owner: Ammarpad) [00:12:39] (CR) Jforrester: [C: +1] "Ha."
[mediawiki-config] - https://gerrit.wikimedia.org/r/566410 (owner: Legoktm) [00:12:53] (Merged) jenkins-bot: Add Commons to enwiki importsources [mediawiki-config] - https://gerrit.wikimedia.org/r/567305 (https://phabricator.wikimedia.org/T242884) (owner: Ammarpad) [00:14:22] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [enwiki] Add Commons as an import source T242884 (duration: 00m 57s) [00:14:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:14:26] T242884: Add commonswiki as a transwiki importsource on enwiki - https://phabricator.wikimedia.org/T242884 [00:15:40] (CR) Jforrester: [C: +1] "Good to go once wmf.18 is everywhere." [mediawiki-config] - https://gerrit.wikimedia.org/r/567155 (https://phabricator.wikimedia.org/T204945) (owner: C. Scott Ananian) [00:15:48] OK, prod cleear. [00:15:59] Also, clear. I've been reading too much Dutch. [00:18:02] cleearly. [00:30:23] Thanks @James_F for merging T242884! [00:30:23] T242884: Add commonswiki as a transwiki importsource on enwiki - https://phabricator.wikimedia.org/T242884 [00:30:25] :) [00:30:39] Operations: Upgrade install servers to Buster - https://phabricator.wikimedia.org/T224576 (Dzahn) a:Dzahn [00:31:45] TheSandDoctor: Happy to help. [00:32:03] Operations: Upgrade install servers to Buster - https://phabricator.wikimedia.org/T224576 (Dzahn) [00:32:08] Sorry you had to go through the process; I didn't really mean for a whole vote to be needed. [00:32:16] (PS1) Dzahn: add IP addresses for new install servers on buster [dns] - https://gerrit.wikimedia.org/r/569679 (https://phabricator.wikimedia.org/T224576) [00:33:03] Well, it’s required from what I understood. Thankfully unanimous support @James_F [00:33:05] :) [00:33:14] I understood/learned [00:33:19] * [00:33:49] We require that there was a brief chat and no-one screamed. We only need a large-scale RfC/etc. when it's a big change. Oh well.
:-) [00:34:59] (PS2) Dzahn: add IP addresses for new install servers on buster [dns] - https://gerrit.wikimedia.org/r/569679 (https://phabricator.wikimedia.org/T224576) [00:35:26] (PS3) Dzahn: add IP addresses for new install servers on buster [dns] - https://gerrit.wikimedia.org/r/569679 (https://phabricator.wikimedia.org/T224576) [00:38:06] (PS1) Dzahn: switch webproxy to new install server IPs [dns] - https://gerrit.wikimedia.org/r/569680 (https://phabricator.wikimedia.org/T224576) [00:40:27] (PS2) Dzahn: switch webproxy CNAMEs to new install servers [dns] - https://gerrit.wikimedia.org/r/569680 (https://phabricator.wikimedia.org/T224576) [00:41:21] (PS1) Dzahn: switch apt.wikimedia.org from install1002 to install1003 [dns] - https://gerrit.wikimedia.org/r/569682 (https://phabricator.wikimedia.org/T224576) [00:42:57] (PS1) Dzahn: remove install1002/install2002 [dns] - https://gerrit.wikimedia.org/r/569683 (https://phabricator.wikimedia.org/T224576) [00:47:58] (PS1) Dzahn: install_server: replace "next" server in DHCP with new install servers [puppet] - https://gerrit.wikimedia.org/r/569684 (https://phabricator.wikimedia.org/T224576) [00:52:27] (PS1) Dzahn: DHCP: add install1003/install2003 to use old install servers at first [puppet] - https://gerrit.wikimedia.org/r/569685 (https://phabricator.wikimedia.org/T224576) [00:53:41] (CR) jerkins-bot: [V: -1] DHCP: add install1003/install2003 to use old install servers at first [puppet] - https://gerrit.wikimedia.org/r/569685 (https://phabricator.wikimedia.org/T224576) (owner: Dzahn) [00:56:20] (PS1) Dzahn: DHCP: remove old install servers and use new servers as next-server [puppet] - https://gerrit.wikimedia.org/r/569686 (https://phabricator.wikimedia.org/T224576) [00:57:21] (PS1) Dzahn: switch install_server and failover in hieradata/common.yaml [puppet] - https://gerrit.wikimedia.org/r/569687 (https://phabricator.wikimedia.org/T224576) [00:57:31]
(CR) jerkins-bot: [V: -1] DHCP: remove old install servers and use new servers as next-server [puppet] - https://gerrit.wikimedia.org/r/569686 (https://phabricator.wikimedia.org/T224576) (owner: Dzahn) [00:59:46] (PS1) Dzahn: wmflib: replace install1002 with install1003 in ipresolve_spec [puppet] - https://gerrit.wikimedia.org/r/569688 (https://phabricator.wikimedia.org/T224576) [01:02:19] Operations, Research, Traffic: Set up git-driven static microsite for wikiworkshop.org - https://phabricator.wikimedia.org/T242374 (bmansurov) @Vgutierrez, thanks for working on this task. Please let me know if I can help move the task forward. [01:02:36] (PS2) Dzahn: install_server: replace "next" server in DHCP with new install servers [puppet] - https://gerrit.wikimedia.org/r/569684 (https://phabricator.wikimedia.org/T224576) [01:03:53] (CR) jerkins-bot: [V: -1] wmflib: replace install1002 with install1003 in ipresolve_spec [puppet] - https://gerrit.wikimedia.org/r/569688 (https://phabricator.wikimedia.org/T224576) (owner: Dzahn) [01:10:40] (PS4) Krinkle: arclamp-log: abort if no message received in 30 minutes [puppet] - https://gerrit.wikimedia.org/r/568732 (https://phabricator.wikimedia.org/T215740) (owner: Ori.livneh) [01:40:29] (PS1) Dzahn: install_server: allow rsyncing from active to replacement servers [puppet] - https://gerrit.wikimedia.org/r/569691 (https://phabricator.wikimedia.org/T224576) [01:41:36] (CR) jerkins-bot: [V: -1] install_server: allow rsyncing from active to replacement servers [puppet] - https://gerrit.wikimedia.org/r/569691 (https://phabricator.wikimedia.org/T224576) (owner: Dzahn) [01:45:15] (CR) Dave Pifke: [C: +1] "LGTM. Also verified Restart=Always is present in systemd unit file so this will have the desired effect."
(1 comment) [puppet] - https://gerrit.wikimedia.org/r/568732 (https://phabricator.wikimedia.org/T215740) (owner: Ori.livneh) [01:47:43] PROBLEM - MariaDB Slave Lag: s3 on db2098 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1113.66 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [02:06:45] (CR) Legoktm: "We could do that - though it's unclear to me though whether this code is actually being used and what exactly the value is for having this" [mediawiki-config] - https://gerrit.wikimedia.org/r/566410 (owner: Legoktm) [02:56:40] (CR) Papaul: [C: +2] DNS: Add mgmt and production DNS for wdqs200[7-8] [dns] - https://gerrit.wikimedia.org/r/566920 (owner: Papaul) [02:57:03] (PS2) Papaul: DNS: Add mgmt and production DNS for wdqs200[7-8] [dns] - https://gerrit.wikimedia.org/r/566920 [03:08:03] RECOVERY - MariaDB Slave Lag: s3 on db2098 is OK: OK slave_sql_lag Replication lag: 0.09 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [03:26:16] Operations, ops-codfw, DC-Ops, decommission: decommission labstore2001.codfw.wmnet and labstore2002.codfw.wmnet - https://phabricator.wikimedia.org/T243329 (Papaul) ` apaul@asw-b-codfw# show | compare [edit interfaces interface-range vlan-cloud-support1-b-codfw] - member-range ge-1/0/0 to ge-... [03:27:33] Operations, ops-codfw, DC-Ops, decommission: decommission labstore2001.codfw.wmnet and labstore2002.codfw.wmnet - https://phabricator.wikimedia.org/T243329 (Papaul) [03:29:01] Operations, ops-codfw, DC-Ops, decommission, cloud-services-team (Hardware): decommission labstore2003.codfw.wmnet and labstore2004.codfw.wmnet - https://phabricator.wikimedia.org/T243319 (Papaul) ` papaul@asw-b-codfw# show | compare [edit interfaces interface-range vlan-cloud-support1-b-cod...
[03:42:47] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [03:48:21] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [03:58:48] Operations, ops-codfw, netops: codfw: Delete cloud interface-range - https://phabricator.wikimedia.org/T244196 (Papaul) [04:00:43] Operations, ops-codfw, DC-Ops, decommission, cloud-services-team (Hardware): decommission labstore2003.codfw.wmnet and labstore2004.codfw.wmnet - https://phabricator.wikimedia.org/T243319 (Papaul) I did not remove ge-8/0/9 from interface-range vlan-cloud-support1-b-codfw see T244196 [04:01:03] Operations, ops-codfw, DC-Ops, decommission, cloud-services-team (Hardware): decommission labstore2003.codfw.wmnet and labstore2004.codfw.wmnet - https://phabricator.wikimedia.org/T243319 (Papaul) [04:12:09] (PS1) Papaul: DNS: Remove mgmt DNS for frdb2001 and alnitak [dns] - https://gerrit.wikimedia.org/r/569701 [04:20:50] (CR) Papaul: [C: +2] DNS: Remove mgmt DNS for frdb2001 and alnitak [dns] - https://gerrit.wikimedia.org/r/569701 (owner: Papaul) [04:22:03] Operations, ops-codfw, DC-Ops, decommission, Patch-For-Review: decommission frdb2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T242983 (Papaul) [04:22:15] Operations, ops-codfw, DC-Ops, decommission, Patch-For-Review: decommission frdb2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T242983 (Papaul) Open→Resolved complete [04:22:50] Operations, ops-codfw, DC-Ops,
decommission, Patch-For-Review: decommission alnitak.frack.codfw.wmnet - https://phabricator.wikimedia.org/T242990 (Papaul) [04:23:05] Operations, ops-codfw, DC-Ops, decommission, Patch-For-Review: decommission alnitak.frack.codfw.wmnet - https://phabricator.wikimedia.org/T242990 (Papaul) Open→Resolved a:Papaul→None complete [04:25:54] Operations, ops-codfw: (OoW) wtp2020: correctable memory errors - https://phabricator.wikimedia.org/T205712 (Papaul) Open→Resolved Closing this task since we haven;t seen any errors since November [04:28:27] Operations, ops-codfw: ms-be2030.mgmt is down - https://phabricator.wikimedia.org/T243646 (Papaul) Open→Resolved a:Papaul This is has been fixed [04:31:23] Operations, ops-codfw, Discovery: elastic2043 has hardware errors that trigger reboots - https://phabricator.wikimedia.org/T243715 (Papaul) a:Papaul [05:07:57] Operations, ops-codfw, Discovery: elastic2043 has hardware errors that trigger reboots - https://phabricator.wikimedia.org/T243715 (Papaul) Open→Resolved Upgrade CPLD to version 1.0.10 no more errors.
system is back up running [05:33:33] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 82 jobs https://wikitech.wikimedia.org/wiki/Backups%23Monitoring [06:28:55] PROBLEM - Check systemd state on ores2009 is CRITICAL: connect to address 10.192.48.90 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:28:57] PROBLEM - Check whether ferm is active by checking the default input chain on ores2004 is CRITICAL: connect to address 10.192.16.64 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:29:01] PROBLEM - Disk space on ores1008 is CRITICAL: connect to address 10.64.48.27 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1008&var-datasource=eqiad+prometheus/ops [06:29:05] PROBLEM - configured eth on ores1001 is CRITICAL: connect to address 10.64.0.51 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:29:05] PROBLEM - MD RAID on ores1006 is CRITICAL: connect to address 10.64.32.15 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:29:05] PROBLEM - MD RAID on ores2008 is CRITICAL: connect to address 10.192.48.89 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:29:05] PROBLEM - MD RAID on ores1009 is CRITICAL: connect to address 10.64.48.28 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:29:06] PROBLEM - MD RAID on ores2001 is CRITICAL: connect to address 10.192.0.12 port 5666: Connection refused 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:29:06] PROBLEM - Check size of conntrack table on ores1001 is CRITICAL: connect to address 10.64.0.51 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:29:06] PROBLEM - MD RAID on ores2002 is CRITICAL: connect to address 10.192.0.18 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:29:07] PROBLEM - puppet last run on ores1001 is CRITICAL: connect to address 10.64.0.51 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:29:09] PROBLEM - Disk space on ores2005 is CRITICAL: connect to address 10.192.32.173 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2005&var-datasource=codfw+prometheus/ops [06:29:11] PROBLEM - configured eth on ores2001 is CRITICAL: connect to address 10.192.0.12 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:29:11] PROBLEM - configured eth on ores2009 is CRITICAL: connect to address 10.192.48.90 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:29:11] PROBLEM - Check whether ferm is active by checking the default input chain on ores1008 is CRITICAL: connect to address 10.64.48.27 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:29:11] PROBLEM - Check systemd state on ores1002 is CRITICAL: connect to address 10.64.0.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:29:13] PROBLEM - Check systemd state on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:29:15] PROBLEM - Disk space on ores2002 is CRITICAL: connect to address 10.192.0.18 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2002&var-datasource=codfw+prometheus/ops [06:29:15] PROBLEM - configured eth on ores2002 is CRITICAL: connect to address 10.192.0.18 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:29:15] PROBLEM - Disk space on ores2009 is CRITICAL: connect to address 10.192.48.90 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2009&var-datasource=codfw+prometheus/ops [06:29:17] PROBLEM - ores uWSGI web app on ores2001 is CRITICAL: connect to address 10.192.0.12 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores [06:29:19] PROBLEM - dhclient process on ores2004 is CRITICAL: connect to address 10.192.16.64 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:29:21] PROBLEM - Check size of conntrack table on ores2004 is CRITICAL: connect to address 10.192.16.64 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:29:23] PROBLEM - Disk space on ores1006 is CRITICAL: connect to address 10.64.32.15 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1006&var-datasource=eqiad+prometheus/ops [06:29:23] PROBLEM - dhclient process on ores2005 is CRITICAL: connect to address 10.192.32.173 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:29:23] PROBLEM - dhclient process on ores2002 is CRITICAL: connect to address 10.192.0.18 port 5666: Connection 
refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:29:29] PROBLEM - dhclient process on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:29:29] PROBLEM - Check systemd state on ores1006 is CRITICAL: connect to address 10.64.32.15 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:29:29] PROBLEM - Check systemd state on ores1008 is CRITICAL: connect to address 10.64.48.27 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:29:29] PROBLEM - MD RAID on ores2009 is CRITICAL: connect to address 10.192.48.90 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:29:33] PROBLEM - Disk space on ores1009 is CRITICAL: connect to address 10.64.48.28 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1009&var-datasource=eqiad+prometheus/ops [06:29:33] PROBLEM - Check whether ferm is active by checking the default input chain on ores1001 is CRITICAL: connect to address 10.64.0.51 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:29:33] PROBLEM - DPKG on ores1001 is CRITICAL: connect to address 10.64.0.51 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:29:33] PROBLEM - MD RAID on ores1001 is CRITICAL: connect to address 10.64.0.51 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:29:35] PROBLEM - MD RAID on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:29:35] PROBLEM - Check whether ferm is active by checking the default input chain on ores2005 is CRITICAL: connect to address 10.192.32.173 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:29:35] PROBLEM - Check whether ferm is active by checking the default input chain on ores2009 is CRITICAL: connect to address 10.192.48.90 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:29:37] PROBLEM - Check size of conntrack table on ores2001 is CRITICAL: connect to address 10.192.0.12 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:29:37] PROBLEM - Check size of conntrack table on ores2009 is CRITICAL: connect to address 10.192.48.90 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:29:37] PROBLEM - ores uWSGI web app on ores2006 is CRITICAL: connect to address 10.192.32.174 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores [06:29:37] PROBLEM - DPKG on ores2001 is CRITICAL: connect to address 10.192.0.12 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:29:37] PROBLEM - Check systemd state on ores2005 is CRITICAL: connect to address 10.192.32.173 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:29:37] PROBLEM - configured eth on ores1004 is CRITICAL: connect to address 10.64.16.95 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:29:38] PROBLEM - DPKG on ores2003 is CRITICAL: connect to address 10.192.16.63 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:29:39] PROBLEM - MD RAID on ores2003 is CRITICAL: connect to address 10.192.16.63 port 5666: Connection refused 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:29:39] PROBLEM - Disk space on ores2008 is CRITICAL: connect to address 10.192.48.89 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2008&var-datasource=codfw+prometheus/ops [06:29:39] PROBLEM - Check size of conntrack table on ores1004 is CRITICAL: connect to address 10.64.16.95 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:29:41] PROBLEM - configured eth on ores2004 is CRITICAL: connect to address 10.192.16.64 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:29:41] PROBLEM - dhclient process on ores2008 is CRITICAL: connect to address 10.192.48.89 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:29:43] PROBLEM - dhclient process on ores2009 is CRITICAL: connect to address 10.192.48.90 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:29:43] PROBLEM - dhclient process on ores2001 is CRITICAL: connect to address 10.192.0.12 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:29:43] PROBLEM - MD RAID on ores2006 is CRITICAL: connect to address 10.192.32.174 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:29:43] PROBLEM - ores uWSGI web app on ores1004 is CRITICAL: connect to address 10.64.16.95 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores [06:29:43] PROBLEM - Check size of conntrack table on ores1009 is CRITICAL: connect to address 10.64.48.28 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:29:47] PROBLEM - MD 
RAID on ores1002 is CRITICAL: connect to address 10.64.0.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:29:49] PROBLEM - configured eth on ores1002 is CRITICAL: connect to address 10.64.0.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:29:51] PROBLEM - puppet last run on ores1006 is CRITICAL: connect to address 10.64.32.15 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:29:51] PROBLEM - dhclient process on ores2006 is CRITICAL: connect to address 10.192.32.174 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:29:51] PROBLEM - Check size of conntrack table on ores2008 is CRITICAL: connect to address 10.192.48.89 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:29:51] PROBLEM - Check size of conntrack table on ores2006 is CRITICAL: connect to address 10.192.32.174 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:29:51] PROBLEM - Disk space on ores2004 is CRITICAL: connect to address 10.192.16.64 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2004&var-datasource=codfw+prometheus/ops [06:29:55] PROBLEM - Check systemd state on ores2008 is CRITICAL: connect to address 10.192.48.89 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:29:55] PROBLEM - puppet last run on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:29:55] PROBLEM - Check whether ferm is active by checking the default input chain on ores1009 is CRITICAL: connect to address 10.64.48.28 port 
5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:29:55] PROBLEM - Check size of conntrack table on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:29:55] PROBLEM - Check size of conntrack table on ores1002 is CRITICAL: connect to address 10.64.0.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:29:55] PROBLEM - Check systemd state on ores1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:29:57] PROBLEM - DPKG on ores2002 is CRITICAL: connect to address 10.192.0.18 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:29:57] PROBLEM - DPKG on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:29:57] PROBLEM - dhclient process on ores1002 is CRITICAL: connect to address 10.64.0.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:29:57] PROBLEM - DPKG on ores1009 is CRITICAL: connect to address 10.64.48.28 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:29:58] PROBLEM - Check systemd state on ores2001 is CRITICAL: connect to address 10.192.0.12 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:29:58] PROBLEM - DPKG on ores1002 is CRITICAL: connect to address 10.64.0.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:29:59] PROBLEM - Check size of conntrack table on ores2002 is CRITICAL: connect to address 10.192.0.18 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:29:59] PROBLEM - ores uWSGI web app on ores1001 is CRITICAL: connect 
to address 10.64.0.51 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores [06:29:59] PROBLEM - ores uWSGI web app on ores1006 is CRITICAL: connect to address 10.64.32.15 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores [06:30:03] PROBLEM - Check whether ferm is active by checking the default input chain on ores2006 is CRITICAL: connect to address 10.192.32.174 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:30:05] PROBLEM - configured eth on ores2008 is CRITICAL: connect to address 10.192.48.89 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:30:05] PROBLEM - ores uWSGI web app on ores2002 is CRITICAL: connect to address 10.192.0.18 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores [06:30:05] PROBLEM - ores uWSGI web app on ores2004 is CRITICAL: connect to address 10.192.16.64 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores [06:30:05] PROBLEM - Disk space on ores1004 is CRITICAL: connect to address 10.64.16.95 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1004&var-datasource=eqiad+prometheus/ops [06:30:05] PROBLEM - Check systemd state on ores2003 is CRITICAL: connect to address 10.192.16.63 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:30:05] PROBLEM - Check systemd state on ores2006 is CRITICAL: connect to address 10.192.32.174 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:30:06] PROBLEM - configured eth on ores2003 is CRITICAL: connect to address 10.192.16.63 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:30:07] PROBLEM - MD RAID on ores2004 is 
CRITICAL: connect to address 10.192.16.64 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:30:09] PROBLEM - MD RAID on ores1004 is CRITICAL: connect to address 10.64.16.95 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:30:11] PROBLEM - DPKG on ores2006 is CRITICAL: connect to address 10.192.32.174 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:30:11] PROBLEM - Check systemd state on ores1009 is CRITICAL: connect to address 10.64.48.28 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:30:13] PROBLEM - configured eth on ores1006 is CRITICAL: connect to address 10.64.32.15 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:30:13] PROBLEM - dhclient process on ores1004 is CRITICAL: connect to address 10.64.16.95 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:30:13] PROBLEM - Check whether ferm is active by checking the default input chain on ores1002 is CRITICAL: connect to address 10.64.0.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:30:15] PROBLEM - ores uWSGI web app on ores1002 is CRITICAL: connect to address 10.64.0.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores [06:30:15] PROBLEM - ores uWSGI web app on ores1009 is CRITICAL: connect to address 10.64.48.28 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores [06:30:15] PROBLEM - dhclient process on ores1001 is CRITICAL: connect to address 10.64.0.51 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:30:15] PROBLEM - Check whether ferm is active by 
checking the default input chain on ores1006 is CRITICAL: connect to address 10.64.32.15 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:30:15] PROBLEM - DPKG on ores1004 is CRITICAL: connect to address 10.64.16.95 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:30:17] PROBLEM - configured eth on ores2005 is CRITICAL: connect to address 10.192.32.173 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:30:19] PROBLEM - configured eth on ores1009 is CRITICAL: connect to address 10.64.48.28 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:30:21] PROBLEM - DPKG on ores1006 is CRITICAL: connect to address 10.64.32.15 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:30:21] PROBLEM - ores uWSGI web app on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores [06:30:23] PROBLEM - Check systemd state on ores1001 is CRITICAL: connect to address 10.64.0.51 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:30:23] PROBLEM - Disk space on ores2001 is CRITICAL: connect to address 10.192.0.12 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2001&var-datasource=codfw+prometheus/ops [06:30:23] PROBLEM - dhclient process on ores1009 is CRITICAL: connect to address 10.64.48.28 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:30:25] PROBLEM - dhclient process on ores1006 is CRITICAL: connect to address 10.64.32.15 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:30:25] PROBLEM - dhclient process on ores2003 is CRITICAL: connect to address 
10.192.16.63 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:30:25] PROBLEM - DPKG on ores2005 is CRITICAL: connect to address 10.192.32.173 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:30:25] PROBLEM - DPKG on ores2009 is CRITICAL: connect to address 10.192.48.90 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:30:25] PROBLEM - Disk space on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1005&var-datasource=eqiad+prometheus/ops [06:30:29] PROBLEM - puppet last run on ores1009 is CRITICAL: connect to address 10.64.48.28 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:30:29] PROBLEM - Check whether ferm is active by checking the default input chain on ores2003 is CRITICAL: connect to address 10.192.16.63 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:30:29] PROBLEM - Check size of conntrack table on ores2005 is CRITICAL: connect to address 10.192.32.173 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:30:31] PROBLEM - DPKG on ores2008 is CRITICAL: connect to address 10.192.48.89 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:30:33] PROBLEM - Check whether ferm is active by checking the default input chain on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:30:33] PROBLEM - Check systemd state on ores2004 is CRITICAL: connect to address 10.192.16.64 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:30:35] PROBLEM - MD RAID on ores2005 is CRITICAL: 
connect to address 10.192.32.173 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:30:35] PROBLEM - puppet last run on ores2001 is CRITICAL: connect to address 10.192.0.12 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:30:35] PROBLEM - Disk space on ores1001 is CRITICAL: connect to address 10.64.0.51 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1001&var-datasource=eqiad+prometheus/ops [06:30:35] PROBLEM - ores on ores1004 is CRITICAL: connect to address 10.64.16.95 and port 8081: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/ores [06:30:35] PROBLEM - Check size of conntrack table on ores1006 is CRITICAL: connect to address 10.64.32.15 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:30:37] PROBLEM - Disk space on ores1002 is CRITICAL: connect to address 10.64.0.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1002&var-datasource=eqiad+prometheus/ops [06:30:41] PROBLEM - Check systemd state on ores1004 is CRITICAL: connect to address 10.64.16.95 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:30:41] PROBLEM - Disk space on ores2003 is CRITICAL: connect to address 10.192.16.63 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2003&var-datasource=codfw+prometheus/ops [06:30:41] PROBLEM - Check whether ferm is active by checking the default input chain on ores2002 is CRITICAL: connect to address 10.192.0.18 port 5666: Connection refused 
https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:30:41] PROBLEM - Check whether ferm is active by checking the default input chain on ores2001 is CRITICAL: connect to address 10.192.0.12 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:30:41] PROBLEM - ores uWSGI web app on ores2005 is CRITICAL: connect to address 10.192.32.173 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores [06:30:43] PROBLEM - DPKG on ores2004 is CRITICAL: connect to address 10.192.16.64 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:30:43] PROBLEM - ores uWSGI web app on ores2009 is CRITICAL: connect to address 10.192.48.90 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores [06:30:43] PROBLEM - Check whether ferm is active by checking the default input chain on ores2008 is CRITICAL: connect to address 10.192.48.89 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:30:45] PROBLEM - ores uWSGI web app on ores2003 is CRITICAL: connect to address 10.192.16.63 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores [06:30:45] PROBLEM - configured eth on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:30:47] PROBLEM - ores uWSGI web app on ores2008 is CRITICAL: connect to address 10.192.48.89 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores [06:30:47] PROBLEM - Check systemd state on ores2002 is CRITICAL: connect to address 10.192.0.18 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:30:57] PROBLEM - Check whether ferm is active by checking the default input chain on ores1004 is CRITICAL: connect to address 10.64.16.95 port 5666: Connection refused 
https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:30:57] RECOVERY - Disk space on ores1008 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1008&var-datasource=eqiad+prometheus/ops [06:31:01] PROBLEM - puppet last run on ores2003 is CRITICAL: connect to address 10.192.16.63 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:31:07] RECOVERY - Check whether ferm is active by checking the default input chain on ores1008 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:31:07] PROBLEM - Check size of conntrack table on ores2003 is CRITICAL: connect to address 10.192.16.63 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:31:23] PROBLEM - puppet last run on ores2004 is CRITICAL: connect to address 10.192.16.64 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:31:23] RECOVERY - Check systemd state on ores1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:31:35] RECOVERY - MD RAID on ores2006 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:31:43] RECOVERY - dhclient process on ores2006 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:31:43] RECOVERY - Check size of conntrack table on ores2006 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:31:53] PROBLEM - puppet last run on ores2005 is CRITICAL: connect to address 10.192.32.173 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun 
[06:31:57] RECOVERY - Check whether ferm is active by checking the default input chain on ores2006 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:31:57] RECOVERY - Check systemd state on ores2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:32:03] RECOVERY - DPKG on ores2006 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:32:23] PROBLEM - puppet last run on ores2008 is CRITICAL: connect to address 10.192.48.89 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:32:31] RECOVERY - Check whether ferm is active by checking the default input chain on ores2002 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:32:39] RECOVERY - Check systemd state on ores2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:32:47] PROBLEM - puppet last run on ores1002 is CRITICAL: connect to address 10.64.0.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:32:51] RECOVERY - MD RAID on ores2008 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:32:51] RECOVERY - MD RAID on ores2002 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:32:59] RECOVERY - Disk space on ores2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2002&var-datasource=codfw+prometheus/ops [06:32:59] RECOVERY - configured eth on ores2002 is OK: OK - interfaces up 
https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:33:07] RECOVERY - dhclient process on ores2002 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:33:17] PROBLEM - puppet last run on ores2009 is CRITICAL: connect to address 10.192.48.90 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:33:23] RECOVERY - Disk space on ores2008 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2008&var-datasource=codfw+prometheus/ops [06:33:25] RECOVERY - dhclient process on ores2008 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:33:31] PROBLEM - puppet last run on ores1004 is CRITICAL: connect to address 10.64.16.95 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:33:35] RECOVERY - Check size of conntrack table on ores2008 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:33:37] RECOVERY - Check systemd state on ores2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:33:39] RECOVERY - DPKG on ores2002 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:33:41] RECOVERY - Check size of conntrack table on ores2002 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:33:47] RECOVERY - configured eth on ores2008 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:34:13] RECOVERY - DPKG on ores2008 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:34:21] RECOVERY - Check systemd state on ores1004 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:34:25] RECOVERY - Check whether ferm is active by checking the default input chain on ores2008 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:34:32] RECOVERY - Check systemd state on ores2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:34:37] RECOVERY - Check whether ferm is active by checking the default input chain on ores1004 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:34:45] RECOVERY - configured eth on ores2009 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:34:51] RECOVERY - Disk space on ores2009 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2009&var-datasource=codfw+prometheus/ops [06:35:03] RECOVERY - MD RAID on ores2009 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:35:09] RECOVERY - Check whether ferm is active by checking the default input chain on ores2009 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:35:11] RECOVERY - Check size of conntrack table on ores2009 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:35:11] RECOVERY - configured eth on ores1004 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:35:11] RECOVERY - Check size of conntrack table on ores1004 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:35:15] RECOVERY - dhclient process on ores2009 is OK: PROCS OK: 0 processes with command name dhclient 
https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:35:21] PROBLEM - Check the NTP synchronisation status of timesyncd on ores2001 is CRITICAL: connect to address 10.192.0.12 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/NTP [06:35:37] RECOVERY - Disk space on ores1004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1004&var-datasource=eqiad+prometheus/ops [06:35:41] RECOVERY - MD RAID on ores1004 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:35:45] RECOVERY - dhclient process on ores1004 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:35:47] RECOVERY - DPKG on ores1004 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:35:57] RECOVERY - DPKG on ores2009 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:36:07] RECOVERY - ores on ores1004 is OK: HTTP OK: HTTP/1.0 200 OK - 6397 bytes in 0.014 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/ores [06:36:25] RECOVERY - puppet last run on ores2001 is OK: OK: Puppet is currently enabled, last run 25 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:36:27] PROBLEM - Check the NTP synchronisation status of timesyncd on ores1009 is CRITICAL: connect to address 10.64.48.28 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/NTP [06:36:33] RECOVERY - MD RAID on ores2001 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:36:35] RECOVERY - configured eth on ores2001 is OK: OK - interfaces up 
https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:36:49] PROBLEM - Check the NTP synchronisation status of timesyncd on ores2003 is CRITICAL: connect to address 10.192.16.63 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/NTP [06:37:01] RECOVERY - Check size of conntrack table on ores2001 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:37:01] RECOVERY - DPKG on ores2001 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:37:05] RECOVERY - dhclient process on ores2001 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:37:43] RECOVERY - Disk space on ores2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2001&var-datasource=codfw+prometheus/ops [06:38:01] RECOVERY - Check whether ferm is active by checking the default input chain on ores2001 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:38:13] RECOVERY - puppet last run on ores2008 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:39:05] RECOVERY - puppet last run on ores2009 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:39:05] RECOVERY - Check size of conntrack table on ores1002 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:39:09] RECOVERY - dhclient process on ores1002 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:39:09] RECOVERY - DPKG on ores1002 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:39:17] RECOVERY - 
Check systemd state on ores2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:39:17] RECOVERY - configured eth on ores2003 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:39:19] RECOVERY - puppet last run on ores1004 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:39:25] RECOVERY - Check whether ferm is active by checking the default input chain on ores1002 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:39:37] RECOVERY - dhclient process on ores2003 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:39:41] RECOVERY - Check whether ferm is active by checking the default input chain on ores2003 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:39:47] RECOVERY - Disk space on ores1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1002&var-datasource=eqiad+prometheus/ops [06:39:51] RECOVERY - Disk space on ores2003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2003&var-datasource=codfw+prometheus/ops [06:40:12] Operations, ORES, Scoring-platform-team (Current): Ores celery OOM event in codfw - https://phabricator.wikimedia.org/T242705 (crusnov) Just a quick note: an extremely similar event occurred today at 06:30 UTC. The entire ORES cluster seems to have had OOM issues for a while. 
[06:40:13] RECOVERY - Disk space on ores2005 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2005&var-datasource=codfw+prometheus/ops [06:40:15] RECOVERY - Check systemd state on ores1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:40:17] RECOVERY - Check size of conntrack table on ores2003 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:40:27] RECOVERY - dhclient process on ores2005 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:40:27] PROBLEM - Check the NTP synchronisation status of timesyncd on ores1001 is CRITICAL: connect to address 10.64.0.51 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/NTP [06:40:39] RECOVERY - Check whether ferm is active by checking the default input chain on ores2005 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:40:41] RECOVERY - Check systemd state on ores2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:40:41] RECOVERY - DPKG on ores2003 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:40:41] RECOVERY - MD RAID on ores2003 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:40:51] RECOVERY - MD RAID on ores1002 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:40:51] RECOVERY - configured eth on ores1002 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:41:01] 
RECOVERY - Check systemd state on ores2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:41:17] RECOVERY - configured eth on ores2005 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:41:23] RECOVERY - Check systemd state on ores1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:41:27] RECOVERY - DPKG on ores2005 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:41:31] PROBLEM - Check the NTP synchronisation status of timesyncd on ores2004 is CRITICAL: connect to address 10.192.16.64 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/NTP [06:41:31] RECOVERY - Check size of conntrack table on ores2005 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:41:35] RECOVERY - Disk space on ores1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1001&var-datasource=eqiad+prometheus/ops [06:41:35] RECOVERY - MD RAID on ores2005 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:41:59] RECOVERY - configured eth on ores1001 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:42:01] RECOVERY - MD RAID on ores1006 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:42:01] RECOVERY - Check size of conntrack table on ores1001 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:42:13] PROBLEM - ores_workers_running on ores2004 is CRITICAL: connect to address 
10.192.16.64 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/ORES [06:42:17] RECOVERY - Disk space on ores1006 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1006&var-datasource=eqiad+prometheus/ops [06:42:23] PROBLEM - ores_workers_running on ores1009 is CRITICAL: connect to address 10.64.48.28 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/ORES [06:42:23] RECOVERY - Check systemd state on ores1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:42:25] PROBLEM - ores_workers_running on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/ORES [06:42:25] RECOVERY - Check whether ferm is active by checking the default input chain on ores1001 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:42:25] RECOVERY - DPKG on ores1001 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:42:25] RECOVERY - MD RAID on ores1001 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:42:43] RECOVERY - puppet last run on ores2003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:43:05] RECOVERY - configured eth on ores1006 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:43:05] PROBLEM - ores_workers_running on ores1006 is CRITICAL: PROCS CRITICAL: 50 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [06:43:05] PROBLEM - ores_workers_running on ores1001 is CRITICAL: PROCS CRITICAL: 80 processes with command name celery 
https://wikitech.wikimedia.org/wiki/ORES [06:43:05] RECOVERY - dhclient process on ores1001 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:43:07] RECOVERY - Check whether ferm is active by checking the default input chain on ores1006 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:43:11] RECOVERY - DPKG on ores1006 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:43:11] PROBLEM - ores_workers_running on ores1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [06:43:15] RECOVERY - dhclient process on ores1006 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:43:25] RECOVERY - Check size of conntrack table on ores1006 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:43:31] RECOVERY - puppet last run on ores2005 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:44:25] RECOVERY - puppet last run on ores1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:44:53] RECOVERY - ores_workers_running on ores1006 is OK: PROCS OK: 91 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [06:44:53] RECOVERY - ores_workers_running on ores1001 is OK: PROCS OK: 91 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [06:45:03] RECOVERY - dhclient process on ores1009 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:45:05] RECOVERY - Disk space on ores1005 is OK: DISK OK 
https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1005&var-datasource=eqiad+prometheus/ops [06:45:13] RECOVERY - Check whether ferm is active by checking the default input chain on ores1005 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:45:25] RECOVERY - configured eth on ores1005 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:45:39] RECOVERY - MD RAID on ores1009 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:45:45] RECOVERY - Check systemd state on ores1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:46:01] RECOVERY - dhclient process on ores1005 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:46:03] RECOVERY - Disk space on ores1009 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1009&var-datasource=eqiad+prometheus/ops [06:46:07] RECOVERY - MD RAID on ores1005 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:46:13] RECOVERY - configured eth on ores2004 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:46:15] RECOVERY - Check size of conntrack table on ores1009 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:46:23] RECOVERY - Disk space on ores2004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2004&var-datasource=codfw+prometheus/ops [06:46:27] RECOVERY - Check whether ferm is active by checking the default input chain on ores1009 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:46:27] RECOVERY - Check size of conntrack table on ores1005 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:46:29] RECOVERY - DPKG on ores1005 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:46:29] RECOVERY - Check systemd state on ores1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:46:29] RECOVERY - DPKG on ores1009 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:46:39] RECOVERY - puppet last run on ores1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:46:39] RECOVERY - MD RAID on ores2004 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:46:43] RECOVERY - Check systemd state on ores1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:46:49] RECOVERY - configured eth on ores1009 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:47:03] RECOVERY - Check systemd state on ores2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:47:13] RECOVERY - DPKG on ores2004 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:47:23] RECOVERY - puppet last run on ores1006 is OK: OK: Puppet is currently enabled, last run 2 minutes ago 
with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:47:23] RECOVERY - Check whether ferm is active by checking the default input chain on ores2004 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:47:27] RECOVERY - puppet last run on ores1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:47:43] RECOVERY - dhclient process on ores2004 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:47:45] RECOVERY - Check size of conntrack table on ores2004 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:47:53] RECOVERY - ores_workers_running on ores1009 is OK: PROCS OK: 91 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [06:47:55] RECOVERY - ores_workers_running on ores1005 is OK: PROCS OK: 91 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [06:48:01] RECOVERY - puppet last run on ores1009 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:48:41] RECOVERY - ores_workers_running on ores1003 is OK: PROCS OK: 91 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [06:48:48] !log force a puppet run on all ores[12] nodes [06:48:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:48:51] RECOVERY - puppet last run on ores2004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:49:33] RECOVERY - ores_workers_running on ores2004 is OK: PROCS OK: 91 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [06:56:51] 10Operations, 10ORES, 
10Scoring-platform-team (Current): Ores celery OOM event in codfw - https://phabricator.wikimedia.org/T242705 (10elukey) Created a dir on ores1002 with logs (`/home/elukey/0402020_celery_oom`). [06:59:34] (03PS1) 10Marostegui: Revert "db2085: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/569705 [07:01:00] (03CR) 10Marostegui: [C: 03+2] Revert "db2085: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/569705 (owner: 10Marostegui) [07:04:51] (03PS1) 10Marostegui: db1105,db2086: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/569706 (https://phabricator.wikimedia.org/T239453) [07:05:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1105:3311, db2086:3317 - T239453', diff saved to https://phabricator.wikimedia.org/P10300 and previous config saved to /var/cache/conftool/dbconfig/20200204-070533-marostegui.json [07:05:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:05:38] T239453: Remove partitions from revision table - https://phabricator.wikimedia.org/T239453 [07:06:11] RECOVERY - Check the NTP synchronisation status of timesyncd on ores2001 is OK: OK: synced at Tue 2020-02-04 07:06:10 UTC. https://wikitech.wikimedia.org/wiki/NTP [07:06:36] (03CR) 10Marostegui: [C: 03+2] db1105,db2086: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/569706 (https://phabricator.wikimedia.org/T239453) (owner: 10Marostegui) [07:07:15] RECOVERY - Check the NTP synchronisation status of timesyncd on ores1009 is OK: OK: synced at Tue 2020-02-04 07:07:14 UTC. https://wikitech.wikimedia.org/wiki/NTP [07:07:37] RECOVERY - Check the NTP synchronisation status of timesyncd on ores2003 is OK: OK: synced at Tue 2020-02-04 07:07:36 UTC. 
https://wikitech.wikimedia.org/wiki/NTP [07:08:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1091 - T232446', diff saved to https://phabricator.wikimedia.org/P10301 and previous config saved to /var/cache/conftool/dbconfig/20200204-070804-marostegui.json [07:08:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:08:08] T232446: Compress new Wikibase tables - https://phabricator.wikimedia.org/T232446 [07:08:59] !log Compress db1091 - T232446 [07:09:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:11:17] RECOVERY - Check the NTP synchronisation status of timesyncd on ores1001 is OK: OK: synced at Tue 2020-02-04 07:11:15 UTC. https://wikitech.wikimedia.org/wiki/NTP [07:12:21] RECOVERY - Check the NTP synchronisation status of timesyncd on ores2004 is OK: OK: synced at Tue 2020-02-04 07:12:18 UTC. https://wikitech.wikimedia.org/wiki/NTP [07:14:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1126 - T232446', diff saved to https://phabricator.wikimedia.org/P10302 and previous config saved to /var/cache/conftool/dbconfig/20200204-071420-marostegui.json [07:14:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:14:24] T232446: Compress new Wikibase tables - https://phabricator.wikimedia.org/T232446 [07:15:01] !log Compress db1126 - T232446 [07:15:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:16:51] (03PS1) 10Marostegui: db1091,db1126: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/569717 (https://phabricator.wikimedia.org/T232446) [07:17:53] (03CR) 10Marostegui: [C: 03+2] db1091,db1126: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/569717 (https://phabricator.wikimedia.org/T232446) (owner: 10Marostegui) [07:35:40] (03CR) 10Marostegui: [C: 03+2] control-mariadb-*: Change version [software] - 10https://gerrit.wikimedia.org/r/568978 (https://phabricator.wikimedia.org/T242702) 
(owner: 10Marostegui) [07:36:33] !log Upgrade Mariadb on db1107 from 10.4.11 to 10.4.12 T242702 [07:36:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:36:37] T242702: Test MariaDB 10.4 in production - https://phabricator.wikimedia.org/T242702 [07:39:09] (03PS2) 10Alexandros Kosiaris: wikifeeds: Add tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/567100 [07:39:11] (03PS2) 10Alexandros Kosiaris: WIP: wikifeeds: Move to the debug functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/567103 [07:39:26] (03CR) 10jerkins-bot: [V: 04-1] WIP: wikifeeds: Move to the debug functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/567103 (owner: 10Alexandros Kosiaris) [07:44:24] (03PS3) 10Alexandros Kosiaris: WIP: wikifeeds: Move to the debug functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/567103 [07:53:53] 10Operations, 10ops-eqiad, 10serviceops: (Need By Dec 20) rack/setup/install mw13[49-84].eqiad.wmnet - https://phabricator.wikimedia.org/T236437 (10Joe) p:05Normal→03High Can we please expedite this? We really need these servers to join rotation. [07:56:38] 10Operations, 10ORES, 10Scoring-platform-team (Current): Ores celery OOM event in codfw - https://phabricator.wikimedia.org/T242705 (10akosiaris) [[https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&from=1580756738510&to=1580802703802&var-datasource=eqiad%20prometheus%2Fops&var-cluster=ores&v... 
[07:59:22] (03PS2) 10Muehlenhoff: Enable base::service_auto_restart for nfacctd [puppet] - 10https://gerrit.wikimedia.org/r/569567 (https://phabricator.wikimedia.org/T135991) [08:02:23] (03CR) 10Filippo Giunchedi: [C: 03+1] hieradata: put memcached gutter hosts in cluster memcached_gutter [puppet] - 10https://gerrit.wikimedia.org/r/569579 (https://phabricator.wikimedia.org/T240684) (owner: 10Effie Mouzeli) [08:02:51] (03CR) 10Muehlenhoff: [C: 03+2] Enable base::service_auto_restart for nfacctd [puppet] - 10https://gerrit.wikimedia.org/r/569567 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [08:07:23] (03PS1) 10Vgutierrez: dns-01: Support custom DNS server port [software/acme-chief] - 10https://gerrit.wikimedia.org/r/569754 (https://phabricator.wikimedia.org/T240614) [08:08:03] 10Operations, 10Core Platform Team, 10MediaWiki-General, 10serviceops: siteinfo api calls should be cached for N minutes on the caching layer - https://phabricator.wikimedia.org/T244204 (10Joe) p:05Triage→03High [08:08:09] (03PS4) 10Alexandros Kosiaris: WIP: wikifeeds: Move to the debug functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/567103 [08:10:58] (03CR) 10jerkins-bot: [V: 04-1] dns-01: Support custom DNS server port [software/acme-chief] - 10https://gerrit.wikimedia.org/r/569754 (https://phabricator.wikimedia.org/T240614) (owner: 10Vgutierrez) [08:13:46] !log Deploy schema change on test2wiki - T243804 [08:13:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:50] T243804: Apply: Add primary key to pagetriage_page_tags - https://phabricator.wikimedia.org/T243804 [08:16:07] (03CR) 10Filippo Giunchedi: [C: 03+2] swift: move logs to /srv/log [puppet] - 10https://gerrit.wikimedia.org/r/566303 (owner: 10Filippo Giunchedi) [08:16:17] (03PS2) 10Filippo Giunchedi: swift: move logs to /srv/log [puppet] - 10https://gerrit.wikimedia.org/r/566303 [08:16:44] !log Deploy schema change on testwiki - T243804 [08:16:46] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:21] (03CR) 10Elukey: [C: 03+1] "I am ok with the change, even if the cluster name 'memcached_gutter' is a bit generic. Maybe memcached_wancache_gutter or similar could be" [puppet] - 10https://gerrit.wikimedia.org/r/569579 (https://phabricator.wikimedia.org/T240684) (owner: 10Effie Mouzeli) [08:22:42] (03PS4) 10Filippo Giunchedi: install_server: switch ms-fe to standard partman recipes [puppet] - 10https://gerrit.wikimedia.org/r/566291 (https://phabricator.wikimedia.org/T156955) [08:23:58] (03CR) 10Filippo Giunchedi: [C: 03+2] install_server: switch ms-fe to standard partman recipes [puppet] - 10https://gerrit.wikimedia.org/r/566291 (https://phabricator.wikimedia.org/T156955) (owner: 10Filippo Giunchedi) [08:24:13] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/566291 (https://phabricator.wikimedia.org/T156955) (owner: 10Filippo Giunchedi) [08:24:59] (03PS2) 10Vgutierrez: dns-01: Support custom DNS server port [software/acme-chief] - 10https://gerrit.wikimedia.org/r/569754 (https://phabricator.wikimedia.org/T240614) [08:25:01] (03PS1) 10Vgutierrez: acme_chief: Fix newly reported pylint R1724 (no-else-continue) [software/acme-chief] - 10https://gerrit.wikimedia.org/r/569756 [08:27:59] (03PS1) 10Alexandros Kosiaris: scaffold: Remove last appbase_url_port mention [deployment-charts] - 10https://gerrit.wikimedia.org/r/569757 [08:30:29] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal: Add Itamar Givon to the ldap/wmde group - https://phabricator.wikimedia.org/T244148 (10ItamarWMDE) Hello @RStallman-legalteam, I signed the NDA you emailed me. What would be the next step? 
Thanks [08:30:43] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal: Add Itamar Givon to the ldap/wmde group - https://phabricator.wikimedia.org/T244148 (10ItamarWMDE) [08:32:17] (03PS5) 10Joal: Add profile::analytics::refinery::job::import_wikidata_entites_dumps [puppet] - 10https://gerrit.wikimedia.org/r/567954 (https://phabricator.wikimedia.org/T209655) [08:34:30] (03CR) 10jerkins-bot: [V: 04-1] Add profile::analytics::refinery::job::import_wikidata_entites_dumps [puppet] - 10https://gerrit.wikimedia.org/r/567954 (https://phabricator.wikimedia.org/T209655) (owner: 10Joal) [08:35:41] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 83 jobs https://wikitech.wikimedia.org/wiki/Backups%23Monitoring [08:36:44] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloud: hiera: puppetmaster: refactor hiera (for VM instances) [puppet] - 10https://gerrit.wikimedia.org/r/569230 (https://phabricator.wikimedia.org/T229441) (owner: 10Arturo Borrero Gonzalez) [08:38:33] (03CR) 10Alexandros Kosiaris: [C: 03+2] scaffold: Remove last appbase_url_port mention [deployment-charts] - 10https://gerrit.wikimedia.org/r/569757 (owner: 10Alexandros Kosiaris) [08:38:50] (03Merged) 10jenkins-bot: scaffold: Remove last appbase_url_port mention [deployment-charts] - 10https://gerrit.wikimedia.org/r/569757 (owner: 10Alexandros Kosiaris) [08:40:06] (03PS3) 10Alexandros Kosiaris: wikifeeds: Add tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/567100 [08:40:08] (03PS5) 10Alexandros Kosiaris: WIP: wikifeeds: Move to the debug functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/567103 [08:46:33] !log Deploy schema change on enwiki codfw - T243804 [08:46:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:36] T243804: Apply: Add primary key to pagetriage_page_tags - https://phabricator.wikimedia.org/T243804 [08:49:02] (03PS6) 10Alexandros Kosiaris: WIP: wikifeeds: Move to the debug functionality [deployment-charts] - 
10https://gerrit.wikimedia.org/r/567103 [08:49:12] (03CR) 10jerkins-bot: [V: 04-1] WIP: wikifeeds: Move to the debug functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/567103 (owner: 10Alexandros Kosiaris) [08:49:41] (03PS6) 10Joal: Add profile::analytics::refinery::job::import_wikidata_entites_dumps [puppet] - 10https://gerrit.wikimedia.org/r/567954 (https://phabricator.wikimedia.org/T209655) [08:50:53] (03PS7) 10Alexandros Kosiaris: WIP: wikifeeds: Move to the debug functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/567103 [08:51:10] (03CR) 10jerkins-bot: [V: 04-1] WIP: wikifeeds: Move to the debug functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/567103 (owner: 10Alexandros Kosiaris) [08:53:31] 10Operations, 10Traffic: ulsfo varnish-fe vcache processes overflow on FDs - https://phabricator.wikimedia.org/T243634 (10Vgutierrez) [08:56:45] !log Deploy schema change on enwiki eqiad host by host - T243804 [08:56:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:49] T243804: Apply: Add primary key to pagetriage_page_tags - https://phabricator.wikimedia.org/T243804 [09:01:34] 10Operations, 10Traffic, 10Wikimedia-Logstash, 10observability, 10User-fgiunchedi: Port varnishlog consumers to log to syslog / logging infra - https://phabricator.wikimedia.org/T227108 (10fgiunchedi) >>! In T227108#5844341, @fgiunchedi wrote: > Had to revert in https://gerrit.wikimedia.org/r/c/operation... 
[09:04:21] (03PS2) 10Filippo Giunchedi: cassandra: use wmflib::secret for binary files [puppet] - 10https://gerrit.wikimedia.org/r/569570 (https://phabricator.wikimedia.org/T242585) [09:04:23] (03PS3) 10Filippo Giunchedi: wip: cassandra logs to logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/569564 (https://phabricator.wikimedia.org/T242585) [09:07:45] !log Upgrade s3 codfw master db2105 - T239791 [09:07:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:49] T239791: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 [09:08:19] !log manually refreshing OCSP stapling response for non-canonical-redirects-3 - T243948 [09:08:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:22] T243948: SSL CRITICAL - OCSP staple validity for www.wikipedia.bg has X seconds left - https://phabricator.wikimedia.org/T243948 [09:11:45] (03PS8) 10Alexandros Kosiaris: WIP: wikifeeds: Move to the debug functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/567103 [09:11:59] (03CR) 10jerkins-bot: [V: 04-1] WIP: wikifeeds: Move to the debug functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/567103 (owner: 10Alexandros Kosiaris) [09:23:41] 10Operations, 10observability: Upgrade Grafana to 6.4 - https://phabricator.wikimedia.org/T244208 (10fgiunchedi) [09:25:07] (03CR) 10Muehlenhoff: [C: 03+1] add IP addresses for new install servers on buster [dns] - 10https://gerrit.wikimedia.org/r/569679 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [09:29:39] (03CR) 10Muehlenhoff: [C: 04-2] switch apt.wikimedia.org from install1002 to install1003 (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/569682 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [09:31:31] (03CR) 10Alexandros Kosiaris: [C: 03+2] wikifeeds: Add tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/567100 (owner: 10Alexandros 
Kosiaris) [09:31:54] (03Merged) 10jenkins-bot: wikifeeds: Add tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/567100 (owner: 10Alexandros Kosiaris) [09:33:48] 10Operations, 10Analytics: Host analytics1073 is DOWN - https://phabricator.wikimedia.org/T244064 (10elukey) 05Open→03Resolved a:03elukey The host has been stable since then, let's re-open if it re-happens. [09:35:18] 10Operations, 10Puppet, 10DBA, 10User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (10Marostegui) [09:36:59] 10Operations, 10Analytics: analytics1061 is down - https://phabricator.wikimedia.org/T244081 (10elukey) 05Open→03Resolved a:03elukey Didn't re-occur, will re-open if needed. [09:41:52] 10Operations, 10DBA, 10cloud-services-team, 10wikitech.wikimedia.org: Upgrade and restart m5 master (db1133) - https://phabricator.wikimedia.org/T244209 (10Marostegui) [09:42:06] 10Operations, 10DBA, 10cloud-services-team, 10wikitech.wikimedia.org: Upgrade and restart m5 master (db1133) - https://phabricator.wikimedia.org/T244209 (10Marostegui) p:05Triage→03Normal [09:46:26] (03CR) 10Ema: [C: 03+1] "One nit, +1 otherwise" (031 comment) [software/acme-chief] - 10https://gerrit.wikimedia.org/r/569754 (https://phabricator.wikimedia.org/T240614) (owner: 10Vgutierrez) [09:53:28] (03PS4) 10Ema: Reapply "ATS: unset Accept-Encoding" [puppet] - 10https://gerrit.wikimedia.org/r/564005 (https://phabricator.wikimedia.org/T242478) [09:57:18] (03PS1) 10Arturo Borrero Gonzalez: Revert "cloud: hiera: puppetmaster: refactor hiera (for VM instances)" [puppet] - 10https://gerrit.wikimedia.org/r/569880 [09:58:39] (03CR) 10jerkins-bot: [V: 04-1] Revert "cloud: hiera: puppetmaster: refactor hiera (for VM instances)" [puppet] - 10https://gerrit.wikimedia.org/r/569880 (owner: 10Arturo Borrero Gonzalez) [09:59:37] (03PS2) 10Arturo Borrero Gonzalez: Revert "cloud: hiera: puppetmaster: refactor hiera (for VM instances)" [puppet] - 
10https://gerrit.wikimedia.org/r/569880 [10:01:56] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] Revert "cloud: hiera: puppetmaster: refactor hiera (for VM instances)" [puppet] - 10https://gerrit.wikimedia.org/r/569880 (owner: 10Arturo Borrero Gonzalez) [10:05:29] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: replace onboard NIC in kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T236327 (10elukey) @Cmjohnson @Jclark-ctr let's sync about next steps whenever you have time! [10:13:44] (03CR) 10Jcrespo: [C: 03+1] "All clear to me now." (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525147 (owner: 10Aaron Schulz) [10:16:55] 10Operations, 10Core Platform Team, 10MediaWiki-General, 10serviceops: siteinfo api calls should be cached for N minutes on the caching layer - https://phabricator.wikimedia.org/T244204 (10Schnark) I see a `"time": ""` in the output by `siteinfo`. I don't know if and how anyone uses that... [10:26:10] (03PS18) 10Giuseppe Lavagetto: lvs::monitor: fix most technical debt [puppet] - 10https://gerrit.wikimedia.org/r/564690 [10:35:15] !log cp4032: varnish-frontend-restart T243634 [10:35:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:19] T243634: ulsfo varnish-fe vcache processes overflow on FDs - https://phabricator.wikimedia.org/T243634 [10:44:11] jynus: got your email; it's just that I didn't update the whole doc and just part of it and I got confused 2 years later (heh) [10:46:07] PROBLEM - Host blog.wikimedia.org is DOWN: /bin/ping -n -U -w 15 -c 5 blog.wikimedia.org [10:46:26] (03PS4) 10Vgutierrez: dns-01: Support custom DNS server port [software/acme-chief] - 10https://gerrit.wikimedia.org/r/569754 (https://phabricator.wikimedia.org/T240614) [10:47:07] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "This change introduces the following additional checks:" [puppet] - 10https://gerrit.wikimedia.org/r/564690 (owner: 10Giuseppe Lavagetto) [10:47:29] RECOVERY - Host blog.wikimedia.org is 
UP: PING OK - Packet loss = 0%, RTA = 0.55 ms [10:48:45] (03CR) 10Vgutierrez: dns-01: Support custom DNS server port (031 comment) [software/acme-chief] - 10https://gerrit.wikimedia.org/r/569754 (https://phabricator.wikimedia.org/T240614) (owner: 10Vgutierrez) [10:56:29] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "The python code LGTM, but this is an nrpe check." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/566708 (https://phabricator.wikimedia.org/T242023) (owner: 10Hnowlan) [11:07:49] 10Operations, 10Analytics: Add metadata to puppet about kerberos accounts - https://phabricator.wikimedia.org/T235418 (10elukey) 05Open→03Resolved a:03elukey This has been done: users with existing kerberos principals have been backfilled, and now we have in place a procedure to add a flag to puppet each... [11:32:53] (03CR) 10Effie Mouzeli: [C: 03+1] "If we are absolutely sure this is not needed anywhere else, I am happy to merge it" [puppet] - 10https://gerrit.wikimedia.org/r/526255 (https://phabricator.wikimedia.org/T227734) (owner: 10Jforrester) [11:33:44] (03CR) 10Effie Mouzeli: [C: 03+1] "Is there anything else we need in order to merge this?" 
[puppet] - 10https://gerrit.wikimedia.org/r/564175 (https://phabricator.wikimedia.org/T242606) (owner: 10Dzahn) [11:41:55] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 82 jobs https://wikitech.wikimedia.org/wiki/Backups%23Monitoring [11:42:10] checking [11:42:11] (03CR) 10Effie Mouzeli: "> I am ok with the change, even if the cluster name 'memcached_gutter'" [puppet] - 10https://gerrit.wikimedia.org/r/569579 (https://phabricator.wikimedia.org/T240684) (owner: 10Effie Mouzeli) [11:48:24] backups are a bit saturated right now due to ongoing full backups + crossdc transmission [11:49:02] will not ack the alert to monitor ongoing state, in case it gets worse, for now it should be ok to leave it as is [11:50:14] I get timeouts at nowiki [11:52:15] jeblad: could you give us more info, just visiting pages? [11:52:53] are you in Europe? [11:53:15] Don't you know where I am? =) [11:53:28] Norway [11:53:59] I am in the centrum of the universe, everything revolves around me! [11:54:07] I was just asking to research which dc I should look at [11:55:29] Should I provide a trace? [11:56:09] an mtr would be great [11:56:23] It seems like pages load, but diff hangs [11:56:23] you can redact your IP address from it [11:56:44] jeblad: that is important, could you give us a URL that doesn't work for you? [11:57:08] https://no.wikipedia.org/w/index.php?title=Narvik&diff=20162573&oldid=20162555&diffmode=source [11:57:24] It loads, but is really slow [11:58:02] jeblad: we will add it to a task, I think we have a similar issue [11:58:13] I can reproduce, so likely not a connectivity issue but edge or application [11:58:23] Ok, then I shall not nag anymore! =) [11:58:34] takk [12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200204T1200). 
[12:00:04] No GERRIT patches in the queue for this window AFAICS. [12:00:37] 10Operations, 10Performance-Team, 10serviceops, 10Wikimedia-production-error: Page takes over 15s to load: https://en.wikipedia.org/w/index.php?title=European_Union&type=revision&diff=938561921&oldid=938557616 - https://phabricator.wikimedia.org/T244058 (10jijiki) [12:00:45] (03PS2) 10Mvolz: Remove config for xisbn [deployment-charts] - 10https://gerrit.wikimedia.org/r/567023 [12:00:55] jeblad: sorry for the inconveniences, I can see it is high priority ^ hopefully will be solved soon [12:01:54] I posted a note at our market place, so people know it is worked on. Thanks! [12:02:09] thanks to you for the report! [12:06:09] (03PS5) 10Hnowlan: mediawiki: check mw versions match those on the deploy server [puppet] - 10https://gerrit.wikimedia.org/r/566708 (https://phabricator.wikimedia.org/T242023) [12:06:46] (03PS3) 10Mvolz: Remove config for xisbn and update [deployment-charts] - 10https://gerrit.wikimedia.org/r/567023 [12:07:16] (03CR) 10Hnowlan: mediawiki: check mw versions match those on the deploy server (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/566708 (https://phabricator.wikimedia.org/T242023) (owner: 10Hnowlan) [12:08:32] (03CR) 10Urbanecm: [C: 04-1] "> Patch Set 1:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/567306 (https://phabricator.wikimedia.org/T243509) (owner: 10Ammarpad) [12:12:46] 10Operations, 10Performance-Team, 10serviceops, 10Wikimedia-production-error: Wiki diffs take over 15s to load - https://phabricator.wikimedia.org/T244058 (10jcrespo) [12:26:17] (03CR) 10Vgutierrez: [C: 03+2] dns-01: Support custom DNS server port [software/acme-chief] - 10https://gerrit.wikimedia.org/r/569754 (https://phabricator.wikimedia.org/T240614) (owner: 10Vgutierrez) [12:30:24] (03PS1) 10Vgutierrez: Release 0.22 [software/acme-chief] - 10https://gerrit.wikimedia.org/r/570035 (https://phabricator.wikimedia.org/T240614) [12:35:09] (03CR) 10Vgutierrez: [C: 03+2] 
Release 0.22 [software/acme-chief] - 10https://gerrit.wikimedia.org/r/570035 (https://phabricator.wikimedia.org/T240614) (owner: 10Vgutierrez) [12:37:59] 10Operations, 10ops-codfw, 10netops: codfw: Delete cloud interface-range - https://phabricator.wikimedia.org/T244196 (10ayounsi) Yep, it's fine to delete it if there are no more member interfaces. [12:42:28] (03PS1) 10Ayounsi: Depool ulsfo for routers upgrade [dns] - 10https://gerrit.wikimedia.org/r/570036 (https://phabricator.wikimedia.org/T242947) [12:45:36] (03PS1) 10Vgutierrez: debian: Add release 0.22 to changelog [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/570037 (https://phabricator.wikimedia.org/T240614) [12:46:28] (03PS2) 10Vgutierrez: debian: Add release 0.22 to changelog [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/570037 (https://phabricator.wikimedia.org/T240614) [12:46:29] (03PS1) 10Vgutierrez: acme_chief: Fix newly reported pylint R1724 (no-else-continue) [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/570038 [12:46:33] (03PS1) 10Vgutierrez: dns-01: Support custom DNS server port [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/570039 (https://phabricator.wikimedia.org/T240614) [12:48:12] (03CR) 10Ayounsi: [C: 03+2] Depool ulsfo for routers upgrade [dns] - 10https://gerrit.wikimedia.org/r/570036 (https://phabricator.wikimedia.org/T242947) (owner: 10Ayounsi) [12:49:08] !log depool ulsfo for routers upgrade [12:49:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:00] (03CR) 10Vgutierrez: [C: 03+2] acme_chief: Fix newly reported pylint R1724 (no-else-continue) [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/570038 (owner: 10Vgutierrez) [12:50:07] (03CR) 10Vgutierrez: [C: 03+2] dns-01: Support custom DNS server port [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/570039 (https://phabricator.wikimedia.org/T240614) (owner: 10Vgutierrez) [12:51:17] (03PS3) 
10Vgutierrez: debian: Add release 0.22 to changelog [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/570037 (https://phabricator.wikimedia.org/T240614) [12:51:19] (03PS1) 10Vgutierrez: Release 0.22 [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/570040 (https://phabricator.wikimedia.org/T240614) [12:56:06] (03CR) 10Vgutierrez: [C: 03+2] Release 0.22 [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/570040 (https://phabricator.wikimedia.org/T240614) (owner: 10Vgutierrez) [12:57:08] (03CR) 10Vgutierrez: [C: 03+2] debian: Add release 0.22 to changelog [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/570037 (https://phabricator.wikimedia.org/T240614) (owner: 10Vgutierrez) [13:01:13] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Bump the version string in Chart.yaml as well" [deployment-charts] - 10https://gerrit.wikimedia.org/r/567023 (owner: 10Mvolz) [13:01:21] (03PS1) 10Vgutierrez: acme_chief: Hit auth dns servers on port 5353 [puppet] - 10https://gerrit.wikimedia.org/r/570041 (https://phabricator.wikimedia.org/T240614) [13:01:47] (03CR) 10Jdlrobson: [C: 04-1] Enable lead paragraph in user namespace on nlwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/562486 (https://phabricator.wikimedia.org/T242030) (owner: 10Ammarpad) [13:04:52] (03PS2) 10Vgutierrez: acme_chief: Hit auth dns servers on port 5353 [puppet] - 10https://gerrit.wikimedia.org/r/570041 (https://phabricator.wikimedia.org/T240614) [13:06:22] (03CR) 10Vgutierrez: "pcc looks good: https://puppet-compiler.wmflabs.org/compiler1003/20594/" [puppet] - 10https://gerrit.wikimedia.org/r/570041 (https://phabricator.wikimedia.org/T240614) (owner: 10Vgutierrez) [13:09:01] !log restart cr4-ulsfo for upgrade [13:09:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:53] Uptime: 188d18h27m53s not too bad [13:09:59] (03PS1) 10Alexandros Kosiaris: wikifeeds: Remove appbase_url_port [deployment-charts] - 
10https://gerrit.wikimedia.org/r/570043 [13:10:01] (03PS1) 10Alexandros Kosiaris: wikifeeds: Move to the debug functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/570044 [13:10:03] (03PS1) 10Alexandros Kosiaris: _scaffold: Fix YAML syntax error [deployment-charts] - 10https://gerrit.wikimedia.org/r/570045 [13:10:31] and I have it booting in the console, should be up in less than 20min [13:10:46] !log uploaded acme-chief 0.22 to apt.wm.o (buster) - T240614 [13:10:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:50] T240614: Fix acme-chief DNS validation correctly - https://phabricator.wikimedia.org/T240614 [13:12:44] (03Abandoned) 10Alexandros Kosiaris: WIP: wikifeeds: Move to the debug functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/567103 (owner: 10Alexandros Kosiaris) [13:12:51] (03CR) 10Alexandros Kosiaris: [C: 03+2] wikifeeds: Move to the debug functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/570044 (owner: 10Alexandros Kosiaris) [13:12:54] (03CR) 10Alexandros Kosiaris: [C: 03+2] wikifeeds: Remove appbase_url_port [deployment-charts] - 10https://gerrit.wikimedia.org/r/570043 (owner: 10Alexandros Kosiaris) [13:12:59] (03CR) 10Alexandros Kosiaris: [C: 03+2] _scaffold: Fix YAML syntax error [deployment-charts] - 10https://gerrit.wikimedia.org/r/570045 (owner: 10Alexandros Kosiaris) [13:13:27] (03Merged) 10jenkins-bot: wikifeeds: Remove appbase_url_port [deployment-charts] - 10https://gerrit.wikimedia.org/r/570043 (owner: 10Alexandros Kosiaris) [13:13:29] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:13:29] (03Merged) 10jenkins-bot: wikifeeds: Move to the debug functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/570044 (owner: 10Alexandros Kosiaris) [13:13:31] (03Merged) 
10jenkins-bot: _scaffold: Fix YAML syntax error [deployment-charts] - 10https://gerrit.wikimedia.org/r/570045 (owner: 10Alexandros Kosiaris) [13:13:55] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:14:25] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:15:20] and it's back [13:16:15] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:16:17] routing stuff slowly coming up too [13:16:43] nice job Juniper, less than 10min [13:17:09] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 135, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:17:35] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:19:05] (03CR) 10Filippo Giunchedi: [C: 03+1] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/569579 (https://phabricator.wikimedia.org/T240684) (owner: 10Effie Mouzeli) [13:21:35] everything is up except Equinix BGP sessions [13:23:48] !log upgrading acme-chief to version 0.22 - T240614 [13:23:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:55] T240614: Fix acme-chief DNS validation correctly - https://phabricator.wikimedia.org/T240614 [13:30:40] (03CR) 10Vgutierrez: [C: 03+2] acme_chief: Hit auth dns servers on port 5353 [puppet] - 10https://gerrit.wikimedia.org/r/570041 (https://phabricator.wikimedia.org/T240614) (owner: 10Vgutierrez) [13:32:14] so I'm seeing all the broadcast traffic on that Equinix interface, but no unicast [13:33:36] ifPhysAddress: ec3873753cbc -> ec38737538c4 [13:33:57] upgrading the router 
software changes the MAC address of the interface [13:33:58] great. [13:33:58] (03CR) 10Effie Mouzeli: [V: 03+1] "PCC https://puppet-compiler.wmflabs.org/compiler1002/20592/" [puppet] - 10https://gerrit.wikimedia.org/r/569579 (https://phabricator.wikimedia.org/T240684) (owner: 10Effie Mouzeli) [13:34:04] (03CR) 10Effie Mouzeli: [V: 03+1 C: 03+2] hieradata: put memcached gutter hosts in cluster memcached_gutter [puppet] - 10https://gerrit.wikimedia.org/r/569579 (https://phabricator.wikimedia.org/T240684) (owner: 10Effie Mouzeli) [13:34:25] 10Operations, 10Traffic, 10Patch-For-Review: Fix acme-chief DNS validation correctly - https://phabricator.wikimedia.org/T240614 (10Vgutierrez) 05Open→03Resolved a:05BBlack→03Vgutierrez Solved in acme-chief 0.22, now we can set an arbitrary DNS port to validate the DNS-01 challenges on the acme-chief... [13:34:31] (03PS4) 10Mvolz: Remove config for xisbn and update [deployment-charts] - 10https://gerrit.wikimedia.org/r/567023 [13:34:35] (03PS5) 10Mvolz: Remove config for xisbn and update [deployment-charts] - 10https://gerrit.wikimedia.org/r/567023 [13:35:25] alright forced the old MAC [13:35:30] and now they're coming up [13:35:45] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [13:35:58] (03PS6) 10Mvolz: Remove config for xisbn and update [deployment-charts] - 10https://gerrit.wikimedia.org/r/567023 [13:36:30] (03PS7) 10Mvolz: Remove config for xisbn and update [deployment-charts] - 10https://gerrit.wikimedia.org/r/567023 [13:36:53] (03CR) 10Ema: [C: 03+1] acme_chief: Hit auth dns servers on port 5353 [puppet] - 10https://gerrit.wikimedia.org/r/570041 
(https://phabricator.wikimedia.org/T240614) (owner: 10Vgutierrez) [13:36:56] !log restart cr3-ulsfo for software upgrade [13:36:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:32] (03CR) 10Mvolz: Remove config for xisbn and update (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/567023 (owner: 10Mvolz) [13:39:27] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [13:39:37] pwpw [13:39:40] arg [13:39:42] sigh [13:40:54] and it's up [13:41:38] 10Operations, 10Traffic: acme-chief should be able to refresh OCSP stapling response even if the renewal process fails - https://phabricator.wikimedia.org/T244232 (10Vgutierrez) [13:41:49] \o/ [13:41:55] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 54, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:42:05] 10Operations, 10Traffic: acme-chief should be able to refresh OCSP stapling response even if the renewal process fails - https://phabricator.wikimedia.org/T244232 (10Vgutierrez) p:05Triage→03Normal [13:42:35] (for the task just filed, not for the latency alert) [13:42:35] 10Operations, 10Traffic: SSL CRITICAL - OCSP staple validity for www.wikipedia.bg has X seconds left - https://phabricator.wikimedia.org/T243948 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez After solving T240614, acme-chief has been able to renew non-canonical-redirect-3 so OCSP stapling refresh is fi... 
[13:42:35] routing still taking its time to fully come up [13:43:46] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 56, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:44:57] alright looks fully up [13:45:09] RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 19, down: 2, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:45:31] (03PS1) 10Ayounsi: Revert "Depool ulsfo for routers upgrade" [dns] - 10https://gerrit.wikimedia.org/r/570049 [13:46:07] going to give it 20min or so before repool [13:47:29] (03PS1) 10Joal: Update AQS druid snapshot to 2020-01 [puppet] - 10https://gerrit.wikimedia.org/r/570050 [13:48:37] PROBLEM - mediawiki originals uploads -hourly- for codfw on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005:9112 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [13:48:47] PROBLEM - mediawiki originals uploads -hourly- for eqiad on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005:9112 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [13:50:11] (03PS1) 10KartikMistry: Enable CX in te, kn, gu, mr and pawiki as a default tool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570051 (https://phabricator.wikimedia.org/T243271) [13:53:22] (03PS4) 10Ammarpad: Enable lead paragraph in user namespace on nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/562486 (https://phabricator.wikimedia.org/T242030) [13:54:13] (03CR) 10jerkins-bot: [V: 04-1] Enable lead paragraph in user namespace on nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/562486 
(https://phabricator.wikimedia.org/T242030) (owner: 10Ammarpad) [13:54:33] 10Operations, 10Traffic, 10Wikimedia-Incident: cp3050 depooled due to explosion in CPU usage and inuse sockets - https://phabricator.wikimedia.org/T241001 (10ema) 05Open→03Resolved text@esams has had no similar issues since we disabled xdebug in December, closing. [13:54:49] (03CR) 10Jdlrobson: "feel free to deploy this change with the comma removed!" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/562486 (https://phabricator.wikimedia.org/T242030) (owner: 10Ammarpad) [13:55:53] (03CR) 10Elukey: [C: 03+2] Update AQS druid snapshot to 2020-01 [puppet] - 10https://gerrit.wikimedia.org/r/570050 (owner: 10Joal) [13:56:07] PROBLEM - Check correctness of the icinga configuration on icinga1001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga [13:57:49] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [13:58:59] (03PS5) 10Ammarpad: Enable lead paragraph in user namespace on nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/562486 (https://phabricator.wikimedia.org/T242030) [13:59:18] 10Operations, 10Traffic: Remove debug proxies once all Varnish backends are gone - https://phabricator.wikimedia.org/T237932 (10ema) [13:59:20] 10Operations, 10serviceops: decom debug proxies (was: Migrate debug proxies to Stretch/Buster) - https://phabricator.wikimedia.org/T224567 (10ema) [14:00:30] !log elukey@cumin1001 START - Cookbook sre.aqs.roll-restart [14:00:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:29] RECOVERY - High 
average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [14:03:45] !log elukey@cumin1001 END (PASS) - Cookbook sre.aqs.roll-restart (exit_code=0) [14:03:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:53] 10Operations, 10Acme-chief, 10Traffic: acme-chief is unable to renew certificates against LE staging environment - https://phabricator.wikimedia.org/T244236 (10Vgutierrez) [14:04:55] 10Operations, 10Acme-chief, 10Traffic: acme-chief is unable to renew certificates against LE staging environment - https://phabricator.wikimedia.org/T244236 (10Vgutierrez) p:05Triage→03High [14:06:21] (03PS1) 10Muehlenhoff: Switch to yaml.safe_load to loading update spec files [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/570054 [14:07:06] (03CR) 10Ayounsi: [C: 03+2] Revert "Depool ulsfo for routers upgrade" [dns] - 10https://gerrit.wikimedia.org/r/570049 (owner: 10Ayounsi) [14:07:28] !log repool ulsfo [14:07:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:03] (03PS1) 10Alexandros Kosiaris: wikifeeds: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/570055 [14:14:18] (03CR) 10Alexandros Kosiaris: [C: 03+2] wikifeeds: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/570055 (owner: 10Alexandros Kosiaris) [14:14:35] (03Merged) 10jenkins-bot: wikifeeds: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/570055 (owner: 10Alexandros Kosiaris) [14:15:57] (03CR) 10Ammarpad: "Template:Welcome (in English) redirects to same thing in Chinese and that's what the extension substitutes as the message. 
It needs not b" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/567306 (https://phabricator.wikimedia.org/T243509) (owner: 10Ammarpad) [14:16:15] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'wikifeeds' for release 'staging' . [14:16:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:41] (03CR) 10Ammarpad: "*The only thing, this parameter does, is redefining the users eligible to receive the message. Before, only locally created, but now inclu" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/567306 (https://phabricator.wikimedia.org/T243509) (owner: 10Ammarpad) [14:17:54] !log akosiaris@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'wikifeeds' for release 'production' . [14:17:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:46] !log deploy new wikifeeds chart that is consistent with the current scaffolding approach. No code deploy though. [14:18:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:03] (03PS2) 10Ammarpad: Enable new user message for auto-created accounts on zh_classical wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/567306 (https://phabricator.wikimedia.org/T243509) [14:19:33] 10Operations, 10DBA, 10cloud-services-team, 10wikitech.wikimedia.org: Upgrade and restart m5 master (db1133) - https://phabricator.wikimedia.org/T244209 (10Andrew) For a few seconds interruption I wouldn't expect this to be very disruptive. If you schedule it in my morning (e.g. 15:00 UTC) then I can send... [14:23:20] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'wikifeeds' for release 'production' . [14:23:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:43] 10Operations, 10DBA, 10Wikimedia-Etherpad: Upgrade and restart m1 master (db1135) - https://phabricator.wikimedia.org/T244238 (10ayounsi) Anytime works for LibreNMS. 
[14:24:55] 10Operations, 10DBA, 10cloud-services-team, 10wikitech.wikimedia.org: Upgrade and restart m5 master (db1133) - https://phabricator.wikimedia.org/T244209 (10Marostegui) >>! In T244209#5848154, @Andrew wrote: > For a few seconds interruption I wouldn't expect this to be very disruptive. If you schedule it i... [14:25:14] 10Operations, 10DBA, 10Wikimedia-Etherpad: Upgrade and restart m1 master (db1135) - https://phabricator.wikimedia.org/T244238 (10Marostegui) >>! In T244238#5848193, @ayounsi wrote: > Anytime works for LibreNMS. Thank you! <3 [14:26:05] (03CR) 10Alexandros Kosiaris: [C: 03+2] Remove config for xisbn and update [deployment-charts] - 10https://gerrit.wikimedia.org/r/567023 (owner: 10Mvolz) [14:26:20] (03PS8) 10Alexandros Kosiaris: Remove config for xisbn and update [deployment-charts] - 10https://gerrit.wikimedia.org/r/567023 (owner: 10Mvolz) [14:27:18] 10Operations, 10DBA, 10cloud-services-team, 10wikitech.wikimedia.org: Upgrade and restart m5 master (db1133) - https://phabricator.wikimedia.org/T244209 (10Andrew) >>! In T244209#5848195, @Marostegui wrote: > > Thank you! > What about Monday 10th at 15:00 UTC? Works for me! I'll put it on the team calen... [14:27:33] (03CR) 10Elukey: Add profile::analytics::refinery::job::import_wikidata_entites_dumps (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/567954 (https://phabricator.wikimedia.org/T209655) (owner: 10Joal) [14:27:48] (03PS1) 10Alexandros Kosiaris: citoid: build package, update repo index [deployment-charts] - 10https://gerrit.wikimedia.org/r/570058 [14:28:00] 10Operations, 10DBA, 10cloud-services-team, 10wikitech.wikimedia.org: Upgrade and restart m5 master (db1133) - https://phabricator.wikimedia.org/T244209 (10Marostegui) >>! In T244209#5848200, @Andrew wrote: >>>! In T244209#5848195, @Marostegui wrote: >> >> Thank you! >> What about Monday 10th at 15:00 UTC... 
[14:28:32] (03CR) 10Alexandros Kosiaris: [C: 03+2] citoid: build package, update repo index [deployment-charts] - 10https://gerrit.wikimedia.org/r/570058 (owner: 10Alexandros Kosiaris) [14:28:51] (03Merged) 10jenkins-bot: citoid: build package, update repo index [deployment-charts] - 10https://gerrit.wikimedia.org/r/570058 (owner: 10Alexandros Kosiaris) [14:29:11] (03PS1) 10Effie Mouzeli: hieradata: put memcached gutter cluster in monitoring.yaml [puppet] - 10https://gerrit.wikimedia.org/r/570059 (https://phabricator.wikimedia.org/T240684) [14:30:06] (03CR) 10jerkins-bot: [V: 04-1] hieradata: put memcached gutter cluster in monitoring.yaml [puppet] - 10https://gerrit.wikimedia.org/r/570059 (https://phabricator.wikimedia.org/T240684) (owner: 10Effie Mouzeli) [14:32:34] (03PS2) 10Effie Mouzeli: hieradata: put memcached gutter cluster in monitoring.yaml [puppet] - 10https://gerrit.wikimedia.org/r/570059 (https://phabricator.wikimedia.org/T240684) [14:33:43] (03CR) 10Filippo Giunchedi: [C: 03+1] hieradata: put memcached gutter cluster in monitoring.yaml [puppet] - 10https://gerrit.wikimedia.org/r/570059 (https://phabricator.wikimedia.org/T240684) (owner: 10Effie Mouzeli) [14:33:52] 10Operations, 10DBA, 10cloud-services-team, 10wikitech.wikimedia.org: Upgrade and restart m5 master (db1133) - https://phabricator.wikimedia.org/T244209 (10Andrew) I'll do it now. 
[14:34:31] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [14:34:37] (03CR) 10Alexandros Kosiaris: [C: 03+2] arclamp-log: abort if no message received in 30 minutes [puppet] - 10https://gerrit.wikimedia.org/r/568732 (https://phabricator.wikimedia.org/T215740) (owner: 10Ori.livneh) [14:35:23] (03CR) 10Joal: Add profile::analytics::refinery::job::import_wikidata_entites_dumps (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/567954 (https://phabricator.wikimedia.org/T209655) (owner: 10Joal) [14:35:56] (03PS7) 10Joal: Add profile::analytics::refinery::job::import_wikidata_entites_dumps [puppet] - 10https://gerrit.wikimedia.org/r/567954 (https://phabricator.wikimedia.org/T209655) [14:38:05] (03CR) 10jerkins-bot: [V: 04-1] Add profile::analytics::refinery::job::import_wikidata_entites_dumps [puppet] - 10https://gerrit.wikimedia.org/r/567954 (https://phabricator.wikimedia.org/T209655) (owner: 10Joal) [14:39:41] 10Operations, 10DBA, 10cloud-services-team, 10wikitech.wikimedia.org: Upgrade and restart m5 master (db1133) - https://phabricator.wikimedia.org/T244209 (10Marostegui) Thank you! 
[14:40:13] (03CR) 10Effie Mouzeli: [V: 03+1] "PCC https://puppet-compiler.wmflabs.org/compiler1003/20600/" [puppet] - 10https://gerrit.wikimedia.org/r/570059 (https://phabricator.wikimedia.org/T240684) (owner: 10Effie Mouzeli) [14:40:18] (03CR) 10Effie Mouzeli: [V: 03+1 C: 03+2] hieradata: put memcached gutter cluster in monitoring.yaml [puppet] - 10https://gerrit.wikimedia.org/r/570059 (https://phabricator.wikimedia.org/T240684) (owner: 10Effie Mouzeli) [14:40:41] (03PS8) 10Joal: Add profile::analytics::refinery::job::import_wikidata_entites_dumps [puppet] - 10https://gerrit.wikimedia.org/r/567954 (https://phabricator.wikimedia.org/T209655) [14:40:51] 10Operations, 10DBA, 10cloud-services-team, 10wikitech.wikimedia.org: Upgrade and restart m5 master (db1133) - https://phabricator.wikimedia.org/T244209 (10Andrew) [x] email sent to wikitech-l and cloud-announce [14:42:31] (03PS1) 10Effie Mouzeli: hieradata: fix new lines in monitoring.yaml [puppet] - 10https://gerrit.wikimedia.org/r/570062 [14:43:24] (03PS1) 10Alexandros Kosiaris: helpers: Move most charts to common_templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/570064 [14:43:43] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [14:44:21] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 83 jobs https://wikitech.wikimedia.org/wiki/Backups%23Monitoring [14:45:56] 10Operations, 10Wikimedia-Mailing-lists, 10Upstream: Allow applying spam filter fules before sender rules in Mailman filtering - https://phabricator.wikimedia.org/T170443 (10Aklapper) p:05Triage→03Lowest [14:49:29] RECOVERY - Check correctness of the icinga 
configuration on icinga1001 is OK: Icinga configuration is correct https://wikitech.wikimedia.org/wiki/Icinga [14:55:20] 10Operations, 10Wikimedia-Mailing-lists: Allow list admins to train spam filters - https://phabricator.wikimedia.org/T244241 (10Aklapper) [14:55:42] (03PS6) 10Giuseppe Lavagetto: lvs::monitor: use unique identifiers for services [puppet] - 10https://gerrit.wikimedia.org/r/565290 [14:58:18] 10Operations, 10Traffic: ulsfo varnish-fe vcache processes overflow on FDs - https://phabricator.wikimedia.org/T243634 (10ema) We now know that this is 100% traffic induced, and the culprit seems to be FortiGate. Compare the last 24 hours of FD growth: {F31547496} And ulsfo requests with UA: FortiGate durin... [15:01:03] RECOVERY - mediawiki originals uploads -hourly- for eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [15:01:21] RECOVERY - mediawiki originals uploads -hourly- for codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [15:01:59] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "Only in prod:" [puppet] - 10https://gerrit.wikimedia.org/r/565290 (owner: 10Giuseppe Lavagetto) [15:04:08] 10Operations, 10DBA, 10Wikimedia-Etherpad: Upgrade and restart m1 master (db1135) - https://phabricator.wikimedia.org/T244238 (10Trizek-WMF) > @Trizek-WMF I have CC'ed you here as last time you posted a message on technews for etherpad (T231403#5464235) - let's wait for a concrete day/time. You just need to... 
[15:08:13] (03PS1) 10Andrew Bogott: Buster/Mitaka vms: include python3 versions of openstack clients [puppet] - 10https://gerrit.wikimedia.org/r/570066 [15:08:36] 10Operations, 10Performance-Team, 10serviceops, 10Wikimedia-production-error: Wiki diffs take over 15s to load - https://phabricator.wikimedia.org/T244058 (10thcipriani) > Both are running 1.35.0-wmf.16 at the moment. When this task was filled on 2020-02-01 the wikipedia wikis were on 1.35.0-wmf.15 due to... [15:19:38] ls [15:19:41] err [15:20:12] ls: cannot access 'err': no such file or directory [15:20:16] (03CR) 10Andrew Bogott: [C: 03+2] Buster/Mitaka vms: include python3 versions of openstack clients [puppet] - 10https://gerrit.wikimedia.org/r/570066 (owner: 10Andrew Bogott) [15:25:01] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [15:30:05] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:31:47] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:32:40] (03PS1) 10Giuseppe Lavagetto: profile::configmaster: use wmflib::service functions [puppet] - 10https://gerrit.wikimedia.org/r/570070 [15:32:42] (03PS1) 10Giuseppe Lavagetto: profile::lvs::realserver: use wmflib::service::fetch [puppet] - 10https://gerrit.wikimedia.org/r/570071 [15:32:44] (03PS1) 10Giuseppe Lavagetto: profile::cache::base: remove the useless inclusion of lvs::configuration [puppet] - 10https://gerrit.wikimedia.org/r/570072 [15:39:26] 10Operations, 10ops-eqiad, 10vm-requests: rack/setup/install ganeti10([09]|1[0-8]).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (10Cmjohnson) [15:40:29] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [15:43:35] PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [15:43:50] 10Operations, 10serviceops, 10Performance-Team (Radar): decom debug proxies (was: Migrate debug proxies to Stretch/Buster) - https://phabricator.wikimedia.org/T224567 (10Krinkle) [15:43:54] (03CR) 10Ppchelko: [C: 03+1] Configure group0 for kask-transition (multi-write kask/redis) (031 comment) 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/569678 (https://phabricator.wikimedia.org/T243106) (owner: 10Eevans) [15:44:13] 10Operations, 10serviceops, 10Performance-Team (Radar): decom debug proxies (was: Migrate debug proxies to Stretch/Buster) - https://phabricator.wikimedia.org/T224567 (10Krinkle) [15:44:29] 10Operations, 10serviceops, 10Performance-Team (Radar): decom debug proxies (was: Migrate debug proxies to Stretch/Buster) - https://phabricator.wikimedia.org/T224567 (10Krinkle) Thanks! Less is more :) [15:48:14] (03PS1) 10Jbond: wmflablibs: add new facts for wmflabs [puppet] - 10https://gerrit.wikimedia.org/r/570079 (https://phabricator.wikimedia.org/T244222) [15:49:20] (03CR) 10jerkins-bot: [V: 04-1] wmflablibs: add new facts for wmflabs [puppet] - 10https://gerrit.wikimedia.org/r/570079 (https://phabricator.wikimedia.org/T244222) (owner: 10Jbond) [15:50:27] (03CR) 10Ema: [C: 03+2] Reapply "ATS: unset Accept-Encoding" [puppet] - 10https://gerrit.wikimedia.org/r/564005 (https://phabricator.wikimedia.org/T242478) (owner: 10Ema) [15:50:27] (03PS2) 10Jbond: wmflablibs: add new facts for wmflabs [puppet] - 10https://gerrit.wikimedia.org/r/570079 (https://phabricator.wikimedia.org/T244222) [15:52:42] 10Operations, 10vm-requests: rack/setup/install ganeti10([09]|1[0-8]).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (10Cmjohnson) a:05Cmjohnson→03Dzahn @Dzahn ganeti1017 is now ready. I am assigning this to you and removing ops-eqiad tag. [15:57:51] (03PS1) 10Cmjohnson: Adding netboot and dhcp file for cescout1001 [puppet] - 10https://gerrit.wikimedia.org/r/570081 (https://phabricator.wikimedia.org/T239250) [15:59:22] RECOVERY - High average POST latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [16:00:00] (03CR) 10Andrew Bogott: [C: 03+1] "Looks ok to me. I don't love the eqiad=>eqiad1 and codfw=>codfw1dev mapping under .wmflabs but with luck we'll have wmflabs out of the wa" [puppet] - 10https://gerrit.wikimedia.org/r/570079 (https://phabricator.wikimedia.org/T244222) (owner: 10Jbond) [16:01:17] (03CR) 10Jbond: "> Patch Set 2: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/570079 (https://phabricator.wikimedia.org/T244222) (owner: 10Jbond) [16:01:58] PROBLEM - traffic_server backend process restarted on cp2001 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=codfw+prometheus/ops&var-instance=cp2001&var-layer=backend [16:02:04] PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [16:02:13] !log cp: rolling ats-backend-restart to unset Accept-Encoding before sending origin server requests T242478 [16:02:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:17] T242478: Production load.php spends ~ 10% time doing output compression within PHP - https://phabricator.wikimedia.org/T242478 [16:03:22] RECOVERY - traffic_server backend process restarted on cp2001 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server 
https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=codfw+prometheus/ops&var-instance=cp2001&var-layer=backend [16:03:30] PROBLEM - phpfpm_up reduced availability on icinga1001 is CRITICAL: 0.7849 le 0.8 https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_exporters_%22up%22_metrics_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:03:44] PROBLEM - Nginx local proxy to apache on mw1326 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:03:46] PROBLEM - PHP7 rendering on mw1275 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:03:46] PROBLEM - Nginx local proxy to apache on mw1331 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:03:48] PROBLEM - Apache HTTP on mw1326 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:03:54] PROBLEM - PHP7 rendering on mw1271 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:03:56] PROBLEM - proton endpoints health on proton2001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/pro [16:04:00] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers 
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:04:20] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([mw1344.eqiad.wmnet, mw1272.eqiad.wmnet, mw1320.eqiad.wmnet, mw1250.eqiad.wmnet, mw1266.eqiad.wmnet, mw1223.eqiad.wmnet, mw1282.eqiad.wmnet, mw1333.eqiad.wmnet, mw1241.eqiad.wmnet, mw1221.eqiad.wmnet, mw1317.eqiad.wmnet, mw1316.eqiad.wmnet, mw1325.eqiad.wmnet, mw1312.eqiad.wmnet, mw1347.eqiad.wmnet, mw1342.eqiad.wmnet, mw1341 [16:04:20] 313.eqiad.wmnet, mw1346.eqiad.wmnet, mw1246.eqiad.wmnet, mw1322.eqiad.wmnet, mw1288.eqiad.wmnet, mw1281.eqiad.wmnet, mw1314.eqiad.wmnet, mw1323.eqiad.wmnet, mw1348.eqiad.wmnet, mw1233.eqiad.wmnet, mw1327.eqiad.wmnet, mw1245.eqiad.wmnet, mw1225.eqiad.wmnet, mw1255.eqiad.wmnet, mw1257.eqiad.wmnet, mw1244.eqiad.wmnet, mw1238.eqiad.wmnet, mw1234.eqiad.wmnet, mw1235.eqiad.wmnet, mw1231.eqiad.wmnet, mw1285.eqiad.wmnet, mw1230.eqiad.wmn [16:04:20] wmnet, mw1232.eqiad.wmnet, mw1242.eqiad.wmnet, mw1240.eqiad.wmnet, mw1227.eqiad.wmnet, mw1 https://wikitech.wikimedia.org/wiki/PyBal [16:04:22] PROBLEM - Apache HTTP on mw1267 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:04:26] PROBLEM - PHP7 rendering on mw1262 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:04:28] PROBLEM - Apache HTTP on mw1333 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:04:34] PROBLEM - Apache HTTP on mw1262 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:04:34] PROBLEM - Nginx local proxy to apache on mw1261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:04:34] PROBLEM - 
PHP7 rendering on mw1332 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:04:38] PROBLEM - PHP7 rendering on mw1246 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:04:40] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) timed out before a response was received: /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) timed out before a response was received: /{ [16:04:40] obile-html/{title} (Get page content HTML for test page) timed out before a response was received: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received: /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was [16:04:40] n}/v1/page/media-list/{title} (Get media list from test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [16:04:40] PROBLEM - proton endpoints health on proton1001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not [16:04:40] xistent title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton 
[16:04:46] PROBLEM - Nginx local proxy to apache on mw1327 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:04:48] PROBLEM - Nginx local proxy to apache on mw1328 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:04:48] PROBLEM - Apache HTTP on mw1328 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:04:52] PROBLEM - Apache HTTP on mw1331 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:04:52] PROBLEM - Apache HTTP on mw1252 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:04:52] PROBLEM - Apache HTTP on mw1261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:04:54] PROBLEM - Nginx local proxy to apache on mw1267 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:04:56] PROBLEM - Apache HTTP on mw1324 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:04:56] PROBLEM - Apache HTTP on mw1270 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:04:56] PROBLEM - Apache HTTP on mw1320 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:04:56] PROBLEM - Apache HTTP on mw1274 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:04:56] PROBLEM - Apache HTTP on mw1271 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:04:57] PROBLEM - PHP7 rendering on mw1330 is CRITICAL: CRITICAL - Socket timeout after 
10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:04:57] PROBLEM - PHP7 rendering on mw1321 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:04:58] PROBLEM - Nginx local proxy to apache on mw1271 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:05:00] PROBLEM - Apache HTTP on mw1332 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:05:00] PROBLEM - Nginx local proxy to apache on mw1329 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:05:00] PROBLEM - PHP7 rendering on mw1328 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:05:00] PROBLEM - Nginx local proxy to apache on mw1268 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:05:04] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not [16:05:04] xistent title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [16:05:12] PROBLEM - LVS HTTPS IPv6 #page on text-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [16:05:14] PROBLEM - LVS HTTPS IPv6 #page on text-lb.eqsin.wikimedia.org_ipv6 is CRITICAL: CRITICAL - 
Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [16:05:14] PROBLEM - Apache HTTP on mw1327 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:05:15] PROBLEM - Apache HTTP on mw1266 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:05:15] PROBLEM - PHP7 rendering on mw1268 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:05:15] PROBLEM - Nginx local proxy to apache on mw1297 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:05:15] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before [16:05:15] eceived: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received: /{domain}/v1/article/cre [16:05:16] eed} (article.creation.morelike - bad article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:05:16] PROBLEM - Nginx local proxy to apache on mw1264 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers 
[16:05:17] PROBLEM - Apache HTTP on mw1269 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:05:17] PROBLEM - PHP7 rendering on mw1272 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:05:18] PROBLEM - Nginx local proxy to apache on mw1321 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:05:18] PROBLEM - Apache HTTP on mw1249 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:05:19] PROBLEM - Apache HTTP on mw1265 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:05:19] PROBLEM - Nginx local proxy to apache on mw1262 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:05:20] PROBLEM - Nginx local proxy to apache on mw1330 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:05:20] RECOVERY - PHP7 rendering on mw1275 is OK: HTTP OK: HTTP/1.1 200 OK - 76257 bytes in 8.763 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:05:21] PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out be [16:05:21] as received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:05:22] PROBLEM - Nginx local proxy to apache on mw1240 is CRITICAL: CRITICAL - Socket timeout 
after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:05:22] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [16:05:26] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before [16:05:26] eceived: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received: /{domain}/v1/article/cre [16:05:26] eed} (article.creation.morelike - bad article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:05:26] PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content 
for April 29, 2016) timed out be [16:05:26] as received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:05:27] PROBLEM - PHP7 rendering on mw1244 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:05:27] PROBLEM - Nginx local proxy to apache on mw1275 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:05:28] PROBLEM - Nginx local proxy to apache on mw1320 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:05:28] PROBLEM - PHP7 rendering on mw1326 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:05:29] PROBLEM - Apache HTTP on mw1272 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:05:29] PROBLEM - Nginx local proxy to apache on mw1332 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:05:30] PROBLEM - Apache HTTP on mw1253 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:05:30] PROBLEM - Nginx local proxy to apache on mw1269 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:05:31] PROBLEM - Apache HTTP on mw1264 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:05:31] Uh-oh [16:05:32] PROBLEM - Nginx local proxy to apache on mw1272 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:05:32] PROBLEM - PHP7 rendering on mw1263 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:05:32] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before [16:05:33] eceived: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received: /{domain}/v1/article/cre [16:05:33] eed} (article.creation.morelike - bad article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:05:34] PROBLEM - proton endpoints health on proton2002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/pro [16:05:34] PROBLEM - Apache HTTP on mw1325 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:05:35] PROBLEM - PHP7 rendering on mw1327 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:05:35] PROBLEM - Nginx local proxy to apache on mw1322 is 
CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:05:36] PROBLEM - PHP7 rendering on mw1331 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:05:36] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - appservers-https_443: Servers mw1265.eqiad.wmnet, mw1256.eqiad.wmnet, mw1242.eqiad.wmnet, mw1240.eqiad.wmnet, mw1246.eqiad.wmnet, mw1253.eqiad.wmnet, mw1322.eqiad.wmnet, mw1331.eqiad.wmnet, mw1333.eqiad.wmnet, mw1323.eqiad.wmnet, mw1249.eqiad.wmnet, mw1328.eqiad.wmnet, mw1243.eqiad.wmnet, mw1245.eqiad.wmnet, mw1272.eqiad.wmnet, mw1263.eqiad. [16:05:37] ad.wmnet, mw1329.eqiad.wmnet, mw1320.eqiad.wmnet, mw1250.eqiad.wmnet, mw1266.eqiad.wmnet, mw1326.eqiad.wmnet, mw1268.eqiad.wmnet, mw1319.eqiad.wmnet, mw1241.eqiad.wmnet, mw1324.eqiad.wmnet, mw1255.eqiad.wmnet, mw1257.eqiad.wmnet, mw1251.eqiad.wmnet, mw1244.eqiad.wmnet, mw1238.eqiad.wmnet, mw1321.eqiad.wmnet, mw1269.eqiad.wmnet, mw1325.eqiad.wmnet, mw1274.eqiad.wmnet, mw1254.eqiad.wmnet, mw1248.eqiad.wmnet, mw1252.eqiad.wmnet, mw1 [16:05:37] mw1330.eqiad.wmnet, mw1247.eqiad.wmnet, mw1239.eqiad.wmnet are marked down but pooled: api_80: Servers mw1232.eqiad.wmnet, mw1344.eqiad.wmnet, mw1227.eqiad.wmnet, mw1229.eqiad.wmnet, mw https://wikitech.wikimedia.org/wiki/PyBal [16:05:38] PROBLEM - Apache HTTP on mw1347 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:05:38] PROBLEM - PHP7 rendering on mw1333 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:05:39] PROBLEM - PHP7 rendering on mw1322 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:05:39] PROBLEM - PyBal backends health check on 
lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - appservers-https_443: Servers mw1265.eqiad.wmnet, mw1256.eqiad.wmnet, mw1242.eqiad.wmnet, mw1240.eqiad.wmnet, mw1246.eqiad.wmnet, mw1253.eqiad.wmnet, mw1267.eqiad.wmnet, mw1322.eqiad.wmnet, mw1331.eqiad.wmnet, mw1319.eqiad.wmnet, mw1323.eqiad.wmnet, mw1249.eqiad.wmnet, mw1327.eqiad.wmnet, mw1328.eqiad.wmnet, mw1243.eqiad.wmnet, mw1245.eqiad. [16:05:40] ad.wmnet, mw1263.eqiad.wmnet, mw1258.eqiad.wmnet, mw1329.eqiad.wmnet, mw1320.eqiad.wmnet, mw1271.eqiad.wmnet, mw1264.eqiad.wmnet, mw1250.eqiad.wmnet, mw1266.eqiad.wmnet, mw1326.eqiad.wmnet, mw1321.eqiad.wmnet, mw1333.eqiad.wmnet, mw1241.eqiad.wmnet, mw1324.eqiad.wmnet, mw1255.eqiad.wmnet, mw1330.eqiad.wmnet, mw1257.eqiad.wmnet, mw1251.eqiad.wmnet, mw1244.eqiad.wmnet, mw1274.eqiad.wmnet, mw1268.eqiad.wmnet, mw1269.eqiad.wmnet, mw1 [16:05:40] mw1254.eqiad.wmnet, mw1248.eqiad.wmnet, mw1238.eqiad.wmnet, mw1252.eqiad.wmnet, mw1261.eqiad.wmnet, mw1270.eqiad.wmnet, mw1275.eqiad.wmnet, mw1262.eqiad.wmnet, mw1332.eqiad.wmnet, mw124 https://wikitech.wikimedia.org/wiki/PyBal [16:05:41] PROBLEM - High average POST latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var- [16:05:41] PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out be [16:05:42] as received 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:05:42] PROBLEM - Nginx local proxy to apache on mw1245 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:05:44] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before [16:05:44] eceived: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received: /{domain}/v1/article/cre [16:05:44] eed} (article.creation.morelike - bad article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:05:48] PROBLEM - Nginx local proxy to apache on mw1340 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:05:53] PROBLEM - Apache HTTP on mw1319 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:05:53] PROBLEM - Nginx local proxy to apache on mw1229 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:05:53] PROBLEM - PHP7 rendering on mw1225 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:05:53] 
PROBLEM - PHP7 rendering on mw1229 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:05:53] PROBLEM - Apache HTTP on mw1277 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:05:53] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before [16:05:53] eceived: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received: /{domain}/v1/article/cre [16:05:54] eed} (article.creation.morelike - bad article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:05:54] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before [16:05:55] eceived: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was 
received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received: /{domain}/v1/article/cre [16:05:55] eed} (article.creation.morelike - bad article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:05:56] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) is CRITICAL: Test retrieve a random article title returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month [16:05:56] featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [16:05:57] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed ou [16:05:57] se was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:05:58] PROBLEM - restbase endpoints health on restbase2021 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was 
received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out be [16:05:58] as received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:05:59] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out be [16:05:59] as received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:06:00] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out be [16:06:00] as received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:06:01] PROBLEM - PHP7 rendering on mw1283 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:06:01] PROBLEM - PHP7 rendering on mw1221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:06:02] PROBLEM - Apache HTTP on mw1282 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:06:02] PROBLEM - Nginx local proxy to apache on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:06:03] PROBLEM - Apache HTTP on 
mw1290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:06:03] PROBLEM - Nginx local proxy to apache on mw1348 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:06:04] PROBLEM - Nginx local proxy to apache on mw1324 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:06:04] PROBLEM - PHP7 rendering on mw1267 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:06:05] PROBLEM - Apache HTTP on mw1283 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:06:05] PROBLEM - PHP7 rendering on mw1281 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:06:06] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before [16:06:06] eceived: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received: /{domain}/v1/article/cre [16:06:07] eed} (article.creation.morelike - bad article title) timed out before a response was received 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:06:07] PROBLEM - PHP7 rendering on mw1276 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:06:08] PROBLEM - LVS HTTPS IPv4 #page on text-lb.ulsfo.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [16:06:08] PROBLEM - Nginx local proxy to apache on mw1263 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:06:09] PROBLEM - Apache HTTP on mw1240 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:06:09] PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([mw1344.eqiad.wmnet, mw1272.eqiad.wmnet, mw1320.eqiad.wmnet, mw1250.eqiad.wmnet, mw1266.eqiad.wmnet, mw1223.eqiad.wmnet, mw1282.eqiad.wmnet, mw1333.eqiad.wmnet, mw1241.eqiad.wmnet, mw1221.eqiad.wmnet, mw1317.eqiad.wmnet, mw1224.eqiad.wmnet, mw1316.eqiad.wmnet, mw1325.eqiad.wmnet, mw1312.eqiad.wmnet, mw1347.eqiad.wmnet, mw1342 [16:06:10] 270.eqiad.wmnet, mw1341.eqiad.wmnet, mw1332.eqiad.wmnet, mw1313.eqiad.wmnet, mw1346.eqiad.wmnet, mw1246.eqiad.wmnet, mw1322.eqiad.wmnet, mw1288.eqiad.wmnet, mw1281.eqiad.wmnet, mw1314.eqiad.wmnet, mw1323.eqiad.wmnet, mw1227.eqiad.wmnet, mw1233.eqiad.wmnet, mw1327.eqiad.wmnet, mw1340.eqiad.wmnet, mw1258.eqiad.wmnet, mw1225.eqiad.wmnet, mw1264.eqiad.wmnet, mw1255.eqiad.wmnet, mw1244.eqiad.wmnet, mw1331.eqiad.wmnet, mw1234.eqiad.wmn [16:06:10] wmnet, mw1231.eqiad.wmnet, mw1315.eqiad.wmnet, mw1285.eqiad.wmnet, mw1275.eqiad.wmnet, mw1 https://wikitech.wikimedia.org/wiki/PyBal [16:06:11] PROBLEM - graphoid endpoints health on scb2002 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from 
mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [16:06:11] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) timed out before a response was received: /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) timed out before a response was received: /{ [16:06:11] <_joe_> uhhh [16:06:12] obile-html/{title} (Get page content HTML for test page) timed out before a response was received: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received: /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was [16:06:12] n}/v1/page/media-list/{title} (Get media list from test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [16:06:13] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed ou [16:06:13] se was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:06:14] PROBLEM - LVS HTTPS IPv6 #page on text-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [16:06:14] RECOVERY - PHP7 rendering on 
mw1246 is OK: HTTP OK: HTTP/1.1 200 OK - 76257 bytes in 3.389 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:06:15] PROBLEM - restbase endpoints health on restbase2022 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out be [16:06:15] as received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:06:16] PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out be [16:06:16] as received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:06:17] good grief [16:06:17] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out be [16:06:17] as received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:06:18] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: 
/en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out be [16:06:18] as received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:06:19] PROBLEM - PHP7 rendering on mw1223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:06:19] PROBLEM - Apache HTTP on mw1225 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:06:20] PROBLEM - graphoid endpoints health on scb2005 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [16:06:20] PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out be [16:06:21] as received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:06:21] PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out be [16:06:22] as received 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:06:22] PROBLEM - termbox codfw on termbox.svc.codfw.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [16:06:23] PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out be [16:06:23] as received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:06:23] (03CR) 10Andrew Bogott: [C: 03+1] "> if there is a more authoritative way of getting that info" [puppet] - 10https://gerrit.wikimedia.org/r/570079 (https://phabricator.wikimedia.org/T244222) (owner: 10Jbond) [16:06:24] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) timed out before a response was received: /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) timed out before a response was received: /{ [16:06:24] obile-html/{title} (Get page content HTML for test page) timed out before a response was received: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received: /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was [16:06:25] 
n}/v1/page/media-list/{title} (Get media list from test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [16:06:25] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) timed out before a response was received: /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) timed out before a response was received: /{ [16:06:26] obile-html/{title} (Get page content HTML for test page) timed out before a response was received: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received: /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was [16:06:26] n}/v1/page/media-list/{title} (Get media list from test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [16:06:27] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out be [16:06:27] as received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:06:28] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: 
/en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out be [16:06:28] as received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:06:29] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before [16:06:29] eceived: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received: /{domain}/v1/article/cre [16:06:30] eed} (article.creation.morelike - bad article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:06:30] PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out be [16:06:31] as received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:06:31] PROBLEM - restbase endpoints health on restbase2023 is CRITICAL: 
/en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out be [16:06:32] as received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:06:32] PROBLEM - Apache HTTP on mw1329 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:06:33] PROBLEM - Apache HTTP on mw1314 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:06:33] PROBLEM - Apache HTTP on mw1227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:06:34] PROBLEM - Graphoid LVS codfw on graphoid.svc.codfw.wmnet is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Graphoid [16:06:34] PROBLEM - Apache HTTP on mw1223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:06:35] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/media-list/{title} (Get media list from test page) timed out before a response was received: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received: /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) timed out before a response was received: /{domain}/v1/page/m [16:06:35] } (Get page content HTML for test page) timed out before a response was received: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a 
response was received: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received: [16:06:36] sform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [16:06:36] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before [16:06:37] eceived: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received: /{domain}/v1/article/cre [16:06:37] eed} (article.creation.morelike - bad article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:06:38] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out be [16:06:38] as received 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:06:39] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200): /en.wikipedia.org/v1/pag [16:06:39] Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [16:06:40] PROBLEM - PHP7 rendering on mw1312 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:06:40] PROBLEM - Apache HTTP on mw1289 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:06:41] PROBLEM - graphoid endpoints health on scb1002 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [16:06:41] PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out be [16:06:42] as received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:06:42] PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: 
/en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out be [16:06:43] as received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:06:43] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [16:06:44] PROBLEM - Nginx local proxy to apache on mw1225 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:06:44] PROBLEM - Nginx local proxy to apache on mw1285 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:06:45] RECOVERY - Apache HTTP on mw1270 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 9.772 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:06:45] PROBLEM - Nginx local proxy to apache on mw1345 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:06:46] PROBLEM - termbox eqiad on termbox.svc.eqiad.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [16:06:46] PROBLEM - Apache HTTP on mw1348 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:06:47] PROBLEM - PHP7 rendering on mw1280 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:06:47] PROBLEM - Nginx local proxy to apache on mw1284 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers [16:06:48] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) timed out before a response was received: /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) timed out before a response was received: /{ [16:06:48] obile-html/{title} (Get page content HTML for test page) timed out before a response was received: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received: /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was [16:06:49] n}/v1/page/media-list/{title} (Get media list from test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [16:06:49] PROBLEM - Apache HTTP on mw1229 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:06:50] PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITI [16:06:50] e the most read articles for January 1, 2016 returned the unexpected status 429 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test retrieve the 
most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 429 (expecting: 200): /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title [16:06:51] article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds [16:06:51] PROBLEM - PHP7 rendering on mw1278 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:06:52] PROBLEM - Nginx local proxy to apache on mw1233 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:06:52] PROBLEM - graphoid endpoints health on scb1003 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [16:06:53] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed ou [16:06:53] se was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:06:54] PROBLEM - Nginx local proxy to apache on mw1288 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:06:54] PROBLEM - Apache HTTP on mw1278 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:06:55] PROBLEM - PHP7 rendering on mw1347 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:06:55] PROBLEM - Nginx local proxy to apache on mw1316 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:06:56] PROBLEM - Apache HTTP on mw1315 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:06:56] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) timed out before a response was received: /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) timed out before a response was received: /{ [16:06:57] obile-html/{title} (Get page content HTML for test page) timed out before a response was received: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received: /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was [16:06:57] n}/v1/page/media-list/{title} (Get media list from test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [16:06:58] PROBLEM - graphoid endpoints health on scb2004 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [16:06:58] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed 
content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out be [16:06:59] as received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:06:59] PROBLEM - Nginx local proxy to apache on mw1226 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:07:00] PROBLEM - Nginx local proxy to apache on mw1342 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:07:00] PROBLEM - Apache HTTP on mw1341 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:07:01] RECOVERY - Apache HTTP on mw1249 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 8.078 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:07:01] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out be [16:07:02] as received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:07:02] PROBLEM - Apache HTTP on mw1285 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:07:03] PROBLEM - Apache HTTP on mw1228 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:07:03] PROBLEM - PHP7 rendering on mw1346 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:07:04] PROBLEM - 
PHP7 rendering on mw1224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:07:04] PROBLEM - PHP7 rendering on mw1226 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:07:05] PROBLEM - PHP7 rendering on mw1339 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:07:05] PROBLEM - PHP7 rendering on mw1313 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:07:06] PROBLEM - Nginx local proxy to apache on mw1312 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:07:06] PROBLEM - PHP7 rendering on mw1286 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:07:07] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200): /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/pag [16:07:07] Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [16:07:08] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) timed out before a response was received: /{domain}/v1/page/mobile-sections/{title} 
(retrieve test page via mobile-sections) timed out before a response was received: /{ [16:07:08] obile-html/{title} (Get page content HTML for test page) timed out before a response was received: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received: /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was [16:07:09] n}/v1/page/media-list/{title} (Get media list from test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [16:07:09] PROBLEM - Apache HTTP on mw1346 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:07:10] RECOVERY - Apache HTTP on mw1253 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 9.474 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:07:16] RECOVERY - Apache HTTP on mw1347 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 7.846 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:07:16] well then [16:07:20] PROBLEM - wikidata.org dispatch lag is REALLY high ---4000s- on www.wikidata.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/project/view/71/ [16:07:24] PROBLEM - Apache HTTP on mw1263 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:07:25] PROBLEM - Apache HTTP on mw1233 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:07:25] PROBLEM - Nginx local proxy to apache on mw1231 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:07:25] PROBLEM - Apache HTTP on mw1250 is 
CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:07:27] PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200): /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [16:07:28] RECOVERY - Nginx local proxy to apache on mw1340 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 9.354 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:07:30] PROBLEM - Apache HTTP on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:07:36] RECOVERY - PHP7 rendering on mw1283 is OK: HTTP OK: HTTP/1.1 200 OK - 76386 bytes in 8.929 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:07:36] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [16:07:42] PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [16:07:42] PROBLEM - graphoid endpoints health on scb2001 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [16:07:42] PROBLEM - Nginx local proxy to apache on mw1280 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers [16:07:44] RECOVERY - Apache HTTP on mw1283 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 9.615 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:07:44] PROBLEM - Nginx local proxy to apache on mw1317 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:07:44] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [16:07:48] RECOVERY - LVS HTTPS IPv4 #page on text-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 15056 bytes in 8.601 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [16:07:48] PROBLEM - PHP7 rendering on mw1261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:07:48] PROBLEM - Nginx local proxy to apache on mw1266 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:07:48] PROBLEM - Nginx local proxy to apache on mw1270 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:07:48] PROBLEM - PHP7 rendering on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:07:49] PROBLEM - Nginx local proxy to apache on mw1315 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:07:49] PROBLEM - PHP7 rendering on mw1320 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:07:50] PROBLEM - Apache HTTP on mw1323 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers [16:07:50] PROBLEM - Nginx local proxy to apache on mw1325 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:07:51] PROBLEM - Nginx local proxy to apache on mw1333 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:07:51] PROBLEM - Nginx local proxy to apache on mw1341 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:07:52] RECOVERY - PHP7 rendering on mw1223 is OK: HTTP OK: HTTP/1.1 200 OK - 76386 bytes in 9.419 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:07:52] RECOVERY - Apache HTTP on mw1225 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 9.263 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:07:53] PROBLEM - Apache HTTP on mw1248 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:08:06] PROBLEM - LVS HTTPS IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [16:08:06] RECOVERY - Apache HTTP on mw1314 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 9.053 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:08:08] RECOVERY - Apache HTTP on mw1227 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 9.851 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:08:08] RECOVERY - Apache HTTP on mw1223 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 9.506 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:08:10] RECOVERY - Apache HTTP on mw1252 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 7.845 second 
response time https://wikitech.wikimedia.org/wiki/Application_servers [16:08:14] RECOVERY - Nginx local proxy to apache on mw1225 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 9.151 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:08:16] RECOVERY - Nginx local proxy to apache on mw1285 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 8.423 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:08:18] RECOVERY - Apache HTTP on mw1324 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 9.342 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:08:18] RECOVERY - Nginx local proxy to apache on mw1345 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 8.696 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:08:20] RECOVERY - Apache HTTP on mw1348 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 7.011 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:08:22] RECOVERY - Nginx local proxy to apache on mw1284 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 8.720 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:08:26] PROBLEM - PHP7 rendering on mw1317 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:08:28] PROBLEM - LVS HTTPS IPv4 #page on text-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [16:08:34] RECOVERY - LVS HTTPS IPv6 #page on text-lb.eqsin.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 15069 bytes in 8.914 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [16:08:34] RECOVERY - Apache HTTP on mw1228 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 6.749 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers [16:08:34] PROBLEM - Nginx local proxy to apache on mw1287 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:08:34] RECOVERY - Apache HTTP on mw1269 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 9.907 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:08:40] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out be [16:08:40] as received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:08:42] RECOVERY - Nginx local proxy to apache on mw1240 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 9.827 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:08:44] RECOVERY - PHP7 rendering on mw1339 is OK: HTTP OK: HTTP/1.1 200 OK - 76386 bytes in 9.220 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:08:44] RECOVERY - PHP7 rendering on mw1226 is OK: HTTP OK: HTTP/1.1 200 OK - 76386 bytes in 9.405 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:08:44] RECOVERY - PHP7 rendering on mw1313 is OK: HTTP OK: HTTP/1.1 200 OK - 76386 bytes in 9.234 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:08:44] RECOVERY - PHP7 rendering on mw1244 is OK: HTTP OK: HTTP/1.1 200 OK - 76386 bytes in 8.034 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:08:48] RECOVERY - Apache 
HTTP on mw1272 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 9.810 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:08:54] RECOVERY - Apache HTTP on mw1325 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 9.839 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:08:58] RECOVERY - Nginx local proxy to apache on mw1245 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 5.834 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:09:02] RECOVERY - Apache HTTP on mw1250 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 7.138 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:09:06] RECOVERY - Nginx local proxy to apache on mw1231 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 9.681 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:09:10] RECOVERY - Apache HTTP on mw1234 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 7.685 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:09:12] RECOVERY - Nginx local proxy to apache on mw1229 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 6.744 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:09:12] RECOVERY - PHP7 rendering on mw1229 is OK: HTTP OK: HTTP/1.1 200 OK - 76386 bytes in 6.866 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:09:12] RECOVERY - PHP7 rendering on mw1225 is OK: HTTP OK: HTTP/1.1 200 OK - 76386 bytes in 7.232 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:09:20] RECOVERY - PHP7 rendering on mw1221 is OK: HTTP OK: HTTP/1.1 200 OK - 76386 bytes in 8.608 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:09:22] RECOVERY - Apache HTTP on mw1290 
is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 8.533 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:09:22] RECOVERY - Apache HTTP on mw1282 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 9.155 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:09:22] RECOVERY - Nginx local proxy to apache on mw1348 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 9.716 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:09:24] RECOVERY - Nginx local proxy to apache on mw1280 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 8.784 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:09:24] RECOVERY - PHP7 rendering on mw1281 is OK: HTTP OK: HTTP/1.1 200 OK - 76386 bytes in 7.537 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:09:28] RECOVERY - Apache HTTP on mw1240 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 6.760 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:09:28] RECOVERY - PHP7 rendering on mw1279 is OK: HTTP OK: HTTP/1.1 200 OK - 76386 bytes in 8.506 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:09:32] RECOVERY - LVS HTTPS IPv6 #page on text-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 15069 bytes in 8.057 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [16:09:32] RECOVERY - Apache HTTP on mw1248 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 6.404 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:09:44] PROBLEM - PHP7 rendering on mw1275 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:09:50] PROBLEM - Nginx local proxy to apache on mw1228 is 
CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:09:56] RECOVERY - Apache HTTP on mw1289 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 9.241 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:10:04] RECOVERY - Apache HTTP on mw1332 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 7.968 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:10:04] PROBLEM - PHP7 rendering on mw1232 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:10:04] PROBLEM - PHP7 rendering on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:10:04] PROBLEM - Nginx local proxy to apache on mw1265 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:10:04] PROBLEM - PHP7 rendering on mw1256 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:10:06] RECOVERY - PHP7 rendering on mw1278 is OK: HTTP OK: HTTP/1.1 200 OK - 76386 bytes in 7.912 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:10:06] PROBLEM - PHP7 rendering on mw1270 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:10:06] PROBLEM - Nginx local proxy to apache on mw1274 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:10:06] PROBLEM - PHP7 rendering on mw1329 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:10:08] RECOVERY - Apache HTTP on mw1229 is OK: 
HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 9.721 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:10:10] RECOVERY - LVS HTTPS IPv4 #page on text-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 15056 bytes in 7.216 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [16:10:11] RECOVERY - PHP7 rendering on mw1317 is OK: HTTP OK: HTTP/1.1 200 OK - 76385 bytes in 9.072 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:10:11] RECOVERY - Apache HTTP on mw1315 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 7.594 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:10:14] RECOVERY - Nginx local proxy to apache on mw1226 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 8.033 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:10:16] RECOVERY - Nginx local proxy to apache on mw1297 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 7.595 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:10:16] RECOVERY - Apache HTTP on mw1341 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 8.687 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:10:18] RECOVERY - Nginx local proxy to apache on mw1342 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 9.663 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:10:20] PROBLEM - LVS HTTPS IPv4 #page on text-lb.codfw.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [16:10:24] PROBLEM - PHP7 rendering on mw1247 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:10:30] RECOVERY - PHP7 rendering on mw1224 is OK: HTTP OK: HTTP/1.1 200 OK 
- 76386 bytes in 8.724 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:10:34] RECOVERY - Apache HTTP on mw1264 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 9.315 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:10:38] RECOVERY - Nginx local proxy to apache on mw1272 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 9.503 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:10:50] PROBLEM - Nginx local proxy to apache on mw1256 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:10:50] RECOVERY - Apache HTTP on mw1233 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 9.456 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:10:58] PROBLEM - PHP7 rendering on mw1297 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:10:58] PROBLEM - Nginx local proxy to apache on mw1246 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:11:00] RECOVERY - Apache HTTP on mw1319 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 9.819 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:11:04] PROBLEM - graphoid endpoints health on scb1001 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [16:11:04] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was 
received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out be [16:11:04] as received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:11:04] PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /en.wikiped [16:11:04] atured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:11:05] PROBLEM - Apache HTTP on mw1270 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:11:06] RECOVERY - Nginx local proxy to apache on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 7.014 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:11:10] RECOVERY - Nginx local proxy to apache on mw1324 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 9.123 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:11:12] RECOVERY - Nginx local proxy to apache on mw1317 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 7.565 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:11:12] RECOVERY - PHP7 rendering on mw1276 is OK: HTTP OK: HTTP/1.1 200 OK - 76386 bytes in 6.291 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:11:13] 10Operations, 10ops-codfw, 10netops: codfw: Delete cloud interface-range - https://phabricator.wikimedia.org/T244196 (10Papaul) okay thanks will do that 
[16:11:14] RECOVERY - Apache HTTP on mw1333 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 7.426 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:11:14] RECOVERY - Apache HTTP on mw1323 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 5.664 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:11:16] RECOVERY - Nginx local proxy to apache on mw1341 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 7.144 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:11:16] RECOVERY - Nginx local proxy to apache on mw1315 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 9.354 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:11:16] RECOVERY - Nginx local proxy to apache on mw1333 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 7.689 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:11:16] RECOVERY - Nginx local proxy to apache on mw1266 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 9.983 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:11:20] PROBLEM - Apache HTTP on mw1321 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:11:22] RECOVERY - Nginx local proxy to apache on mw1261 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 9.788 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:11:30] RECOVERY - PHP7 rendering on mw1275 is OK: HTTP OK: HTTP/1.1 200 OK - 76386 bytes in 9.296 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:11:30] RECOVERY - Nginx local proxy to apache on mw1228 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 5.336 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:11:34] RECOVERY - Apache 
HTTP on mw1329 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 6.581 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:11:34] RECOVERY - PHP7 rendering on mw1312 is OK: HTTP OK: HTTP/1.1 200 OK - 76386 bytes in 2.392 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:11:38] RECOVERY - LVS HTTPS IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 15069 bytes in 9.737 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [16:11:38] RECOVERY - Nginx local proxy to apache on mw1327 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 8.938 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:11:38] RECOVERY - Apache HTTP on mw1328 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 6.227 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:11:38] RECOVERY - Nginx local proxy to apache on mw1328 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 6.339 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:11:38] RECOVERY - Graphoid LVS codfw on graphoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Graphoid [16:11:38] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [16:11:39] * addshore reads up [16:11:39] RECOVERY - graphoid endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [16:11:42] RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:11:42] RECOVERY - Apache HTTP on mw1331 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 8.689 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers [16:11:44] RECOVERY - Nginx local proxy to apache on mw1267 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 8.985 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:11:44] RECOVERY - PHP7 rendering on mw1234 is OK: HTTP OK: HTTP/1.1 200 OK - 76385 bytes in 0.502 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:11:46] RECOVERY - PHP7 rendering on mw1232 is OK: HTTP OK: HTTP/1.1 200 OK - 76386 bytes in 1.887 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:11:46] RECOVERY - PHP7 rendering on mw1330 is OK: HTTP OK: HTTP/1.1 200 OK - 76386 bytes in 4.888 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:11:48] RECOVERY - PHP7 rendering on mw1280 is OK: HTTP OK: HTTP/1.1 200 OK - 76386 bytes in 1.272 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:11:48] RECOVERY - Apache HTTP on mw1271 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 6.074 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:11:48] RECOVERY - Nginx local proxy to apache on mw1271 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 6.077 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:11:48] RECOVERY - Apache HTTP on mw1320 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 6.376 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:11:50] RECOVERY - Nginx local proxy to apache on mw1233 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 1.464 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:11:50] RECOVERY - Apache HTTP on mw1274 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 7.916 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers [16:11:50] RECOVERY - PHP7 rendering on mw1321 is OK: HTTP OK: HTTP/1.1 200 OK - 76386 bytes in 7.935 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:11:50] RECOVERY - Nginx local proxy to apache on mw1265 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 6.066 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:11:50] RECOVERY - PHP7 rendering on mw1256 is OK: HTTP OK: HTTP/1.1 200 OK - 76386 bytes in 6.879 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:11:52] RECOVERY - PHP7 rendering on mw1328 is OK: HTTP OK: HTTP/1.1 200 OK - 76386 bytes in 5.482 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:11:52] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [16:11:52] RECOVERY - Nginx local proxy to apache on mw1268 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 6.484 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:11:52] RECOVERY - Nginx local proxy to apache on mw1329 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 6.534 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:11:53] RECOVERY - Apache HTTP on mw1278 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.033 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:11:54] RECOVERY - termbox eqiad on termbox.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [16:11:54] RECOVERY - PHP7 rendering on mw1329 is OK: HTTP OK: HTTP/1.1 200 OK - 76386 bytes in 6.951 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering 
[16:11:54] RECOVERY - graphoid endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [16:11:55] RECOVERY - PHP7 rendering on mw1270 is OK: HTTP OK: HTTP/1.1 200 OK - 76386 bytes in 7.709 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:11:55] RECOVERY - Nginx local proxy to apache on mw1288 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 1.747 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:11:56] RECOVERY - Nginx local proxy to apache on mw1316 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 1.823 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:11:56] RECOVERY - PHP7 rendering on mw1347 is OK: HTTP OK: HTTP/1.1 200 OK - 76386 bytes in 1.872 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:11:57] RECOVERY - Nginx local proxy to apache on mw1274 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 8.468 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:11:58] RECOVERY - Apache HTTP on mw1285 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.044 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:11:58] RECOVERY - Nginx local proxy to apache on mw1287 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 0.484 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:12:00] RECOVERY - graphoid endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [16:12:00] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [16:12:00] RECOVERY - PHP7 rendering on mw1272 is OK: HTTP OK: HTTP/1.1 200 OK - 76386 bytes in 2.357 
second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:12:04] RECOVERY - LVS HTTPS IPv4 #page on text-lb.codfw.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 15056 bytes in 3.909 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [16:12:06] RECOVERY - LVS HTTPS IPv6 #page on text-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 15069 bytes in 4.386 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [16:12:06] RECOVERY - Apache HTTP on mw1265 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 3.499 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:12:06] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:12:06] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:12:06] RECOVERY - PHP7 rendering on mw1268 is OK: HTTP OK: HTTP/1.1 200 OK - 76386 bytes in 5.373 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:12:07] RECOVERY - Nginx local proxy to apache on mw1321 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 4.501 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:12:07] RECOVERY - Apache HTTP on mw1266 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 5.571 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:12:08] RECOVERY - Apache HTTP on mw1327 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 6.103 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:12:08] RECOVERY - Nginx local proxy to apache on mw1326 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 1.687 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers [16:12:09] RECOVERY - Nginx local proxy to apache on mw1264 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 6.076 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:12:09] RECOVERY - Nginx local proxy to apache on mw1330 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 2.423 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:12:10] RECOVERY - Nginx local proxy to apache on mw1262 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 6.953 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:12:10] RECOVERY - PHP7 rendering on mw1247 is OK: HTTP OK: HTTP/1.1 200 OK - 76386 bytes in 3.248 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:12:12] RECOVERY - Nginx local proxy to apache on mw1331 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 2.330 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:12:12] RECOVERY - Nginx local proxy to apache on mw1312 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.042 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:12:12] RECOVERY - PHP7 rendering on mw1346 is OK: HTTP OK: HTTP/1.1 200 OK - 76385 bytes in 0.664 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:12:12] RECOVERY - Apache HTTP on mw1326 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.747 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:12:14] RECOVERY - Nginx local proxy to apache on mw1275 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.107 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:12:14] RECOVERY - PHP7 rendering on mw1286 is OK: HTTP OK: HTTP/1.1 200 OK - 76385 bytes in 0.126 second 
response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:12:16] RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:12:16] RECOVERY - PHP7 rendering on mw1326 is OK: HTTP OK: HTTP/1.1 200 OK - 76385 bytes in 0.522 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:12:16] RECOVERY - Apache HTTP on mw1346 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.641 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:12:16] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [16:12:18] RECOVERY - Nginx local proxy to apache on mw1332 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 1.606 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:12:18] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:12:18] RECOVERY - Nginx local proxy to apache on mw1320 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 2.682 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:12:18] RECOVERY - Nginx local proxy to apache on mw1269 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 2.979 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:12:18] RECOVERY - PHP7 rendering on mw1263 is OK: HTTP OK: HTTP/1.1 200 OK - 76386 bytes in 1.056 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:12:20] RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:12:20] RECOVERY - PHP7 
rendering on mw1271 is OK: HTTP OK: HTTP/1.1 200 OK - 76385 bytes in 0.685 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:12:22] RECOVERY - PHP7 rendering on mw1327 is OK: HTTP OK: HTTP/1.1 200 OK - 76386 bytes in 1.938 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:12:22] RECOVERY - Nginx local proxy to apache on mw1322 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.067 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:12:24] RECOVERY - PHP7 rendering on mw1331 is OK: HTTP OK: HTTP/1.1 200 OK - 76386 bytes in 1.103 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:12:24] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:12:26] RECOVERY - PHP7 rendering on mw1333 is OK: HTTP OK: HTTP/1.1 200 OK - 76385 bytes in 0.132 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:12:26] RECOVERY - PHP7 rendering on mw1322 is OK: HTTP OK: HTTP/1.1 200 OK - 76385 bytes in 0.144 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:12:26] RECOVERY - proton endpoints health on proton2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [16:12:28] RECOVERY - proton endpoints health on proton2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [16:12:30] RECOVERY - Nginx local proxy to apache on mw1256 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 0.305 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:12:32] RECOVERY - Apache HTTP on mw1263 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 
0.469 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:12:36] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:12:38] RECOVERY - PHP7 rendering on mw1297 is OK: HTTP OK: HTTP/1.1 200 OK - 76385 bytes in 0.122 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:12:38] RECOVERY - Nginx local proxy to apache on mw1246 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 0.146 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:12:42] RECOVERY - Apache HTTP on mw1277 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.044 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:12:44] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [16:12:44] RECOVERY - graphoid endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [16:12:46] RECOVERY - Apache HTTP on mw1270 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 1.486 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:12:46] RECOVERY - restbase endpoints health on restbase2021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:12:46] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:12:46] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:12:50] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Citoid [16:12:50] RECOVERY - Apache HTTP on mw1267 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 1.539 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:12:54] RECOVERY - PHP7 rendering on mw1267 is OK: HTTP OK: HTTP/1.1 200 OK - 76386 bytes in 1.531 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:12:56] RECOVERY - PHP7 rendering on mw1262 is OK: HTTP OK: HTTP/1.1 200 OK - 76386 bytes in 2.170 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:12:56] RECOVERY - Nginx local proxy to apache on mw1263 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.086 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:12:58] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:12:58] RECOVERY - graphoid endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [16:12:58] RECOVERY - PHP7 rendering on mw1261 is OK: HTTP OK: HTTP/1.1 200 OK - 76386 bytes in 1.266 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:12:58] RECOVERY - Nginx local proxy to apache on mw1270 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 1.520 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:12:58] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [16:12:59] RECOVERY - Nginx local proxy to apache on mw1325 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 0.767 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:12:59] RECOVERY - wikidata.org dispatch lag is REALLY high 
---4000s- on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1906 bytes in 0.093 second response time https://phabricator.wikimedia.org/project/view/71/ [16:13:00] RECOVERY - PHP7 rendering on mw1320 is OK: HTTP OK: HTTP/1.1 200 OK - 76386 bytes in 1.844 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:13:01] RECOVERY - Apache HTTP on mw1321 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.466 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:13:02] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [16:13:04] RECOVERY - Apache HTTP on mw1262 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 1.399 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:13:04] RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:13:04] RECOVERY - restbase endpoints health on restbase2022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:13:04] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:13:06] RECOVERY - PHP7 rendering on mw1332 is OK: HTTP OK: HTTP/1.1 200 OK - 76385 bytes in 0.620 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:13:06] RECOVERY - graphoid endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [16:13:06] RECOVERY - graphoid endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [16:13:08] RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:13:08] RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:13:08] RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:13:10] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [16:13:14] RECOVERY - termbox codfw on termbox.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [16:13:14] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [16:13:16] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [16:13:16] RECOVERY - restbase endpoints health on restbase2023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:13:16] RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:13:16] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:13:18] RECOVERY - proton endpoints health on proton1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [16:13:18] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:13:24] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase 
[16:13:26] RECOVERY - Apache HTTP on mw1261 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 1.050 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:13:26] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:13:34] RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:13:44] RECOVERY - phpfpm_up reduced availability on icinga1001 is OK: (C)0.8 le (W)0.9 le 0.9928 https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_exporters_%22up%22_metrics_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:13:46] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [16:13:56] PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read [16:13:56] ary 1, 2016 (with aggregated=true)) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve most-read articles for date with no data (with aggregated=true)) timed out before a response was received: / (spec from root) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a r [16:13:56] ved: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve featured article info for unsupported site (with aggregated=true)) timed out before a 
response was received: /{domain}/v1/page/news (get In the News content) timed out before a response was received: /{domain}/v1/page/news (get In the News content for unsupported language (with aggregated=true)) timed out before a response was received: /{domain}/v1/feed/onthisday/{ty [16:13:56] (retrieve selected events on January 15) timed out before a response was received: /{domain}/v1/feed/availability (Retrieve feed content availability from \wikipedia.org\) timed out before a response was received: /_info (retrieve service info) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received: /{domain}/v1/feed/announcements (Retri [16:13:56] ) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds [16:14:16] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:14:18] RECOVERY - High average POST latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POST [16:14:40] PROBLEM - LVS HTTP IPv4 on wikifeeds.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.47 and port 8889: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [16:15:46] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [16:15:46] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [16:15:48] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [16:16:04] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:16:04] RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:16:18] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:16:18] RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:16:18] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:16:20] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:16:20] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:16:28] RECOVERY - LVS HTTP IPv4 on wikifeeds.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 945 bytes in 0.009 second response time 
https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [16:16:55] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:17:13] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [16:18:31] RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [16:18:51] (03CR) 10Andrew Bogott: [C: 04-1] "will discuss my concerns offline" [puppet] - 10https://gerrit.wikimedia.org/r/570079 (https://phabricator.wikimedia.org/T244222) (owner: 10Jbond) [16:19:23] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - kartotherian-ssl_443: Servers maps1004.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:19:27] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:20:27] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:20:59] RECOVERY - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [16:21:11] (03Abandoned) 10Krinkle: Lower wgHTTPTimeout from default 25s to 0.5s [mediawiki-config] - 10https://gerrit.wikimedia.org/r/567365 (owner: 10Krinkle) [16:23:57] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - kartotherian-ssl_443: Servers maps1003.eqiad.wmnet, maps1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:25:11] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers 
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:26:03] PROBLEM - Kartotherian LVS eqiad #page on kartotherian.svc.eqiad.wmnet is CRITICAL: /v4/marker/pin-m-fuel+ffffff@2x.png (scaled pushpin marker with an icon) timed out before a response was received: /v4/marker/pin-m-fuel+ffffff.png (Untitled test) timed out before a response was received https://wikitech.wikimedia.org/wiki/Maps%23Kartotherian [16:26:11] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:26:11] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:26:13] RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [16:26:27] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:26:33] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - kartotherian-ssl_443: Servers maps1003.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:27:17] RECOVERY - Kartotherian LVS eqiad #page on kartotherian.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Maps%23Kartotherian [16:27:49] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:27:49] PROBLEM - Check whether microcode mitigations for CPU vulnerabilities are applied on ganeti1017 is CRITICAL: CRITICAL - Server is missing the following CPU flags: {md_clear} https://wikitech.wikimedia.org/wiki/Microcode [16:29:07] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:29:37] (03PS2) 10Ayounsi: Juniper to Netbox import script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/566812 [16:29:39] (03CR) 10Gilles: "At least one example video should be added to the test suite" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/569341 (https://phabricator.wikimedia.org/T166024) (owner: 10Brion VIBBER) [16:30:03] (03CR) 10jerkins-bot: [V: 04-1] Juniper to Netbox import script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/566812 (owner: 10Ayounsi) [16:31:25] PROBLEM - Kartotherian LVS eqiad #page on kartotherian.svc.eqiad.wmnet is CRITICAL: /v4/marker/pin-m-fuel+ffffff.png (Untitled test) timed out before a response was received https://wikitech.wikimedia.org/wiki/Maps%23Kartotherian [16:33:34] 10Operations, 10Performance-Team, 10serviceops, 10Wikimedia-production-error: Wiki diffs take over 15s to load - https://phabricator.wikimedia.org/T244058 (10RhinosF1) Noticing alot of slow wikis + report of downtime on mediawiki.org discord - Both pages get a wikimedia timeout mentioned above. [16:34:41] PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [16:35:12] 10Operations, 10Performance-Team, 10serviceops, 10Wikimedia-production-error: Wiki diffs take over 15s to load - https://phabricator.wikimedia.org/T244058 (10Marostegui) >>! 
In T244058#5848681, @RhinosF1 wrote: > Noticing alot of slow wikis + report of downtime on mediawiki.org discord - Both pages get a w... [16:36:35] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar), 10Performance-Team-publish: Varnish HTTP response from app servers taking 160s (only 0.031s inside Apache) - https://phabricator.wikimedia.org/T181315 (10Krinkle) 05Open→03Declined We've migrated from varnish-be to ats-be (see <... [16:36:55] 10Operations, 10Performance-Team, 10serviceops, 10Wikimedia-production-error: Wiki diffs take over 15s to load - https://phabricator.wikimedia.org/T244058 (10RhinosF1) >>! In T244058#5848683, @Marostegui wrote: >>>! In T244058#5848681, @RhinosF1 wrote: >> Noticing alot of slow wikis + report of downtime on... [16:37:07] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:37:23] RECOVERY - Kartotherian LVS eqiad #page on kartotherian.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Maps%23Kartotherian [16:37:26] 10Operations, 10Performance-Team, 10serviceops, 10Wikimedia-production-error: Wiki diffs take over 15s to load - https://phabricator.wikimedia.org/T244058 (10Marostegui) >>! In T244058#5848686, @RhinosF1 wrote: >>>! In T244058#5848683, @Marostegui wrote: >>>>! In T244058#5848681, @RhinosF1 wrote: >>> Notic... 
[16:39:21] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:39:29] PROBLEM - LVS HTTPS IPv4 #page on kartotherian.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [16:39:55] PROBLEM - Maps HTTPS on maps1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [16:41:07] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - kartotherian-ssl_443: Servers maps1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:42:21] PROBLEM - Kartotherian LVS eqiad #page on kartotherian.svc.eqiad.wmnet is CRITICAL: /v4/marker/pin-m-fuel+ffffff.png (Untitled test) timed out before a response was received: /v4/marker/pin-m-fuel+ffffff@2x.png (scaled pushpin marker with an icon) timed out before a response was received https://wikitech.wikimedia.org/wiki/Maps%23Kartotherian [16:42:57] PROBLEM - Maps HTTPS on maps1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [16:45:14] 10Operations, 10ops-codfw, 10Core Platform Team Workboards (Clinic Duty Team): Bootstrap new Cassandra instances: restbase202[123]-{a,b,c} - https://phabricator.wikimedia.org/T243000 (10Eevans) [16:45:22] (03PS1) 10BBlack: Varnish: force XFP: https from the inside [puppet] - 10https://gerrit.wikimedia.org/r/570088 [16:46:01] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [16:46:11] 10Operations, 10ops-codfw, 10Core Platform Team Workboards (Clinic Duty Team): Bootstrap new Cassandra instances: restbase202[123]-{a,b,c} - https://phabricator.wikimedia.org/T243000 (10Eevans) This is complete. 
[16:46:49] PROBLEM - LVS HTTP IPv4 on kartotherian.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.13 and port 6533: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [16:47:51] PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([maps1002.eqiad.wmnet, maps1001.eqiad.wmnet, maps1004.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [16:47:56] 10Operations, 10Discovery, 10Traffic, 10Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (10ArthurPSmith) @Addshore and others - the problem has deteriorated since Saturday - see this discussion... [16:48:07] RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:48:07] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:48:14] (03PS2) 10Cparle: Remove handler deleted from the MachineVision extension on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566859 (https://phabricator.wikimedia.org/T241242) (owner: 10Matthias Mullie) [16:48:15] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - kartotherian_6533: Servers maps1001.eqiad.wmnet, maps1004.eqiad.wmnet are marked down but pooled: kartotherian-ssl_443: Servers maps1002.eqiad.wmnet, maps1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:48:29] PROBLEM - Check systemd state on maps1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:48:36] (03PS3) 10Cparle: Remove handler deleted from the MachineVision extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566860 (https://phabricator.wikimedia.org/T241242) (owner: 10Matthias Mullie) [16:48:37] PROBLEM - Check systemd state on maps1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:48:39] PROBLEM - Check systemd state on maps1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:48:59] PROBLEM - Maps HTTPS on maps1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [16:49:17] PROBLEM - Check systemd state on maps1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:49:19] PROBLEM - Maps HTTPS on maps1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [16:49:55] (03PS3) 10Cparle: Remove handler deleted from the MachineVision extension on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566859 (https://phabricator.wikimedia.org/T241242) (owner: 10Matthias Mullie) [16:51:49] (03CR) 10Cparle: "> Patch Set 1:" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566859 (https://phabricator.wikimedia.org/T241242) (owner: 10Matthias Mullie) [16:52:03] (03PS1) 10Ema: ATS: temporarily leave AE untouched [puppet] - 10https://gerrit.wikimedia.org/r/570092 (https://phabricator.wikimedia.org/T242478) [16:52:15] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([maps1003.eqiad.wmnet, maps1002.eqiad.wmnet, maps1001.eqiad.wmnet, maps1004.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [16:52:21] !log ladsgroup@deploy1001 Synchronized php-1.35.0-wmf.16/extensions/Wikibase: fix for the recent outage (duration: 01m 21s) [16:52:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:41] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:54:04] (03PS1) 10Clarakosi: Add restbase202[123] to hiera [puppet] - 10https://gerrit.wikimedia.org/r/570094 (https://phabricator.wikimedia.org/T244178) [16:57:23] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:58:44] (03CR) 10Ema: [C: 03+2] ATS: temporarily leave AE untouched [puppet] - 10https://gerrit.wikimedia.org/r/570092 (https://phabricator.wikimedia.org/T242478) (owner: 10Ema) [17:00:04] godog and _joe_: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Puppet SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200204T1700). [17:00:04] No GERRIT patches in the queue for this window AFAICS. [17:05:09] (03PS3) 10Jbond: wmflablibs: add new facts for wmflabs [puppet] - 10https://gerrit.wikimedia.org/r/570079 (https://phabricator.wikimedia.org/T244222) [17:07:20] <_joe_> !log restarted php-fpm on mw1264 with 240 workers [17:07:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:34] <_joe_> !log restarted php-fpm on mw1265 with 80 workers (the default) [17:07:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:36] !log ladsgroup@deploy1001 Synchronized php-1.35.0-wmf.16/includes/filerepo/file/ForeignDBFile.php: gerrit: 570089, ongoing incident (duration: 01m 04s) [17:10:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:04] <_joe_> !log restarting php-fpm on mw1266-9 [17:11:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:51] RECOVERY - High average POST latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [17:13:02] <_joe_> !log restarting php-fpm on mw126[1-3] [17:13:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:47] 10Operations, 10Core Platform Team, 10MediaWiki-General, 10serviceops: siteinfo api calls should be cached for N minutes on the caching layer - https://phabricator.wikimedia.org/T244204 (10Anomie) Caching in the caching layer might have been helpful for these specific requests, but would that merely have d... [17:27:13] (03CR) 10Cmjohnson: [C: 03+2] Adding netboot and dhcp file for cescout1001 [puppet] - 10https://gerrit.wikimedia.org/r/570081 (https://phabricator.wikimedia.org/T239250) (owner: 10Cmjohnson) [17:28:19] 10Operations, 10ops-eqiad, 10Patch-For-Review: setup/install cescout1001.eqiad.wmnet - https://phabricator.wikimedia.org/T239250 (10Cmjohnson) [17:30:57] 10Operations, 10ops-eqiad, 10serviceops: (Need By Dec 20) rack/setup/install mw13[49-84].eqiad.wmnet - https://phabricator.wikimedia.org/T236437 (10Cmjohnson) @joe working on these next...will have to you in next day or 2. [17:33:12] (03PS2) 10Brion VIBBER: Support MPEG-1 and MPEG-2 video files with .mpg or .mpeg extension [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/569341 (https://phabricator.wikimedia.org/T166024) [17:34:12] !log oblivian@cumin1001 conftool action : set/weight=15; selector: cluster=appserver,service=nginx,dc=eqiad,name=mw12[3-5].* [17:34:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:54] anyone know how to run the thumbor-plugins tests? like, is there a particular operating system or version of python that's required? 
[17:36:03] i've got a lot of warnings about python 2.7 being end of life :D [17:37:18] wiki page says "Once the tests pass locally, if you're using an OS different than Debian Jessie for local development, you should run the tests again on a WMCS machine. This is because variations in the exact versions of the underlying software, like imagemagick, can result in different visual comparison scores for example from one platform to the next." [17:37:25] which is worrying, as jessie is pretty obsolete iirc [17:37:30] i assume it's just out of date [17:38:59] (03PS3) 10SBassett: Also log authevents channel. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477005 (owner: 10Brian Wolff) [17:40:19] (03CR) 10SBassett: "If this would still be helpful, can we just deploy it and file a bug (if one does not already exist) for ConfirmEdit improvements? I'm ha" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477005 (owner: 10Brian Wolff) [17:41:17] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Hardware): relocate/reimage cloudvirt1013 with 10G interfaces - https://phabricator.wikimedia.org/T243414 (10Andrew) 05Open→03Stalled Ah, dammit, dc-ops missed this ticket and now 1013 is back in service on 1G. So it's no longer a g... 
[17:41:30] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10Andrew) [17:41:31] RECOVERY - Maps HTTPS on maps1002 is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 2.241 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [17:41:37] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10Andrew) [17:41:39] RECOVERY - Check systemd state on maps1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:41:43] !log reenable kartotherian on maps100* [17:41:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:47] RECOVERY - LVS HTTP IPv4 on kartotherian.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 3.856 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [17:41:47] RECOVERY - Check systemd state on maps1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:41:49] RECOVERY - Check systemd state on maps1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:42:01] PROBLEM - kartotherian endpoints health on maps1003 is CRITICAL: /v4/marker/pin-m-fuel+ffffff@2x.png (scaled pushpin marker with an icon) timed out before a response was received: /v4/marker/pin-m-fuel+ffffff.png (Untitled test) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [17:42:09] RECOVERY - Maps HTTPS on maps1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 4.040 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [17:42:27] RECOVERY - Check systemd state on maps1002 is OK: OK - running: The system 
is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:42:32] RECOVERY - Maps HTTPS on maps1004 is OK: HTTP OK: HTTP/1.1 200 OK - 1329 bytes in 6.426 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [17:42:39] RECOVERY - Kartotherian LVS eqiad #page on kartotherian.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Maps%23Kartotherian [17:42:53] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [17:43:05] PROBLEM - kartotherian endpoints health on maps1001 is CRITICAL: /v4/marker/pin-m-fuel+ffffff@2x.png (scaled pushpin marker with an icon) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [17:43:09] RECOVERY - Maps HTTPS on maps1003 is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 6.359 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [17:43:17] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:44:55] RECOVERY - LVS HTTPS IPv4 #page on kartotherian.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 9.923 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [17:46:13] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [17:46:53] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:50:35] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts 
in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [17:54:37] PROBLEM - Maps HTTPS on maps1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [17:56:07] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - kartotherian-ssl_443: Servers maps1002.eqiad.wmnet, maps1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:56:27] RECOVERY - Maps HTTPS on maps1002 is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 8.057 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [17:57:33] !log ✔️ cdanis@mw1272.eqiad.wmnet ~ 🕐☕ sudo restart-php7.2-fpm [17:57:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:37] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - kartotherian-ssl_443: Servers maps1003.eqiad.wmnet, maps1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:00:00] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster: Degraded RAID on analytics1030 - https://phabricator.wikimedia.org/T243971 (10Ottomata) 05Open→03Declined analytics1030 is a node in the analytics-test cluster. We will be ordering replacement hardware this year so we aren't going to worry a... [18:00:04] cscott, arlolra, subbu, halfak, and accraze: #bothumor I � Unicode. All rise for Services – Graphoid / Parsoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200204T1800). 
[18:01:27] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:01:35] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:05:17] PROBLEM - Maps HTTPS on maps1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [18:06:39] (03PS6) 10Hnowlan: mediawiki: check mw versions match those on the deploy server [puppet] - 10https://gerrit.wikimedia.org/r/566708 (https://phabricator.wikimedia.org/T242023) [18:07:03] RECOVERY - Maps HTTPS on maps1003 is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 4.749 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [18:08:46] (03CR) 10jerkins-bot: [V: 04-1] mediawiki: check mw versions match those on the deploy server [puppet] - 10https://gerrit.wikimedia.org/r/566708 (https://phabricator.wikimedia.org/T242023) (owner: 10Hnowlan) [18:08:49] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - kartotherian-ssl_443: Servers maps1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:10:39] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:12:10] rescheduling service checks on remaining icinga alert lvs1015 [18:14:01] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - kartotherian-ssl_443: Servers maps1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:14:43] (03PS7) 10Hnowlan: mediawiki: check mw versions match those on the deploy server [puppet] - 10https://gerrit.wikimedia.org/r/566708 (https://phabricator.wikimedia.org/T242023) [18:15:20] (03CR) 10Hnowlan: mediawiki: check mw versions match those on the deploy server (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/566708 (https://phabricator.wikimedia.org/T242023) (owner: 10Hnowlan) [18:15:51] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - kartotherian-ssl_443: Servers maps1003.eqiad.wmnet, maps1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:16:27] ^ maps1003 on 443 is working [18:17:41] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:18:12] mutante: I think it's just going to be flapping like that for as long as the maps machines are at 100% CPU [18:18:16] confirmed maps1004/maps1001 as well.. pybal should notice [18:18:25] ack [18:19:29] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:20:11] indeed. pybal checks recovered but HTTPS checks on the host themselves flapping instead now [18:21:51] PROBLEM - Maps HTTPS on maps1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [18:26:51] (03CR) 10Jforrester: [C: 03+1] "> Patch Set 1:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566410 (owner: 10Legoktm) [18:27:25] RECOVERY - Maps HTTPS on maps1003 is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 9.908 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [18:28:43] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - kartotherian-ssl_443: Servers maps1003.eqiad.wmnet, maps1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:32:20] (03CR) 10Brion VIBBER: "I tried running the offline tests on Debian Stretch with no success. 
Lots of complaints like this:" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/569341 (https://phabricator.wikimedia.org/T166024) (owner: 10Brion VIBBER) [18:33:29] (03PS1) 10Krinkle: contint: Fix doc.wikimedia.org/favicon.ico 404 error [puppet] - 10https://gerrit.wikimedia.org/r/570107 [18:34:13] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - kartotherian-ssl_443: Servers maps1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:38:02] (03PS5) 10Holger Knust: Migrate changeprop & cpjobqueue to kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/554576 [18:38:29] PROBLEM - Maps HTTPS on maps1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [18:38:49] (03CR) 10Jforrester: [C: 03+1] contint: Fix doc.wikimedia.org/favicon.ico 404 error [puppet] - 10https://gerrit.wikimedia.org/r/570107 (owner: 10Krinkle) [18:41:31] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:43:42] PROBLEM - Kartotherian LVS eqiad #page on kartotherian.svc.eqiad.wmnet is CRITICAL: /v4/marker/pin-m-fuel+ffffff.png (Untitled test) timed out before a response was received https://wikitech.wikimedia.org/wiki/Maps%23Kartotherian [18:43:50] (03PS6) 10Holger Knust: Migrate changeprop & cpjobqueue to kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/554576 [18:43:58] RECOVERY - Maps HTTPS on maps1003 is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 4.340 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [18:44:39] RECOVERY - kartotherian endpoints health on maps1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [18:45:27] !log cdanis@cumin2001 conftool action : set/pooled=false; selector: dnsdisc=kartotherian,name=eqiad [18:45:29] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:54] (03CR) 10Holger Knust: "Addressed review comments and removed files not needed at this stage" (0327 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/554576 (owner: 10Holger Knust) [18:50:41] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:50:45] RECOVERY - Kartotherian LVS eqiad #page on kartotherian.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Maps%23Kartotherian [18:51:05] RECOVERY - kartotherian endpoints health on maps1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [18:56:39] PROBLEM - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - CRITICAL - kartotherian-ssl_443: Servers maps2004.codfw.wmnet, maps2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:56:47] PROBLEM - Kartotherian LVS codfw #page on kartotherian.svc.codfw.wmnet is CRITICAL: /v4/marker/pin-m-fuel+ffffff@2x.png (scaled pushpin marker with an icon) timed out before a response was received: /v4/marker/pin-m-fuel+ffffff.png (Untitled test) timed out before a response was received https://wikitech.wikimedia.org/wiki/Maps%23Kartotherian [18:57:15] PROBLEM - PyBal backends health check on lvs2006 is CRITICAL: PYBAL CRITICAL - CRITICAL - kartotherian-ssl_443: Servers maps2003.codfw.wmnet, maps2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:59:57] PROBLEM - LVS HTTPS IPv4 #page on kartotherian.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [19:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200204T1900) [19:00:49] PROBLEM - Maps HTTPS on maps2004 is CRITICAL: CRITICAL - Socket timeout 
after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [19:02:12] PROBLEM - Maps HTTPS on maps2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [19:02:37] RECOVERY - Maps HTTPS on maps2004 is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 6.843 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [19:02:45] RECOVERY - PyBal backends health check on lvs2006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:05:43] RECOVERY - Maps HTTPS on maps2003 is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 9.058 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [19:07:19] RECOVERY - LVS HTTPS IPv4 #page on kartotherian.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 7.507 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [19:07:45] RECOVERY - PyBal backends health check on lvs2003 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:09:21] PROBLEM - LVS HTTP IPv4 on kartotherian.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [19:09:43] RECOVERY - Kartotherian LVS codfw #page on kartotherian.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Maps%23Kartotherian [19:13:03] PROBLEM - Host cp3065 is DOWN: PING CRITICAL - Packet loss = 100% [19:13:03] RECOVERY - LVS HTTP IPv4 on kartotherian.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 7.745 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [19:13:47] PROBLEM - Maps HTTPS on maps2004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [19:15:09] PROBLEM - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - CRITICAL - kartotherian-ssl_443: Servers maps2003.codfw.wmnet, 
maps2001.codfw.wmnet, maps2004.codfw.wmnet, maps2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:15:17] !log cp3065 - powercycling [19:15:17] PROBLEM - Kartotherian LVS codfw #page on kartotherian.svc.codfw.wmnet is CRITICAL: /v4/marker/pin-m-fuel+ffffff.png (Untitled test) timed out before a response was received: /v4/marker/pin-m-fuel+ffffff@2x.png (scaled pushpin marker with an icon) timed out before a response was received: /v4/marker/pin-m+ffffff@2x.png (Untitled test) timed out before a response was received https://wikitech.wikimedia.org/wiki/Maps%23Kartotherian [19:15:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:37] PROBLEM - PyBal backends health check on lvs2006 is CRITICAL: PYBAL CRITICAL - CRITICAL - kartotherian-ssl_443: Servers maps2001.codfw.wmnet, maps2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:17:55] RECOVERY - Host cp3065 is UP: PING OK - Packet loss = 0%, RTA = 83.32 ms [19:18:49] PROBLEM - Maps HTTPS on maps2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [19:18:53] RECOVERY - PyBal backends health check on lvs2003 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:19:33] RECOVERY - Maps HTTPS on maps2004 is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 7.967 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [19:20:38] PROBLEM - Maps HTTPS on maps2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [19:20:39] RECOVERY - Maps HTTPS on maps2002 is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 8.724 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [19:22:15] PROBLEM - Maps HTTPS on maps2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [19:22:30] RECOVERY - Maps HTTPS on 
maps2003 is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 6.907 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [19:24:12] PROBLEM - LVS HTTP IPv4 on kartotherian.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [19:24:53] PROBLEM - Maps HTTPS on maps2004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [19:25:49] RECOVERY - Maps HTTPS on maps2001 is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 3.019 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [19:25:53] PROBLEM - LVS HTTPS IPv4 #page on kartotherian.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [19:26:17] (03PS13) 10ArielGlenn: write out and reuse pagerage info for big page content jobs [dumps] - 10https://gerrit.wikimedia.org/r/566580 (https://phabricator.wikimedia.org/T243434) [19:26:39] (03CR) 10jerkins-bot: [V: 04-1] write out and reuse pagerage info for big page content jobs [dumps] - 10https://gerrit.wikimedia.org/r/566580 (https://phabricator.wikimedia.org/T243434) (owner: 10ArielGlenn) [19:27:44] RECOVERY - LVS HTTPS IPv4 #page on kartotherian.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 7.737 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [19:27:53] RECOVERY - LVS HTTP IPv4 on kartotherian.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 3.668 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [19:28:05] PROBLEM - Maps HTTPS on maps2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [19:28:34] RECOVERY - Maps HTTPS on maps2004 is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 4.446 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [19:29:52] RECOVERY - Maps 
HTTPS on maps2003 is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 8.569 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [19:30:02] PROBLEM - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - CRITICAL - kartotherian-ssl_443: Servers maps2003.codfw.wmnet, maps2001.codfw.wmnet, maps2004.codfw.wmnet, maps2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:31:51] 10Operations: Tracking task: 2020-02-04 kartotherian outage - https://phabricator.wikimedia.org/T244278 (10RLazarus) [19:33:28] PROBLEM - LVS HTTP IPv4 on kartotherian.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [19:33:31] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10Papaul) [19:34:22] (03PS1) 10Papaul: DNS; Add mgmt and production DNS for mw2310 to mw2334 [dns] - 10https://gerrit.wikimedia.org/r/570127 [19:37:04] RECOVERY - LVS HTTP IPv4 on kartotherian.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 5.347 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [19:37:12] (03PS1) 10CDanis: upload VCL: maps: block Referer from twpkinfo.com [puppet] - 10https://gerrit.wikimedia.org/r/570129 (https://phabricator.wikimedia.org/T244278) [19:37:16] PROBLEM - Maps HTTPS on maps2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [19:38:35] 10Operations, 10Performance-Team, 10serviceops, 10Wikimedia-production-error: Wiki diffs take over 15s to load - https://phabricator.wikimedia.org/T244058 (10aaron) Links to old (non-current) versions do not use the parser cache. This means that rendering will always require a full parse. Profiling info...
[19:40:40] PROBLEM - LVS HTTPS IPv4 #page on kartotherian.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [19:40:58] RECOVERY - Maps HTTPS on maps2002 is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 9.401 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [19:41:32] (03PS2) 10Eevans: Configure group1 for kask-transition (multi-write kask/redis) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569678 (https://phabricator.wikimedia.org/T243106) [19:42:03] 10Operations, 10Patch-For-Review: Tracking task: 2020-02-04 kartotherian outage - https://phabricator.wikimedia.org/T244278 (10Jdforrester-WMF) p:05Triage→03Unbreak! [19:42:07] (03CR) 10Ppchelko: Configure group1 for kask-transition (multi-write kask/redis) (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569678 (https://phabricator.wikimedia.org/T243106) (owner: 10Eevans) [19:42:14] (03PS2) 10CDanis: upload VCL: maps: block Referer from twpkinfo.com [puppet] - 10https://gerrit.wikimedia.org/r/570129 (https://phabricator.wikimedia.org/T244278) [19:42:41] PROBLEM - LVS HTTP IPv4 on kartotherian.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [19:42:46] PROBLEM - Maps HTTPS on maps2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [19:43:15] (03PS14) 10ArielGlenn: write out and reuse pagerange info for big page content jobs [dumps] - 10https://gerrit.wikimedia.org/r/566580 (https://phabricator.wikimedia.org/T243434) [19:43:17] (03PS1) 10Cmjohnson: adding mgmt dns for cloudvirt-wdqs1001-3 [dns] - 10https://gerrit.wikimedia.org/r/570130 (https://phabricator.wikimedia.org/T235685) [19:44:06] (03CR) 10CDanis: "0 tests failed, 0 tests skipped, 15 tests passed" [puppet] - 10https://gerrit.wikimedia.org/r/570129 (https://phabricator.wikimedia.org/T244278) 
(owner: 10CDanis) [19:44:22] PROBLEM - Maps HTTPS on maps2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [19:45:37] (03CR) 10Cmjohnson: [C: 03+2] adding mgmt dns for cloudvirt-wdqs1001-3 [dns] - 10https://gerrit.wikimedia.org/r/570130 (https://phabricator.wikimedia.org/T235685) (owner: 10Cmjohnson) [19:46:20] RECOVERY - LVS HTTP IPv4 on kartotherian.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 8.591 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [19:46:23] (03PS3) 10Eevans: Configure group1 for kask-transition (multi-write kask/redis) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569678 (https://phabricator.wikimedia.org/T243106) [19:46:26] RECOVERY - Maps HTTPS on maps2003 is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 9.041 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [19:46:30] PROBLEM - Maps HTTPS on maps2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [19:47:04] PROBLEM - Maps HTTPS on maps2004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [19:47:12] (03CR) 10Ema: [C: 03+1] upload VCL: maps: block Referer from twpkinfo.com [puppet] - 10https://gerrit.wikimedia.org/r/570129 (https://phabricator.wikimedia.org/T244278) (owner: 10CDanis) [19:47:13] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10cloud-services-team (Hardware): (No Need By Date Provided) rack/setup/install cloudvirt-wdqs100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T235685 (10Cmjohnson) @Jclark-ctr if you can do bios and idrac please. Below is the mgmt ip cloudvirt-wdqs... 
[19:47:16] (03CR) 10CDanis: [C: 03+2] upload VCL: maps: block Referer from twpkinfo.com [puppet] - 10https://gerrit.wikimedia.org/r/570129 (https://phabricator.wikimedia.org/T244278) (owner: 10CDanis) [19:47:58] RECOVERY - Maps HTTPS on maps2001 is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 5.210 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [19:48:02] RECOVERY - LVS HTTPS IPv4 #page on kartotherian.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 7.451 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [19:48:20] RECOVERY - Maps HTTPS on maps2002 is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 6.341 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [19:50:05] (03PS4) 10Eevans: Configure group0 & group1 for kask-transition (multi-write kask/redis) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569678 (https://phabricator.wikimedia.org/T243106) [19:52:34] 10Operations, 10ops-eqiad, 10Patch-For-Review: setup/install cescout1001.eqiad.wmnet - https://phabricator.wikimedia.org/T239250 (10Cmjohnson) [19:53:29] (03CR) 10Ppchelko: [C: 03+1] Configure group0 & group1 for kask-transition (multi-write kask/redis) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569678 (https://phabricator.wikimedia.org/T243106) (owner: 10Eevans) [19:53:36] PROBLEM - Maps HTTPS on maps2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [19:53:53] 10Operations, 10ops-eqiad, 10Patch-For-Review: setup/install cescout1001.eqiad.wmnet - https://phabricator.wikimedia.org/T239250 (10Cmjohnson) a:05Cmjohnson→03ssingh @ssingh this server is ready for implementation. 
I have assigned it to you and removed the ops-eqiad tag [19:54:11] 10Operations, 10Patch-For-Review: setup/install cescout1001.eqiad.wmnet - https://phabricator.wikimedia.org/T239250 (10Cmjohnson) [19:55:40] PROBLEM - Maps HTTPS on maps2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [19:57:12] RECOVERY - Maps HTTPS on maps2001 is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 6.035 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [19:57:48] cmjohnson1: thank you for your help [19:58:08] YW [19:58:12] RECOVERY - Maps HTTPS on maps2004 is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 9.945 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [19:59:28] PROBLEM - Maps HTTPS on maps2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [20:00:04] twentyafterfour and marxarelli: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Mediawiki train - American Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200204T2000). [20:01:20] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: es1019: reseat IPMI - https://phabricator.wikimedia.org/T243963 (10Cmjohnson) @jcrespo yes, this seems to be an issue and when it's down I will see if there are any f/w updates for it. Is this something you need right away? I can put it on the schedule for Thur...
[20:02:15] (03PS1) 10Volans: sre.switchdc.mediawiki: adapt to current status [cookbooks] - 10https://gerrit.wikimedia.org/r/570131 (https://phabricator.wikimedia.org/T243316) [20:03:04] RECOVERY - Maps HTTPS on maps2002 is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 3.693 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [20:04:06] (03PS2) 10Volans: sre.switchdc.mediawiki: adapt to current status [cookbooks] - 10https://gerrit.wikimedia.org/r/570131 (https://phabricator.wikimedia.org/T243316) [20:06:00] !log temporarily holding 1.35.0-wmf.18 [T233866] branch cut and train due to concurrent maps prod issues [20:06:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:04] T233866: 1.35.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T233866 [20:06:25] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team: Refresh switch ports descriptions for recently renamed cloud servers - https://phabricator.wikimedia.org/T201444 (10Cmjohnson) 05Open→03Resolved all have been updated [20:06:26] marxarelli: I think you can proceed [20:06:33] marxarelli: the current issues are only with maps, not mediawiki [20:06:48] kk. thanks! [20:06:50] twentyafterfour: ^ [20:07:16] (03CR) 10Volans: "Adding interested reviewers, feel free to review only the bits you're involved with."
[cookbooks] - 10https://gerrit.wikimedia.org/r/570131 (https://phabricator.wikimedia.org/T243316) (owner: 10Volans) [20:07:19] cdanis: marxarelli: thanks [20:08:20] PROBLEM - Maps HTTPS on maps2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [20:08:40] PROBLEM - Maps HTTPS on maps2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [20:10:30] RECOVERY - Maps HTTPS on maps2002 is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 7.252 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [20:11:06] PROBLEM - Maps HTTPS on maps2004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [20:15:41] RECOVERY - Maps HTTPS on maps2001 is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 7.576 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [20:16:04] PROBLEM - Maps HTTPS on maps2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [20:17:00] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:17:14] RECOVERY - Maps HTTPS on maps2002 is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 4.350 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [20:18:13] RECOVERY - Maps HTTPS on maps2003 is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 8.889 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [20:18:54] RECOVERY - Maps HTTPS on maps2004 is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 4.798 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [20:19:00] RECOVERY - MediaWiki exceptions and fatals per minute on 
icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:20:02] !log branching mediawiki to wmf/1.35.0-wmf.18 from commit 054dd94e97d6 - train blockers should be added as subtasks under T233866 [20:20:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:06] T233866: 1.35.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T233866 [20:21:18] PROBLEM - Maps HTTPS on maps2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [20:21:26] PROBLEM - Maps HTTPS on maps2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [20:21:46] twentyafterfour: Oops, did you mis-create 1.34.0-wmf.18? [20:24:12] James_F: yeah already working on fixing that ;) [20:24:21] Cool. [20:24:43] !log temporarily enable access logs on maps2001 [20:24:44] this is the one drawback to those tempting, easy, copy/pastable examples in the docs [20:24:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:08] (03PS1) 10Andrew Bogott: archive-instances.py: use keystoneauth1.session instead of keystoneclient.session [puppet] - 10https://gerrit.wikimedia.org/r/570132 (https://phabricator.wikimedia.org/T241045) [20:25:30] PROBLEM - Maps HTTPS on maps2004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [20:26:04] (03CR) 10jerkins-bot: [V: 04-1] archive-instances.py: use keystoneauth1.session instead of keystoneclient.session [puppet] - 10https://gerrit.wikimedia.org/r/570132 (https://phabricator.wikimedia.org/T241045) (owner: 10Andrew Bogott) [20:26:58] RECOVERY - Maps HTTPS on maps2003 is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 9.660 second response time 
https://wikitech.wikimedia.org/wiki/Maps/RunBook [20:27:56] RECOVERY - Maps HTTPS on maps2004 is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 9.773 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [20:28:03] (03PS2) 10Andrew Bogott: archive-instances.py: use keystoneauth1.session [puppet] - 10https://gerrit.wikimedia.org/r/570132 (https://phabricator.wikimedia.org/T241045) [20:28:24] RECOVERY - Maps HTTPS on maps2002 is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 6.409 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [20:29:24] (03CR) 10Andrew Bogott: [C: 03+2] archive-instances.py: use keystoneauth1.session [puppet] - 10https://gerrit.wikimedia.org/r/570132 (https://phabricator.wikimedia.org/T241045) (owner: 10Andrew Bogott) [20:31:47] !log restart kartotherian on maps2001 [20:31:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:03] PROBLEM - Maps HTTPS on maps2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.153 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [20:32:18] RECOVERY - PyBal backends health check on lvs2003 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:33:10] 10Operations, 10Wikimedia-Incident: Tracking task: 2020-02-04 kartotherian outage - https://phabricator.wikimedia.org/T244278 (10Peachey88) [20:33:36] RECOVERY - Maps HTTPS on maps2001 is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 8.373 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [20:33:38] PROBLEM - Maps HTTPS on maps2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 9.874 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [20:34:38] RECOVERY - PyBal backends health check on lvs2006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:35:02] RECOVERY - Maps HTTPS on maps2003 is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 5.684 
second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [20:36:06] PROBLEM - kartotherian endpoints health on maps2004 is CRITICAL: /v4/marker/pin-m-fuel+ffffff@2x.png (scaled pushpin marker with an icon) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [20:40:50] PROBLEM - PyBal backends health check on lvs2006 is CRITICAL: PYBAL CRITICAL - CRITICAL - kartotherian-ssl_443: Servers maps2001.codfw.wmnet, maps2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [20:40:58] PROBLEM - Maps HTTPS on maps2004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [20:41:28] PROBLEM - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - CRITICAL - kartotherian-ssl_443: Servers maps2001.codfw.wmnet, maps2004.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [20:41:40] PROBLEM - Maps HTTPS on maps2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [20:42:58] PROBLEM - Maps HTTPS on maps2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [20:43:08] RECOVERY - PyBal backends health check on lvs2003 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:43:26] (03PS1) 10Jeena Huneidi: Add mounts property to scaffold values [deployment-charts] - 10https://gerrit.wikimedia.org/r/570137 [20:44:15] (03PS2) 10Jeena Huneidi: Add mounts property to scaffold values [deployment-charts] - 10https://gerrit.wikimedia.org/r/570137 [20:44:16] RECOVERY - Maps HTTPS on maps2004 is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 5.671 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [20:44:38] RECOVERY - Maps HTTPS on maps2001 is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 8.313 second response time 
https://wikitech.wikimedia.org/wiki/Maps/RunBook [20:46:07] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: es1019: reseat IPMI - https://phabricator.wikimedia.org/T243963 (10Marostegui) @cmjohnson - Thursday sounds good. I can leave the host depooled, downtimed and off, so you can tackle it in your afternoon. Just leave it powered on once you are done and I will take... [20:48:28] PROBLEM - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - CRITICAL - kartotherian-ssl_443: Servers maps2003.codfw.wmnet, maps2001.codfw.wmnet, maps2004.codfw.wmnet, maps2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [20:49:23] marxarelli: want to rubber stamp the branch? https://gerrit.wikimedia.org/r/c/mediawiki/core/+/570138 [20:49:41] sure! [20:50:32] it appears that icinga is now happy? should I go ahead with moving testwikis to wmf.18 ? [20:52:17] twentyafterfour: rubber stamped but jobs are queued in zuul :/ [20:52:32] looks like a lot of backup in gate-and-submit currently [20:53:14] twentyafterfour: icinga is flapping about maps ..still ongoing.. 
but it has been said deployment can go ahead [20:53:28] PROBLEM - Maps HTTPS on maps2004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [20:54:08] RECOVERY - Maps HTTPS on maps2002 is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 5.914 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [20:54:51] (03PS1) 10Alexandros Kosiaris: maps: Only allow our canonical domains [puppet] - 10https://gerrit.wikimedia.org/r/570140 (https://phabricator.wikimedia.org/T244278) [20:55:00] mutante: marxarelli: thanks [20:55:04] PROBLEM - kartotherian endpoints health on maps2004 is CRITICAL: /v4/marker/pin-m-fuel+ffffff@2x.png (scaled pushpin marker with an icon) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [20:55:06] RECOVERY - PyBal backends health check on lvs2006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:55:16] RECOVERY - Maps HTTPS on maps2004 is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 7.048 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [20:56:22] (03PS2) 10Alexandros Kosiaris: maps: Only allow our canonical domains [puppet] - 10https://gerrit.wikimedia.org/r/570140 (https://phabricator.wikimedia.org/T244278) [20:57:25] (03CR) 10Alexandros Kosiaris: "Per brandon," [puppet] - 10https://gerrit.wikimedia.org/r/570140 (https://phabricator.wikimedia.org/T244278) (owner: 10Alexandros Kosiaris) [20:57:57] (03CR) 10CDanis: [C: 03+1] maps: Only allow our canonical domains [puppet] - 10https://gerrit.wikimedia.org/r/570140 (https://phabricator.wikimedia.org/T244278) (owner: 10Alexandros Kosiaris) [20:59:49] (03PS3) 10Alexandros Kosiaris: maps: Only allow our canonical domains [puppet] - 10https://gerrit.wikimedia.org/r/570140 (https://phabricator.wikimedia.org/T244278) [21:00:28] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/570140 
(https://phabricator.wikimedia.org/T244278) (owner: 10Alexandros Kosiaris) [21:01:19] (03CR) 10Herron: maps: Only allow our canonical domains (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/570140 (https://phabricator.wikimedia.org/T244278) (owner: 10Alexandros Kosiaris) [21:01:36] (03CR) 10Alexandros Kosiaris: [C: 03+2] maps: Only allow our canonical domains [puppet] - 10https://gerrit.wikimedia.org/r/570140 (https://phabricator.wikimedia.org/T244278) (owner: 10Alexandros Kosiaris) [21:02:29] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal: Add Itamar Givon to the ldap/wmde group - https://phabricator.wikimedia.org/T244148 (10RStallman-legalteam) Thanks @ItamarWMDE, The NDA is fully signed and filed. I think the next steps are with the SRE team to get you access. [21:04:32] PROBLEM - Maps HTTPS on maps2004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [21:04:58] PROBLEM - Maps HTTPS on maps2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [21:06:10] PROBLEM - kartotherian endpoints health on maps2004 is CRITICAL: /v4/marker/pin-m-fuel+ffffff.png (Untitled test) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [21:06:13] RECOVERY - Maps HTTPS on maps2004 is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 0.204 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [21:06:48] RECOVERY - Maps HTTPS on maps2003 is OK: HTTP OK: HTTP/1.1 200 OK - 1286 bytes in 8.926 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [21:06:50] RECOVERY - PyBal backends health check on lvs2003 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:07:36] RECOVERY - Kartotherian LVS codfw #page on kartotherian.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Maps%23Kartotherian [21:07:52] RECOVERY - 
kartotherian endpoints health on maps2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [21:08:43] 10Operations, 10Performance-Team, 10serviceops, 10Wikimedia-production-error: Wiki diffs take over 15s to load - https://phabricator.wikimedia.org/T244058 (10Daimona) I forgot to say that the second example in the task description is unrelated. It was discussed with ops earlier today, and my comment can be... [21:15:27] (03CR) 10Herron: maps: Only allow our canonical domains (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/570140 (https://phabricator.wikimedia.org/T244278) (owner: 10Alexandros Kosiaris) [21:18:40] (03PS1) 10CDanis: maps: allow for possibility of trailing slash on our referers [puppet] - 10https://gerrit.wikimedia.org/r/570143 (https://phabricator.wikimedia.org/T244278) [21:19:10] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:20:39] (03PS2) 10CDanis: maps: allow for possibility of trailing slash on our referers [puppet] - 10https://gerrit.wikimedia.org/r/570143 (https://phabricator.wikimedia.org/T244278) [21:21:02] (03CR) 10RLazarus: [C: 03+1] maps: allow for possibility of trailing slash on our referers [puppet] - 10https://gerrit.wikimedia.org/r/570143 (https://phabricator.wikimedia.org/T244278) (owner: 10CDanis) [21:21:57] (03CR) 10CDanis: [C: 03+2] maps: allow for possibility of trailing slash on our referers [puppet] - 10https://gerrit.wikimedia.org/r/570143 (https://phabricator.wikimedia.org/T244278) (owner: 10CDanis) [21:26:30] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:27:45] (03PS15) 10ArielGlenn: write out and reuse pagerange info for big page content jobs [dumps] - 10https://gerrit.wikimedia.org/r/566580 (https://phabricator.wikimedia.org/T243434) [21:29:55] !log preparing the new mediawiki branch for deployment to test wikis [21:29:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:19] (03PS1) 1020after4: testwikis wikis to 1.35.0-wmf.18 refs T233866 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570144 [21:34:22] (03CR) 1020after4: [C: 03+2] testwikis wikis to 1.35.0-wmf.18 refs T233866 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570144 (owner: 1020after4) [21:35:22] (03Merged) 10jenkins-bot: testwikis wikis to 1.35.0-wmf.18 refs T233866 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570144 (owner: 1020after4) [21:37:30] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:41:48] !log twentyafterfour@deploy1001 Started scap: testwikis wikis to 1.35.0-wmf.18 refs T233866 [21:41:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:51] T233866: 1.35.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T233866 [21:44:17] (03PS1) 10CDanis: maps block: improve matching of the Referer regex [puppet] - 10https://gerrit.wikimedia.org/r/570147 [21:44:52] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers 
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:47:47] (03CR) 10Krinkle: maps block: improve matching of the Referer regex (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/570147 (owner: 10CDanis) [21:49:04] (03PS2) 10CDanis: maps block: improve matching of the Referer regex [puppet] - 10https://gerrit.wikimedia.org/r/570147 [21:49:19] twentyafterfour: only seeing timeout noise [21:50:13] (03CR) 10RLazarus: [C: 03+1] maps block: improve matching of the Referer regex [puppet] - 10https://gerrit.wikimedia.org/r/570147 (owner: 10CDanis) [21:50:55] marxarelli: yeah but I really don't like the idea of rolling the train when there is a lot of background noise that could obscure other issues [21:51:20] oh i thought you'd rolled it! [21:51:25] i see now, only testwiki [21:51:30] it's still syncing the branch [21:51:34] kk [21:51:41] so hasn't even reached testwikis yet [21:51:45] right [21:51:58] (03CR) 10CDanis: [C: 03+2] maps block: improve matching of the Referer regex (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/570147 (owner: 10CDanis) [21:52:13] then those timeouts are maps related? [21:59:24] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 58232536 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [22:01:12] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 33792 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [22:03:18] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:03:53] !log cdanis@cumin2001 conftool action : set/pooled=true; selector: dnsdisc=kartotherian,name=eqiad [22:03:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:09:54] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 36 probes of 521 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [22:13:51] !log twentyafterfour@deploy1001 Finished scap: testwikis wikis to 1.35.0-wmf.18 refs T233866 (duration: 32m 03s) [22:13:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:13:54] T233866: 1.35.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T233866 [22:15:44] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 34 probes of 521 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [22:21:18] (03CR) 10Papaul: [C: 03+1] add IP addresses for new install servers on buster [dns] - 10https://gerrit.wikimedia.org/r/569679 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [22:22:56] (03CR) 10Papaul: [C: 03+2] DNS; Add mgmt and production DNS for mw2310 to mw2334 [dns] - 10https://gerrit.wikimedia.org/r/570127 (owner: 10Papaul) [22:23:09] (03PS2) 10Papaul: DNS; Add mgmt and production DNS for mw2310 to mw2334 [dns] - 10https://gerrit.wikimedia.org/r/570127 [22:23:21] twentyafterfour: Time to roll to group0? Testwiki looks OK to me. 
[22:26:19] (03PS1) 10BBlack: [WIP] maps: block 3rd parties with 403, even hits [puppet] - 10https://gerrit.wikimedia.org/r/570156 (https://phabricator.wikimedia.org/T244278) [22:26:35] James_F: sure [22:27:04] (03PS1) 1020after4: group0 wikis to 1.35.0-wmf.18 refs T233866 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570157 [22:27:06] (03CR) 1020after4: [C: 03+2] group0 wikis to 1.35.0-wmf.18 refs T233866 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570157 (owner: 1020after4) [22:27:42] 10Operations, 10netops, 10cloud-services-team (Kanban): WMCS: cleanup network allocations - https://phabricator.wikimedia.org/T240670 (10ayounsi) 05Open→03Resolved I think everything is done here? [22:27:51] (03CR) 10Thcipriani: [C: 03+1] icinga: add irc,irc-releng to phabricator contactgroup [puppet] - 10https://gerrit.wikimedia.org/r/567550 (owner: 10Dzahn) [22:28:15] (03Merged) 10jenkins-bot: group0 wikis to 1.35.0-wmf.18 refs T233866 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570157 (owner: 1020after4) [22:29:15] (03PS4) 10Dzahn: define 2 API appservers per row in codfw as canary API appservers [puppet] - 10https://gerrit.wikimedia.org/r/564175 (https://phabricator.wikimedia.org/T242606) [22:32:33] 10Operations, 10SRE-tools, 10netops, 10Goal, and 2 others: Configuration management for network operations - https://phabricator.wikimedia.org/T228388 (10ayounsi) [22:32:57] (03PS1) 10Volans: mediawiki: use cumin alias instead of role query [software/spicerack] - 10https://gerrit.wikimedia.org/r/570159 (https://phabricator.wikimedia.org/T243935) [22:32:59] (03PS1) 10Volans: dnsdisc: fix typo in docstring [software/spicerack] - 10https://gerrit.wikimedia.org/r/570160 [22:33:01] (03PS1) 10Volans: mysql: adapt Cumin queries to select DBs [software/spicerack] - 10https://gerrit.wikimedia.org/r/570161 (https://phabricator.wikimedia.org/T243935) [22:33:03] (03CR) 1020after4: [C: 03+1] icinga: add twentyafterfour to gerrit contactgroup [puppet] - 
10https://gerrit.wikimedia.org/r/567544 (owner: 10Dzahn) [22:33:43] James_F: syncing to group0 now [22:34:07] 10Operations, 10SRE-tools, 10netops, 10Goal, and 2 others: Configuration management for network operations - https://phabricator.wikimedia.org/T228388 (10ayounsi) Everything here is done. Doc is there https://wikitech.wikimedia.org/wiki/Homer and has been tested by other SREs than Riccardo or me. Future d... [22:34:17] 10Operations, 10SRE-tools, 10netops, 10Goal, and 2 others: Configuration management for network operations - https://phabricator.wikimedia.org/T228388 (10ayounsi) 05Open→03Resolved [22:35:16] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.35.0-wmf.18 refs T233866 [22:35:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:35:20] T233866: 1.35.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T233866 [22:37:00] 10Operations, 10netbox: Netbox: fill network topology - https://phabricator.wikimedia.org/T205897 (10ayounsi) [22:37:29] (03CR) 10jerkins-bot: [V: 04-1] mediawiki: use cumin alias instead of role query [software/spicerack] - 10https://gerrit.wikimedia.org/r/570159 (https://phabricator.wikimedia.org/T243935) (owner: 10Volans) [22:37:38] (03CR) 10jerkins-bot: [V: 04-1] dnsdisc: fix typo in docstring [software/spicerack] - 10https://gerrit.wikimedia.org/r/570160 (owner: 10Volans) [22:38:18] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:38:36] (03CR) 10jerkins-bot: [V: 04-1] mysql: adapt Cumin queries to select DBs [software/spicerack] - 10https://gerrit.wikimedia.org/r/570161 (https://phabricator.wikimedia.org/T243935) (owner: 10Volans) [22:39:12] that looks deployment 
related ^ [22:44:21] https://phabricator.wikimedia.org/T244300 [22:44:24] 10Operations, 10Performance-Team, 10SRE-Access-Requests: Requesting access to deployment for dpifke - https://phabricator.wikimedia.org/T244183 (10greg) Approved from my end. [22:44:35] I just filed https://phabricator.wikimedia.org/T244299, same issue [22:44:53] beat you by 3 minutes :) [22:45:01] (03PS2) 10Volans: mediawiki: use cumin alias instead of role query [software/spicerack] - 10https://gerrit.wikimedia.org/r/570159 (https://phabricator.wikimedia.org/T243935) [22:45:03] (03PS2) 10Volans: dnsdisc: fix typo in docstring [software/spicerack] - 10https://gerrit.wikimedia.org/r/570160 [22:45:05] (03PS2) 10Volans: mysql: adapt Cumin queries to select DBs [software/spicerack] - 10https://gerrit.wikimedia.org/r/570161 (https://phabricator.wikimedia.org/T243935) [22:45:36] musikanimal: it's a race! [22:45:46] thanks for reporting it [22:45:53] yours is more properly formatted, feel free to merge [22:47:17] that is amazing that it has logstash integration. I'll remember to use that subtype next time [22:48:21] musikanimal: there is a tab in kibana called "phatality" which offers a button to submit an error directly to phab without copying/pasting all the details, phatality formats the task for you like you see in T244300 [22:48:21] T244300: Argument 1 passed to Title::getLanguageConverter() must be an instance of Language, instance of StubUserLang given, called in /srv/mediawiki/php-1.35.0-wmf.18/includes/Title.php on line 207 - https://phabricator.wikimedia.org/T244300 [22:49:58] to get to the phatality submission button you first expand an error in the events list (click the little arrow in the far left column) then click "phatality" in the area that expands below the row you clicked. 
Then click the "+submit" button which will take you to the phabricator task form with all the details pre-filled for you [22:50:12] musikanimal: ^ fyi [22:50:38] Almost certainly caused by https://github.com/wikimedia/mediawiki/commit/61e0908fa2915d73243686c4013f0af244fbc7f2 [22:51:06] thanks [22:53:40] very cool! [22:55:36] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [23:06:38] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [23:07:44] (03CR) 10Volans: mysql: adapt Cumin queries to select DBs (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/570161 (https://phabricator.wikimedia.org/T243935) (owner: 10Volans) [23:08:58] 10Operations, 10SRE-tools, 10Patch-For-Review: Audit all cumin queries in switchdc scripts - https://phabricator.wikimedia.org/T243935 (10Volans) p:05Triage→03Normal [23:11:52] (03PS1) 10Mholloway: Add chart for mobileapps [deployment-charts] - 10https://gerrit.wikimedia.org/r/570162 (https://phabricator.wikimedia.org/T218733) [23:29:31] (03CR) 10Dzahn: [C: 03+2] icinga: add twentyafterfour to gerrit contactgroup [puppet] - 10https://gerrit.wikimedia.org/r/567544 (owner: 10Dzahn) [23:29:43] 10Operations, 10Beta-Cluster-Infrastructure: Puppet broken on Beta Cluster app server - 
https://phabricator.wikimedia.org/T244306 (10Krinkle) [23:30:38] (03CR) 10Dzahn: [C: 03+2] icinga: add irc,irc-releng to phabricator contactgroup [puppet] - 10https://gerrit.wikimedia.org/r/567550 (owner: 10Dzahn) [23:30:47] (03PS2) 10Dzahn: icinga: add irc,irc-releng to phabricator contactgroup [puppet] - 10https://gerrit.wikimedia.org/r/567550 [23:30:56] 10Operations, 10Beta-Cluster-Infrastructure: Puppet broken on Beta Cluster app server - https://phabricator.wikimedia.org/T244306 (10Krinkle) [23:31:00] (03CR) 10Thcipriani: mediawiki: check mw versions match those on the deploy server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/566708 (https://phabricator.wikimedia.org/T242023) (owner: 10Hnowlan) [23:31:32] 10Operations, 10Beta-Cluster-Infrastructure: Puppet broken on Beta Cluster app server - https://phabricator.wikimedia.org/T244306 (10Krenair) [23:31:34] 10Operations, 10Beta-Cluster-Infrastructure: Upgrade puppet in deployment-prep - https://phabricator.wikimedia.org/T243226 (10Krenair) [23:34:23] 10Operations, 10Beta-Cluster-Infrastructure: Upgrade puppet in deployment-prep (Puppet agent broken in Beta Cluster) - https://phabricator.wikimedia.org/T243226 (10Krinkle) [23:41:50] (03CR) 10Dzahn: [C: 03+2] contint: Fix doc.wikimedia.org/favicon.ico 404 error [puppet] - 10https://gerrit.wikimedia.org/r/570107 (owner: 10Krinkle) [23:42:23] (03PS1) 10Papaul: DHCP: Add MAC address entries for mw2310 to mw2334 [puppet] - 10https://gerrit.wikimedia.org/r/570166 (https://phabricator.wikimedia.org/T241852) [23:44:36] (03CR) 10Krinkle: "Unable to test currently due to T243226." [puppet] - 10https://gerrit.wikimedia.org/r/554599 (owner: 10Krinkle) [23:52:49] (03CR) 10Dzahn: [C: 03+1] "looks good to me, one tiny nitpick. don't see obvious. did not actually check the MAC addresses on DRAC." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/570166 (https://phabricator.wikimedia.org/T241852) (owner: 10Papaul) [23:57:21] (03PS2) 10Papaul: DHCP: Add MAC address entries for mw2310 to mw2334 [puppet] - 10https://gerrit.wikimedia.org/r/570166 (https://phabricator.wikimedia.org/T241852)