[00:00:04] RoanKattouw, Niharika, and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Evening SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200205T0000). [00:00:04] No GERRIT patches in the queue for this window AFAICS. [00:00:40] Operations, Gerrit: gerrit1002 running out of space - https://phabricator.wikimedia.org/T243808 (thcipriani) >>! In T243808#5835393, @Dzahn wrote: > @thcipriani ^ This is back to 94% as of right now after ^. And it's been downtime for a month. Is the test instance usable with the current size? > > Als... [00:01:51] (CR) Papaul: [C: +2] DHCP: Add MAC address entries for mw2310 to mw2334 [puppet] - https://gerrit.wikimedia.org/r/570166 (https://phabricator.wikimedia.org/T241852) (owner: Papaul) [00:03:04] Operations, DBA, Privacy Engineering, Traffic, and 4 others: dbtree loads third party resources (from jquery.com and google.com) - https://phabricator.wikimedia.org/T96499 (JFishback_WMF) [00:05:54] (PS1) Legoktm: Remove initial attempt at libraryupgrader puppetization [puppet] - https://gerrit.wikimedia.org/r/570169 (https://phabricator.wikimedia.org/T173478) [00:47:01] Operations, ops-codfw, serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (Papaul) [00:57:16] (CR) Volans: [C: +1] "LGTM" [debs/debdeploy] - https://gerrit.wikimedia.org/r/570054 (owner: Muehlenhoff) [01:32:34] Operations, WMF-Blog-Social-Team, WMF-Communications, Wikimedia-Mailing-lists: Delete mailing list "worldcup2018" - https://phabricator.wikimedia.org/T244316 (Aklapper) [01:40:52] Operations, ops-eqiad, DBA, DC-Ops: es1019: reseat IPMI - https://phabricator.wikimedia.org/T243963 (jcrespo) > > > yes, this seems to be an issue > I hope you understood this was a rant directed towards the machine/vendor only and for background info. I don't think we will get rid of it until we...
[01:52:13] (PS1) Jdlrobson: Drop enwiki mainpage special casing [mediawiki-config] - https://gerrit.wikimedia.org/r/570180 (https://phabricator.wikimedia.org/T32405) [01:53:19] (CR) jerkins-bot: [V: -1] Drop enwiki mainpage special casing [mediawiki-config] - https://gerrit.wikimedia.org/r/570180 (https://phabricator.wikimedia.org/T32405) (owner: Jdlrobson) [01:58:02] (CR) Jdlrobson: "composer buildDBLists is not working for me locally so am hoping deployer can help me fix that part." [mediawiki-config] - https://gerrit.wikimedia.org/r/570180 (https://phabricator.wikimedia.org/T32405) (owner: Jdlrobson) [02:03:43] Operations, Traffic, Wikimedia-Blog, HTTPS: Change automatic shortlink in blog theme - https://phabricator.wikimedia.org/T165511 (Varnent) Open→Declined This site has been closed and is no longer being actively developed. [02:06:03] Operations, Traffic, Wikimedia-Blog, HTTPS: make blog links from wmfwiki front page use HTTPS links - https://phabricator.wikimedia.org/T104728 (Varnent) [02:12:22] (PS1) Jdlrobson: wgLogoHD and $wgVectorPrintLogo is replaced with wgLogos [mediawiki-config] - https://gerrit.wikimedia.org/r/570186 (https://phabricator.wikimedia.org/T232140) [02:13:29] (CR) jerkins-bot: [V: -1] wgLogoHD and $wgVectorPrintLogo is replaced with wgLogos [mediawiki-config] - https://gerrit.wikimedia.org/r/570186 (https://phabricator.wikimedia.org/T232140) (owner: Jdlrobson) [02:19:49] Operations, WMF-Blog-Social-Team, WMF-Communications, Wikimedia-Mailing-lists: Delete mailing list "worldcup2018" - https://phabricator.wikimedia.org/T244316 (Zoranzoki21) Yes, this is right!
[02:38:42] !log T243634 ✔️ cdanis@cp4030.ulsfo.wmnet ~ 🕤🍺 sudo varnish-frontend-restart [02:38:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:38:47] T243634: ulsfo varnish-fe vcache processes overflow on FDs - https://phabricator.wikimedia.org/T243634 [03:17:54] Operations, ops-codfw, serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (Papaul) |servers|ready for service| |mw2310|yes| |mw2311|yes| |mw2312|yes| |mw2313|yes| |mw2314|yes| |mw2315|yes| |mw2316|yes| |mw2317|yes| |mw2318|yes| |mw2319|yes| |mw2320|yes| |m... [05:07:56] (PS2) Ammarpad: Drop enwiki mainpage special casing [mediawiki-config] - https://gerrit.wikimedia.org/r/570180 (https://phabricator.wikimedia.org/T32405) (owner: Jdlrobson) [05:22:00] (CR) Jdlrobson: [C: -1] "Note to deployer: I only want to test this config change on wikimedia debug - I don't want to deploy this right now. I'll sync with deploy" [mediawiki-config] - https://gerrit.wikimedia.org/r/570180 (https://phabricator.wikimedia.org/T32405) (owner: Jdlrobson) [05:27:17] PROBLEM - Maps - OSM synchronization lag - codfw on icinga1001 is CRITICAL: 1.002e+06 ge 2.592e+05 https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=12&fullscreen&orgId=1 [05:43:59] PROBLEM - Maps - OSM synchronization lag - eqiad on icinga1001 is CRITICAL: 1.003e+06 ge 2.592e+05 https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=11&fullscreen&orgId=1 [06:03:37] (PS1) Marostegui: db2086: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/570192 (https://phabricator.wikimedia.org/T239453) [06:04:47] (CR) Marostegui: [C: +2] db2086: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/570192 (https://phabricator.wikimedia.org/T239453) (owner: Marostegui) [06:07:55] (PS1) Marostegui: db1098: 
Disable notifications [puppet] - https://gerrit.wikimedia.org/r/570194 (https://phabricator.wikimedia.org/T239453) [06:09:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2085:3311, db2086:3317 - T239453', diff saved to https://phabricator.wikimedia.org/P10311 and previous config saved to /var/cache/conftool/dbconfig/20200205-060911-marostegui.json [06:09:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:09:16] T239453: Remove partitions from revision table - https://phabricator.wikimedia.org/T239453 [06:09:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1098:3317 - T239453', diff saved to https://phabricator.wikimedia.org/P10312 and previous config saved to /var/cache/conftool/dbconfig/20200205-060942-marostegui.json [06:09:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:09:53] (CR) Marostegui: [C: +2] db1098: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/570194 (https://phabricator.wikimedia.org/T239453) (owner: Marostegui) [06:12:53] !log Remove partitions from revision table db1098:3317 - T239453 [06:12:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:27:38] PROBLEM - DPKG on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:27:38] PROBLEM - Check size of conntrack table on ores2005 is CRITICAL: connect to address 10.192.32.173 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:27:38] PROBLEM - Check size of conntrack table on ores2009 is CRITICAL: connect to address 10.192.48.90 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:27:42] PROBLEM - puppet last run on ores2005 is CRITICAL: connect to address 10.192.32.173 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun 
[06:27:42] PROBLEM - puppet last run on ores2007 is CRITICAL: connect to address 10.192.48.88 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:27:42] PROBLEM - dhclient process on ores2006 is CRITICAL: connect to address 10.192.32.174 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:27:44] PROBLEM - Check systemd state on ores1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:27:44] PROBLEM - Check whether ferm is active by checking the default input chain on ores2009 is CRITICAL: connect to address 10.192.48.90 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:27:44] PROBLEM - ores uWSGI web app on ores2006 is CRITICAL: connect to address 10.192.32.174 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores [06:27:46] PROBLEM - DPKG on ores1006 is CRITICAL: connect to address 10.64.32.15 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:27:46] PROBLEM - Check systemd state on ores2005 is CRITICAL: connect to address 10.192.32.173 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:27:46] PROBLEM - configured eth on ores2002 is CRITICAL: connect to address 10.192.0.18 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:27:48] PROBLEM - dhclient process on ores1002 is CRITICAL: connect to address 10.64.0.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:27:50] PROBLEM - Check systemd state on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:27:50] PROBLEM - MD RAID on ores1001 is CRITICAL: connect to 
address 10.64.0.51 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:27:50] PROBLEM - Disk space on ores1001 is CRITICAL: connect to address 10.64.0.51 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1001&var-datasource=eqiad+prometheus/ops [06:27:50] PROBLEM - ores uWSGI web app on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores [06:27:51] PROBLEM - configured eth on ores1009 is CRITICAL: connect to address 10.64.48.28 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:27:51] PROBLEM - puppet last run on ores2002 is CRITICAL: connect to address 10.192.0.18 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:27:52] PROBLEM - dhclient process on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:27:52] PROBLEM - puppet last run on ores1004 is CRITICAL: connect to address 10.64.16.95 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:27:52] PROBLEM - DPKG on ores1003 is CRITICAL: connect to address 10.64.16.94 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:27:53] PROBLEM - Check whether ferm is active by checking the default input chain on ores2001 is CRITICAL: connect to address 10.192.0.12 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:27:54] PROBLEM - MD RAID on ores2005 is CRITICAL: connect to address 10.192.32.173 port 5666: Connection refused 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:27:54] PROBLEM - Disk space on ores2008 is CRITICAL: connect to address 10.192.48.89 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2008&var-datasource=codfw+prometheus/ops [06:27:54] PROBLEM - dhclient process on ores1001 is CRITICAL: connect to address 10.64.0.51 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:27:55] PROBLEM - puppet last run on ores1009 is CRITICAL: connect to address 10.64.48.28 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:27:55] PROBLEM - Check size of conntrack table on ores2001 is CRITICAL: connect to address 10.192.0.12 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:27:56] PROBLEM - ores uWSGI web app on ores2009 is CRITICAL: connect to address 10.192.48.90 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores [06:27:56] PROBLEM - Check systemd state on ores2006 is CRITICAL: connect to address 10.192.32.174 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:27:57] PROBLEM - ores uWSGI web app on ores1009 is CRITICAL: connect to address 10.64.48.28 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores [06:27:57] PROBLEM - configured eth on ores1003 is CRITICAL: connect to address 10.64.16.94 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:27:58] PROBLEM - dhclient process on ores2005 is CRITICAL: connect to address 10.192.32.173 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:27:58] PROBLEM - dhclient process on ores2007 is CRITICAL: connect to 
address 10.192.48.88 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:27:59] PROBLEM - DPKG on ores2008 is CRITICAL: connect to address 10.192.48.89 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:27:59] PROBLEM - configured eth on ores1001 is CRITICAL: connect to address 10.64.0.51 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:28:00] PROBLEM - dhclient process on ores1009 is CRITICAL: connect to address 10.64.48.28 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:28:00] PROBLEM - MD RAID on ores1003 is CRITICAL: connect to address 10.64.16.94 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:28:01] PROBLEM - Check whether ferm is active by checking the default input chain on ores2008 is CRITICAL: connect to address 10.192.48.89 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:28:01] PROBLEM - configured eth on ores2005 is CRITICAL: connect to address 10.192.32.173 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:28:02] PROBLEM - Check size of conntrack table on ores1002 is CRITICAL: connect to address 10.64.0.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:28:02] PROBLEM - Check whether ferm is active by checking the default input chain on ores1006 is CRITICAL: connect to address 10.64.32.15 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:28:03] PROBLEM - ores uWSGI web app on ores2004 is CRITICAL: connect to address 10.192.16.64 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores [06:28:03] PROBLEM - Check whether ferm is active by checking the default input chain on 
ores2004 is CRITICAL: connect to address 10.192.16.64 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:28:04] PROBLEM - Check systemd state on ores2002 is CRITICAL: connect to address 10.192.0.18 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:28:04] PROBLEM - configured eth on ores2006 is CRITICAL: connect to address 10.192.32.174 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:28:05] PROBLEM - dhclient process on ores2008 is CRITICAL: connect to address 10.192.48.89 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:28:06] PROBLEM - Disk space on ores2001 is CRITICAL: connect to address 10.192.0.12 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2001&var-datasource=codfw+prometheus/ops [06:28:06] PROBLEM - ores uWSGI web app on ores2005 is CRITICAL: connect to address 10.192.32.173 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores [06:28:06] PROBLEM - ores uWSGI web app on ores2007 is CRITICAL: connect to address 10.192.48.88 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores [06:28:07] PROBLEM - DPKG on ores2001 is CRITICAL: connect to address 10.192.0.12 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:28:08] PROBLEM - configured eth on ores2008 is CRITICAL: connect to address 10.192.48.89 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:28:10] PROBLEM - Check whether ferm is active by checking the default input chain on ores1004 is CRITICAL: connect to address 10.64.16.95 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:28:10] PROBLEM - DPKG on ores2002 is CRITICAL: 
connect to address 10.192.0.18 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:28:12] PROBLEM - DPKG on ores2004 is CRITICAL: connect to address 10.192.16.64 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:28:12] PROBLEM - DPKG on ores1002 is CRITICAL: connect to address 10.64.0.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:28:12] PROBLEM - MD RAID on ores2008 is CRITICAL: connect to address 10.192.48.89 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:28:16] PROBLEM - Disk space on ores2005 is CRITICAL: connect to address 10.192.32.173 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2005&var-datasource=codfw+prometheus/ops [06:28:20] PROBLEM - configured eth on ores1006 is CRITICAL: connect to address 10.64.32.15 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:28:20] PROBLEM - MD RAID on ores1004 is CRITICAL: connect to address 10.64.16.95 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:28:24] PROBLEM - Check size of conntrack table on ores2004 is CRITICAL: connect to address 10.192.16.64 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:28:24] PROBLEM - Check size of conntrack table on ores2006 is CRITICAL: connect to address 10.192.32.174 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:28:24] PROBLEM - Check size of conntrack table on ores2008 is CRITICAL: connect to address 10.192.48.89 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:28:26] 
PROBLEM - Check systemd state on ores2008 is CRITICAL: connect to address 10.192.48.89 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:28:30] PROBLEM - puppet last run on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:28:32] PROBLEM - dhclient process on ores2009 is CRITICAL: connect to address 10.192.48.90 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:28:32] PROBLEM - Disk space on ores1003 is CRITICAL: connect to address 10.64.16.94 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1003&var-datasource=eqiad+prometheus/ops [06:28:32] PROBLEM - Disk space on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1005&var-datasource=eqiad+prometheus/ops [06:28:32] PROBLEM - Check size of conntrack table on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:28:34] PROBLEM - DPKG on ores1001 is CRITICAL: connect to address 10.64.0.51 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:28:40] PROBLEM - Check systemd state on ores1003 is CRITICAL: connect to address 10.64.16.94 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:28:40] PROBLEM - Check systemd state on ores1001 is CRITICAL: connect to address 10.64.0.51 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:28:40] PROBLEM - configured eth on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: 
Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:28:43] PROBLEM - MD RAID on ores1009 is CRITICAL: connect to address 10.64.48.28 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:28:46] PROBLEM - Check whether ferm is active by checking the default input chain on ores2005 is CRITICAL: connect to address 10.192.32.173 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:28:46] PROBLEM - dhclient process on ores2001 is CRITICAL: connect to address 10.192.0.12 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:28:50] PROBLEM - Check whether ferm is active by checking the default input chain on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:28:50] PROBLEM - Check whether ferm is active by checking the default input chain on ores1002 is CRITICAL: connect to address 10.64.0.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:28:50] PROBLEM - Disk space on ores2009 is CRITICAL: connect to address 10.192.48.90 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2009&var-datasource=codfw+prometheus/ops [06:28:52] PROBLEM - Check whether ferm is active by checking the default input chain on ores1009 is CRITICAL: connect to address 10.64.48.28 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:28:52] PROBLEM - ores uWSGI web app on ores1002 is CRITICAL: connect to address 10.64.0.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores [06:28:52] PROBLEM - Check systemd state on ores1004 is CRITICAL: connect to address 10.64.16.95 port 
5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:28:52] PROBLEM - DPKG on ores2009 is CRITICAL: connect to address 10.192.48.90 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:28:52] PROBLEM - MD RAID on ores1006 is CRITICAL: connect to address 10.64.32.15 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:28:54] PROBLEM - DPKG on ores1009 is CRITICAL: connect to address 10.64.48.28 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:28:54] PROBLEM - Check size of conntrack table on ores2002 is CRITICAL: connect to address 10.192.0.18 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:28:56] PROBLEM - configured eth on ores2001 is CRITICAL: connect to address 10.192.0.12 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:28:56] PROBLEM - dhclient process on ores2004 is CRITICAL: connect to address 10.192.16.64 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:28:56] PROBLEM - ores uWSGI web app on ores2008 is CRITICAL: connect to address 10.192.48.89 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores [06:28:56] PROBLEM - DPKG on ores2006 is CRITICAL: connect to address 10.192.32.174 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:28:56] PROBLEM - MD RAID on ores2006 is CRITICAL: connect to address 10.192.32.174 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:28:58] PROBLEM - Disk space on ores1009 is CRITICAL: connect to address 10.64.48.28 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1009&var-datasource=eqiad+prometheus/ops [06:28:58] PROBLEM - Check size of conntrack table on ores1003 is CRITICAL: connect to address 10.64.16.94 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:29:00] PROBLEM - Check size of conntrack table on ores1001 is CRITICAL: connect to address 10.64.0.51 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:29:06] PROBLEM - dhclient process on ores1003 is CRITICAL: connect to address 10.64.16.94 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:29:06] PROBLEM - Check size of conntrack table on ores1006 is CRITICAL: connect to address 10.64.32.15 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:29:06] PROBLEM - Check systemd state on ores2001 is CRITICAL: connect to address 10.192.0.12 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:29:18] RECOVERY - dhclient process on ores2007 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:29:46] PROBLEM - Check systemd state on ores2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:29:48] PROBLEM - puppet last run on ores2001 is CRITICAL: connect to address 10.192.0.12 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:30:14] RECOVERY - DPKG on ores2006 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:30:14] RECOVERY - MD RAID on ores2006 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:30:24] RECOVERY - dhclient process on ores2006 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:30:26] RECOVERY - Check systemd state on ores1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:30:28] PROBLEM - Check the NTP synchronisation status of timesyncd on ores1004 is CRITICAL: connect to address 10.64.16.95 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/NTP [06:30:48] RECOVERY - configured eth on ores2006 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:31:06] RECOVERY - Check size of conntrack table on ores2006 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:31:14] PROBLEM - puppet last run on ores2008 is CRITICAL: connect to address 10.192.48.89 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:31:24] PROBLEM - ores uWSGI web app on ores1001 is CRITICAL: connect to address 10.64.0.51 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores [06:31:24] PROBLEM - Check systemd state on ores1006 is CRITICAL: connect to address 10.64.32.15 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[06:31:24] PROBLEM - MD RAID on ores2001 is CRITICAL: connect to address 10.192.0.12 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:31:24] PROBLEM - Check systemd state on ores2004 is CRITICAL: connect to address 10.192.16.64 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:32:22] PROBLEM - puppet last run on ores1006 is CRITICAL: connect to address 10.64.32.15 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:32:22] PROBLEM - puppet last run on ores1003 is CRITICAL: connect to address 10.64.16.94 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:32:30] !log force a puppet run on ores* hosts [06:32:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:32:48] PROBLEM - puppet last run on ores1001 is CRITICAL: connect to address 10.64.0.51 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:32:52] RECOVERY - Check whether ferm is active by checking the default input chain on ores2005 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:32:52] RECOVERY - dhclient process on ores2001 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:32:56] RECOVERY - Disk space on ores2009 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2009&var-datasource=codfw+prometheus/ops [06:32:58] RECOVERY - DPKG on ores2009 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:33:00] RECOVERY - Check size of conntrack table on ores2002 is OK: OK: nf_conntrack is 0 % full 
https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:33:00] RECOVERY - configured eth on ores2001 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:33:00] RECOVERY - dhclient process on ores2004 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:33:08] RECOVERY - Check size of conntrack table on ores2005 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:33:08] RECOVERY - Check size of conntrack table on ores2009 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:33:08] RECOVERY - puppet last run on ores2007 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:33:08] RECOVERY - puppet last run on ores2005 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:33:12] RECOVERY - Check systemd state on ores2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:33:14] RECOVERY - Check whether ferm is active by checking the default input chain on ores2009 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:33:16] RECOVERY - Check systemd state on ores2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:33:16] RECOVERY - configured eth on ores2002 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:33:16] RECOVERY - puppet last run on ores2002 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:33:22] RECOVERY - Check 
whether ferm is active by checking the default input chain on ores2001 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:33:22] RECOVERY - Disk space on ores2008 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2008&var-datasource=codfw+prometheus/ops [06:33:22] RECOVERY - MD RAID on ores2005 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:33:24] RECOVERY - Check size of conntrack table on ores2001 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:33:26] RECOVERY - dhclient process on ores2005 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:33:26] RECOVERY - DPKG on ores2008 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:33:28] RECOVERY - configured eth on ores2005 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:33:28] RECOVERY - Check whether ferm is active by checking the default input chain on ores2008 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:33:30] RECOVERY - Check whether ferm is active by checking the default input chain on ores2004 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:33:32] RECOVERY - Check systemd state on ores2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:33:34] RECOVERY - dhclient process on ores2008 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:33:34] RECOVERY - Disk space on ores2001 is OK: DISK 
OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2001&var-datasource=codfw+prometheus/ops [06:33:36] RECOVERY - DPKG on ores2001 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:33:38] PROBLEM - Disk space on ores1006 is CRITICAL: connect to address 10.64.32.15 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1006&var-datasource=eqiad+prometheus/ops [06:33:38] PROBLEM - ores uWSGI web app on ores1003 is CRITICAL: connect to address 10.64.16.94 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores [06:33:38] RECOVERY - configured eth on ores2008 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:33:40] RECOVERY - DPKG on ores2002 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:33:42] RECOVERY - DPKG on ores2004 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:33:44] RECOVERY - MD RAID on ores2008 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:33:48] RECOVERY - Disk space on ores2005 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2005&var-datasource=codfw+prometheus/ops [06:33:54] RECOVERY - Check size of conntrack table on ores2004 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:33:54] RECOVERY - Check size of conntrack table on ores2008 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:33:56] RECOVERY - Check systemd state on ores2003 is OK: OK - running: The system is fully 
operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:33:58] RECOVERY - Check systemd state on ores2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:34:02] RECOVERY - dhclient process on ores2009 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:34:16] RECOVERY - Check systemd state on ores2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:34:16] RECOVERY - MD RAID on ores2001 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:34:22] RECOVERY - Check systemd state on ores1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:34:30] RECOVERY - Disk space on ores1009 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1009&var-datasource=eqiad+prometheus/ops [06:34:30] RECOVERY - Check size of conntrack table on ores1003 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:34:34] RECOVERY - Check size of conntrack table on ores1001 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:34:34] RECOVERY - DPKG on ores1005 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:34:40] RECOVERY - dhclient process on ores1003 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:34:40] RECOVERY - Check size of conntrack table on ores1006 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack 
[06:34:44] RECOVERY - DPKG on ores1006 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:34:46] RECOVERY - dhclient process on ores1002 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:34:48] RECOVERY - Check systemd state on ores1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:34:48] RECOVERY - MD RAID on ores1001 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:34:50] RECOVERY - Disk space on ores1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1001&var-datasource=eqiad+prometheus/ops [06:34:50] RECOVERY - configured eth on ores1009 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:34:50] RECOVERY - dhclient process on ores1005 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:34:50] RECOVERY - DPKG on ores1003 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:34:54] RECOVERY - dhclient process on ores1001 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:34:54] RECOVERY - configured eth on ores1003 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:34:56] RECOVERY - configured eth on ores1001 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:34:56] RECOVERY - dhclient process on ores1009 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:34:56] RECOVERY - MD RAID on ores1003 is OK: OK: Active: 4, Working: 4, 
Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:34:58] RECOVERY - Check size of conntrack table on ores1002 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:34:58] RECOVERY - Check whether ferm is active by checking the default input chain on ores1006 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:35:10] RECOVERY - Disk space on ores1006 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1006&var-datasource=eqiad+prometheus/ops [06:35:10] RECOVERY - Check whether ferm is active by checking the default input chain on ores1004 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:35:14] RECOVERY - DPKG on ores1002 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:35:20] RECOVERY - puppet last run on ores2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:35:22] RECOVERY - configured eth on ores1006 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:35:22] RECOVERY - MD RAID on ores1004 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:35:36] RECOVERY - Disk space on ores1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1003&var-datasource=eqiad+prometheus/ops [06:35:36] RECOVERY - Disk space on ores1005 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1005&var-datasource=eqiad+prometheus/ops [06:35:36] RECOVERY - Check size of conntrack table on ores1005 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:35:38] RECOVERY - DPKG on ores1001 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:35:42] RECOVERY - Check systemd state on ores1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:35:42] RECOVERY - Check systemd state on ores1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:35:42] RECOVERY - configured eth on ores1005 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:35:44] RECOVERY - MD RAID on ores1009 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:35:46] RECOVERY - Check systemd state on ores1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:35:54] RECOVERY - Check whether ferm is active by checking the default input chain on ores1005 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:35:54] RECOVERY - Check whether ferm is active by checking the default input chain on ores1002 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:35:54] RECOVERY - Check whether ferm is active by checking the default input chain on ores1009 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:35:56] RECOVERY - MD RAID on ores1006 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:35:58] RECOVERY - DPKG on ores1009 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:36:21] 10Operations, 10ORES, 10Scoring-platform-team (Current): Ores celery OOM event in codfw - https://phabricator.wikimedia.org/T242705 (10elukey) Happened again today at 06:25UTC when logrotate ran, forced a puppet run on all hosts to recover quickly. [06:36:46] RECOVERY - puppet last run on ores2008 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:37:54] RECOVERY - puppet last run on ores1006 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:37:54] RECOVERY - puppet last run on ores1003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:38:20] RECOVERY - puppet last run on ores1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:38:48] RECOVERY - puppet last run on ores1004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:38:50] RECOVERY - puppet last run on ores1009 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:39:30] RECOVERY - puppet last run on ores1005 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:40:45] (03PS2) 10Jdlrobson: wgLogoHD and $wgVectorPrintLogo is replaced with wgLogos [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/570186 (https://phabricator.wikimedia.org/T232140) [06:41:12] RECOVERY - Check systemd state on ores2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:41:54] (03CR) 10jerkins-bot: [V: 04-1] wgLogoHD and $wgVectorPrintLogo is replaced with wgLogos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570186 (https://phabricator.wikimedia.org/T232140) (owner: 10Jdlrobson) [06:42:04] 10Operations, 10ORES, 10Scoring-platform-team (Current): Ores celery OOM event in codfw - https://phabricator.wikimedia.org/T242705 (10elukey) The 7 days view shows a nice increase of memory usage: https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&from=now-7d&to=now&var-datasource=eqiad%20p... [06:47:18] (03PS3) 10Jdlrobson: wgLogoHD and $wgVectorPrintLogo is replaced with wgLogos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570186 (https://phabricator.wikimedia.org/T232140) [06:48:20] (03CR) 10jerkins-bot: [V: 04-1] wgLogoHD and $wgVectorPrintLogo is replaced with wgLogos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570186 (https://phabricator.wikimedia.org/T232140) (owner: 10Jdlrobson) [06:51:06] (03PS4) 10Jdlrobson: wgLogoHD and $wgVectorPrintLogo is replaced with wgLogos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570186 (https://phabricator.wikimedia.org/T232140) [06:51:57] (03CR) 10jerkins-bot: [V: 04-1] wgLogoHD and $wgVectorPrintLogo is replaced with wgLogos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570186 (https://phabricator.wikimedia.org/T232140) (owner: 10Jdlrobson) [06:59:44] (03PS5) 10Jdlrobson: wgLogoHD and $wgVectorPrintLogo is replaced with wgLogos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570186 (https://phabricator.wikimedia.org/T232140) [07:00:51] RECOVERY - Check the NTP synchronisation status of timesyncd on ores1004 is OK: OK: synced at Wed 2020-02-05 07:00:50 UTC. 
https://wikitech.wikimedia.org/wiki/NTP [07:02:49] !log Replay s1 traffic on db1107 (10.4) T242702 [07:02:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:02:53] T242702: Test MariaDB 10.4 in production - https://phabricator.wikimedia.org/T242702 [07:15:26] 10Operations, 10ORES, 10Scoring-platform-team (Current): Ores celery OOM event in codfw - https://phabricator.wikimedia.org/T242705 (10elukey) There you go: https://tools.wmflabs.org/sal/log/AXAM254BfYQT6VcDATbh The deployment matches the start of the memory growth, @Halfak do you have any idea what caused... [07:17:54] 10Operations, 10Performance-Team, 10serviceops, 10Wikimedia-production-error: Wiki diffs take over 15s to load - https://phabricator.wikimedia.org/T244058 (10Joe) >>! In T244058#5849290, @aaron wrote: > Links to old (non-current) versions do not use the parser cache. This means that rendering will always... [07:57:25] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 54, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:58:09] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 74, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:12:19] marostegui: I have been doing the little clinic work since last week [08:12:24] can you add me to the topic ? [08:13:24] 10Operations, 10Puppet, 10Patch-For-Review, 10User-jbond: Extend Puppet CA Expiry date - https://phabricator.wikimedia.org/T236277 (10Marostegui) [08:13:27] sure [08:13:29] 10Operations, 10ORES, 10Scoring-platform-team (Current): Ores celery OOM event in codfw - https://phabricator.wikimedia.org/T242705 (10akosiaris) T243451 does explain the higher memory usage. It even points out that the higher memory usage is worrisome, however it was deployed anyway. 
[08:13:48] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Switch to yaml.safe_load to loading update spec files [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/570054 (owner: 10Muehlenhoff) [08:14:53] (03PS1) 10Vgutierrez: admin: Add additional SSH key for vgutierrez [puppet] - 10https://gerrit.wikimedia.org/r/570241 [08:32:16] (03CR) 10Vgutierrez: [C: 03+2] admin: Add additional SSH key for vgutierrez [puppet] - 10https://gerrit.wikimedia.org/r/570241 (owner: 10Vgutierrez) [08:52:17] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 49 probes of 521 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [08:58:07] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 34 probes of 521 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [09:05:01] !log add individual FortiGate IPs hitting ulsfo (currently cp4028) to vcl blocked_nets -- trying to identify problematic traffic T243634 [09:05:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:05] T243634: ulsfo varnish-fe vcache processes overflow on FDs - https://phabricator.wikimedia.org/T243634 [09:15:02] (03PS1) 10Ema: vcl: apply blocked_nets acl before request normalization [puppet] - 10https://gerrit.wikimedia.org/r/570247 (https://phabricator.wikimedia.org/T243634) [09:20:30] 10Operations, 10ops-eqiad, 10Dumps-Generation: (Need By Jan 25) rack/setup/install snapshot1010.eqiad.wmnet - https://phabricator.wikimedia.org/T241794 (10ArielGlenn) How does the above ETA look, now that all hands is done and you have a better idea of what's on your plate? 
[09:21:07] (03PS1) 10Elukey: presto: add kerberos and tls support [puppet] - 10https://gerrit.wikimedia.org/r/570248 [09:22:36] (03PS2) 10Ema: vcl: block requests before Host normalization and switching vcl [puppet] - 10https://gerrit.wikimedia.org/r/570247 (https://phabricator.wikimedia.org/T243634) [09:25:40] (03PS1) 10Elukey: Add fake passwords for Presto TLS [labs/private] - 10https://gerrit.wikimedia.org/r/570250 [09:25:48] (03CR) 10Vgutierrez: [C: 03+1] vcl: block requests before Host normalization and switching vcl [puppet] - 10https://gerrit.wikimedia.org/r/570247 (https://phabricator.wikimedia.org/T243634) (owner: 10Ema) [09:26:01] (03CR) 10Elukey: [V: 03+2 C: 03+2] Add fake passwords for Presto TLS [labs/private] - 10https://gerrit.wikimedia.org/r/570250 (owner: 10Elukey) [09:26:59] (03CR) 10Ema: [C: 03+2] vcl: block requests before Host normalization and switching vcl [puppet] - 10https://gerrit.wikimedia.org/r/570247 (https://phabricator.wikimedia.org/T243634) (owner: 10Ema) [09:27:34] elukey: OK to puppet-merge your Presto change? [09:28:50] elukey: nananananananana? [09:29:06] (03PS2) 10Elukey: presto: add kerberos and tls support [puppet] - 10https://gerrit.wikimedia.org/r/570248 [09:29:52] ema: ahahah yes! 
[09:30:17] elukey: excellent, done :) [09:31:01] (03CR) 10Muehlenhoff: presto: add kerberos and tls support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/570248 (owner: 10Elukey) [09:33:00] (03CR) 10Elukey: presto: add kerberos and tls support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/570248 (owner: 10Elukey) [09:39:53] (03PS1) 10Elukey: Add fake Java keystore/truststore for the Presto test cluster [labs/private] - 10https://gerrit.wikimedia.org/r/570251 [09:40:51] (03CR) 10Elukey: [V: 03+2 C: 03+2] Add fake Java keystore/truststore for the Presto test cluster [labs/private] - 10https://gerrit.wikimedia.org/r/570251 (owner: 10Elukey) [09:41:40] (03PS3) 10Elukey: presto: add kerberos and tls support [puppet] - 10https://gerrit.wikimedia.org/r/570248 [09:42:01] (03PS1) 10Vgutierrez: requests: Use POST-as-GET to fetch the issued certificate [software/acme-chief] - 10https://gerrit.wikimedia.org/r/570252 (https://phabricator.wikimedia.org/T244236) [09:44:35] (03CR) 10jerkins-bot: [V: 04-1] requests: Use POST-as-GET to fetch the issued certificate [software/acme-chief] - 10https://gerrit.wikimedia.org/r/570252 (https://phabricator.wikimedia.org/T244236) (owner: 10Vgutierrez) [09:47:15] (03PS2) 10Vgutierrez: requests: Use POST-as-GET to fetch the issued certificate [software/acme-chief] - 10https://gerrit.wikimedia.org/r/570252 (https://phabricator.wikimedia.org/T244236) [09:48:22] (03PS4) 10Elukey: presto: add kerberos and tls support [puppet] - 10https://gerrit.wikimedia.org/r/570248 [09:51:21] !log install libmemcached-tools on mc-gp* servers - T240684 [09:51:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:23] T240684: Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T240684 [09:53:23] (03PS1) 10Alexandros Kosiaris: standard: Add linux-perf to standard packages [puppet] - 10https://gerrit.wikimedia.org/r/570254 [09:57:09] !log upload kubernetes 
1.13.12 to apt.wikimedia.org stretch-wikimedia/main T244335 [09:57:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:12] T244335: Upgrade production kubernetes clusters to a security supported version - https://phabricator.wikimedia.org/T244335 [09:57:24] (03PS11) 10Giuseppe Lavagetto: Configure forensic logging of Apache requests; enable on beta [puppet] - 10https://gerrit.wikimedia.org/r/511751 (owner: 10Ori.livneh) [09:57:26] (03PS1) 10Giuseppe Lavagetto: profile::mediawiki::php: raise number of workers on the canaries [puppet] - 10https://gerrit.wikimedia.org/r/570255 [09:57:28] (03PS1) 10Giuseppe Lavagetto: profile::mediawiki::php: allow varying the slowlog limit [puppet] - 10https://gerrit.wikimedia.org/r/570256 [09:57:34] (03CR) 10Effie Mouzeli: [C: 03+1] standard: Add linux-perf to standard packages [puppet] - 10https://gerrit.wikimedia.org/r/570254 (owner: 10Alexandros Kosiaris) [10:02:04] (03PS1) 10Ema: vcl: temporarily skip Host header normalization for FortiGate [puppet] - 10https://gerrit.wikimedia.org/r/570257 (https://phabricator.wikimedia.org/T243634) [10:03:41] (03CR) 10Vgutierrez: [C: 03+1] vcl: temporarily skip Host header normalization for FortiGate [puppet] - 10https://gerrit.wikimedia.org/r/570257 (https://phabricator.wikimedia.org/T243634) (owner: 10Ema) [10:03:47] (03CR) 10Ema: [C: 03+1] requests: Use POST-as-GET to fetch the issued certificate [software/acme-chief] - 10https://gerrit.wikimedia.org/r/570252 (https://phabricator.wikimedia.org/T244236) (owner: 10Vgutierrez) [10:04:43] (03CR) 10Muehlenhoff: [C: 04-1] "I like the idea, but linux-perf doesn't exist on jessie, so we need to make it conditional on stretch and later." 
[puppet] - 10https://gerrit.wikimedia.org/r/570254 (owner: 10Alexandros Kosiaris) [10:04:58] (03CR) 10Ema: [C: 03+2] vcl: temporarily skip Host header normalization for FortiGate [puppet] - 10https://gerrit.wikimedia.org/r/570257 (https://phabricator.wikimedia.org/T243634) (owner: 10Ema) [10:06:05] elukey: can the fake Java keystore/truststore stuff be puppet-merged? [10:06:14] (03CR) 10Vgutierrez: [C: 03+2] requests: Use POST-as-GET to fetch the issued certificate [software/acme-chief] - 10https://gerrit.wikimedia.org/r/570252 (https://phabricator.wikimedia.org/T244236) (owner: 10Vgutierrez) [10:06:21] ema: yes please :) [10:06:55] elukey: done! [10:08:15] <3 [10:10:44] !log Run mwscript deleteEqualMessages.php --delete to delete GrowthExperiments' message overrides (cswiki, viwiki, arwiki, kowiki) [10:10:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:24] (03PS5) 10Elukey: presto: add kerberos and tls support [puppet] - 10https://gerrit.wikimedia.org/r/570248 [10:22:58] 10Operations, 10serviceops: Reduce read pressure on memcached servers by adding a machine-local Memcache instance - https://phabricator.wikimedia.org/T244340 (10Joe) [10:24:08] !log Upload php-apcu_5.1.17+4.0.11-1+0~20190217111312.9+stretch~1.gbp192528+wmf2 - T236800 [10:24:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:19] T236800: Ensure apcu incr/decr are atomic (Upgrade php-apcu) - https://phabricator.wikimedia.org/T236800 [10:24:46] !log T244335 upgrade kubernetes-master on neon.eqiad.wmnet (staging) [10:24:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:49] T244335: Upgrade production kubernetes clusters to a security supported version - https://phabricator.wikimedia.org/T244335 [10:25:02] (03PS1) 10Ema: Revert "vcl: temporarily skip Host header normalization for FortiGate" [puppet] - 10https://gerrit.wikimedia.org/r/570278 (https://phabricator.wikimedia.org/T243634) [10:26:11] 
(03CR) 10Ema: [C: 03+2] Revert "vcl: temporarily skip Host header normalization for FortiGate" [puppet] - 10https://gerrit.wikimedia.org/r/570278 (https://phabricator.wikimedia.org/T243634) (owner: 10Ema) [10:38:05] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is CRITICAL: 2.451e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [10:38:35] 10Operations, 10Traffic, 10Patch-For-Review: ulsfo varnish-fe vcache processes overflow on FDs - https://phabricator.wikimedia.org/T243634 (10akosiaris) I blocked a number of IPs manually on cr3 and cr4 for ulsfo. Command was `set policy-options prefix-list blackhole4 ` for 5 IPs. The prefix list w... [10:43:24] !log cp4028: varnish-frontend-restart T243634 [10:43:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:27] T243634: ulsfo varnish-fe vcache processes overflow on FDs - https://phabricator.wikimedia.org/T243634 [10:44:33] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1003/20616/" [puppet] - 10https://gerrit.wikimedia.org/r/570248 (owner: 10Elukey) [10:50:01] !log T244335 upgrade kubernetes-node on kubestage1002.eqiad.wmnet to 1.13.12 [10:50:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:05] T244335: Upgrade production kubernetes clusters to a security supported version - https://phabricator.wikimedia.org/T244335 [10:53:08] (03PS16) 10ArielGlenn: write out and reuse pagerange info for big page content jobs [dumps] - 10https://gerrit.wikimedia.org/r/566580 (https://phabricator.wikimedia.org/T243434) [10:53:28] (03CR) 10jerkins-bot: [V: 04-1] write out and reuse pagerange info for big page content jobs [dumps] - 10https://gerrit.wikimedia.org/r/566580 (https://phabricator.wikimedia.org/T243434) 
(owner: 10ArielGlenn) [10:53:30] (03CR) 10Elukey: "Andrew let me know if this makes sense for you. The idea is to use self-signed certs and force HTTPS only to enable kerberos auth. It shou" [puppet] - 10https://gerrit.wikimedia.org/r/570248 (owner: 10Elukey) [10:53:37] !log rolling restart of all pods on kubernetes staging cluster to make sure everything is fine after the upgrade [10:53:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:37] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is OK: (C)1e+05 gt (W)1e+04 gt 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [11:15:17] (03PS1) 10Matthias Mullie: Re-enable delayed new upload jobs for MachineVision extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570287 (https://phabricator.wikimedia.org/T241072) [11:16:16] (03CR) 10Cparle: [C: 03+1] Re-enable delayed new upload jobs for MachineVision extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570287 (https://phabricator.wikimedia.org/T241072) (owner: 10Matthias Mullie) [11:16:18] (03PS1) 10Muehlenhoff: Explicitly add theemin to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/570288 [11:19:15] (03PS1) 10Muehlenhoff: Add lvs2009 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/570289 [11:20:16] (03PS1) 10Giuseppe Lavagetto: Revert "Update scaffold template names to use chart name" [deployment-charts] - 10https://gerrit.wikimedia.org/r/570290 [11:20:42] 10Operations, 10Traffic: ulsfo varnish-fe vcache processes overflow on FDs - https://phabricator.wikimedia.org/T243634 (10ema) Today we've been tackling the "FortiGate" angle (correlation described in T243634#5848297). The host in trouble this morning was cp4028, with 140k FDs at 10:30. In total, 5 different "... 
[11:23:56] (03PS2) 10Matthias Mullie: Re-enable delayed new upload jobs for MachineVision extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570287 (https://phabricator.wikimedia.org/T241072) [11:29:45] (03CR) 10Filippo Giunchedi: [C: 03+1] Explicitly add theemin to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/570288 (owner: 10Muehlenhoff) [11:37:48] 10Operations, 10Traffic: ulsfo varnish-fe vcache processes overflow on FDs - https://phabricator.wikimedia.org/T243634 (10akosiaris) I just reverted the cr3, cr4 ulsfo change. [11:41:11] (03PS1) 10Arturo Borrero Gonzalez: hiera: cloud: drop toollabs::external_hostname and toollabs::is_mail_relay [puppet] - 10https://gerrit.wikimedia.org/r/570294 (https://phabricator.wikimedia.org/T244222) [11:45:14] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "basically a NOOP https://puppet-compiler.wmflabs.org/compiler1001/20617/" [puppet] - 10https://gerrit.wikimedia.org/r/570294 (https://phabricator.wikimedia.org/T244222) (owner: 10Arturo Borrero Gonzalez) [11:45:37] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 76, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:46:33] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 56, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: (Dis)respected human, time to deploy European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200205T1200). Please do the needful. [12:00:05] Jdlrobson and awight: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:26] I can SWAT today! 
[12:00:56] I'd be happy to self-deploy my change, after Jdlrobson's. [12:01:25] Jdlrobson: are you around? [12:01:52] Urbanecm: thanks for taking care of so many of these deployments! I can also do the other patches, just to simplify the window... [12:02:27] awight: hth! Your backport is abandoned in its main version, hope that's okay :) [12:02:50] ah--nvm, Jdlrobson's patches look non-trivial, so better if a developer is present. Maybe I should deploy my patch while we wait for Jon? [12:03:02] awight: go ahead [12:03:08] ty [12:04:23] awight: iim here [12:04:31] Sorry lost track of time [12:04:36] * Urbanecm waves to Jdlrobson [12:04:39] hey Urbanecm [12:04:48] so I actually only want to deploy one of my changes [12:04:55] the other I just need to sync to wikimedia debug [12:04:56] is that possible? [12:05:01] Sure [12:05:03] I'm waiting for Jenkins, so Jdlrobson please go ahead. [12:05:19] the more time I have for testing the better so that suits me :) [12:05:41] https://gerrit.wikimedia.org/r/c/570180/ is the one I just want to test on wikimedia debug but not sync [12:05:47] awight: I'll claim mwdebug1002 for Jdlrobson's debug-only change then, so we can leave it there for longer time [12:05:53] https://gerrit.wikimedia.org/r/c/570186/ should be synced [12:05:58] ack [12:06:00] *but I'll need time to test [12:06:02] thanks Urbanecm :) [12:06:15] Urbanecm: great to hear we can use both hosts again! [12:06:24] (03CR) 10Vgutierrez: [C: 03+1] Switch authdns* to standard Partman recipes [puppet] - 10https://gerrit.wikimedia.org/r/566476 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [12:06:27] awight: both are broken in the same way :D [12:06:34] (03PS2) 10Muehlenhoff: Explicitly add theemin to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/570288 [12:06:35] (broken=some log noise) [12:07:09] Jdlrobson: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/570180 is available at mwdebug1002 [12:07:22] hehe [12:07:42] great! 
beginning testing! [12:09:05] (03CR) 10Muehlenhoff: [C: 03+2] Explicitly add theemin to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/570288 (owner: 10Muehlenhoff) [12:09:32] Urbanecm: are you sure https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/570180/ is working? It doesn't seem to be kicking in [12:09:37] looking [12:09:43] (am not sure how dblists work exactly) [12:10:11] Jdlrobson: could you try now? [12:10:23] (03CR) 10Jbond: [C: 03+1] "LGTM, thanks" [puppet] - 10https://gerrit.wikimedia.org/r/568857 (owner: 10Legoktm) [12:11:22] Urbanecm: nope.. maybe i did something wrong? Looks at config.. [12:12:00] hmm... [12:12:49] it definitely looks applied [12:12:50] Urbanecm: it's possible it is working actually.. just not behaving how i expected [12:12:52] which is fine :) [12:13:10] I've now checked, the variable that is controlled by the dblist is definitely changed [12:13:20] great. let me try another test [12:13:23] sure [12:13:40] ok yep! thank you [12:13:41] all good! [12:13:44] phew! [12:13:44] Jdlrobson: ad https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/570186, I can't SWAT that. Per https://wikitech.wikimedia.org/wiki/SWAT_deploys, each patch should need only one sync. You need to create several patches, each changing only IS.php, or only CS.php, not both. [12:14:12] Jdlrobson: okay. Can I revert that? [12:14:23] (03CR) 10Vgutierrez: [C: 03+1] Add lvs2009 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/570289 (owner: 10Muehlenhoff) [12:15:00] yes please Urbanecm [12:15:03] reverting [12:15:07] i'll deploy that next week [12:15:23] Jdlrobson: you mean the other patch? [12:15:41] (03CR) 10Jdlrobson: [C: 03+1] "Thanks to Urbancm I was able to test this today and can confirm it works as expected. 
I'll deploy next week once I get the go ahead from e" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570180 (https://phabricator.wikimedia.org/T32405) (owner: 10Jdlrobson) [12:15:56] Urbanecm: you can revert this one ^ [12:16:10] Jdlrobson: done [12:16:22] With respect to https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/570186 what do I need to change? [12:16:29] this needs to go out before next deploy or it will cause some issues [12:16:42] so 3 patches? [12:16:44] yes [12:16:56] (03PS1) 10Vgutierrez: Release 0.23 [software/acme-chief] - 10https://gerrit.wikimedia.org/r/570303 (https://phabricator.wikimedia.org/T244236) [12:17:06] Urbanecm: once you're done, let me know, I have some backports to deploy [12:17:20] we had some issues with deployers syncing stuff in wrong order, so that's why the policy was introduced - dependency tree would make sure patches get merged in correct order. [12:17:34] Jdlrobson: let me know once the patches are done [12:17:36] RECOVERY - WDQS high update lag on wdqs1010 is OK: (C)3600 ge (W)1200 ge 1056 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [12:17:39] tests/WgConfTestCase.php is just a docs change [12:17:46] do I still need to break it out into a separate commit? 
[12:17:54] Jdlrobson: no - that can be anywhere [12:17:58] (03CR) 10Jbond: [C: 04-1] "unfortunately this won't work until we enable `rich_data`" [puppet] - 10https://gerrit.wikimedia.org/r/569570 (https://phabricator.wikimedia.org/T242585) (owner: 10Filippo Giunchedi) [12:18:30] Amir1: ack [12:19:43] (03PS6) 10Jdlrobson: wgLogoHD and $wgVectorPrintLogo is replaced with wgLogos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570186 (https://phabricator.wikimedia.org/T232140) [12:19:45] (03PS1) 10Jdlrobson: Prepare tests for logo config change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570304 (https://phabricator.wikimedia.org/T232140) [12:20:58] (03CR) 10Vgutierrez: [C: 03+2] Release 0.23 [software/acme-chief] - 10https://gerrit.wikimedia.org/r/570303 (https://phabricator.wikimedia.org/T244236) (owner: 10Vgutierrez) [12:21:15] (03CR) 10jerkins-bot: [V: 04-1] Prepare tests for logo config change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570304 (https://phabricator.wikimedia.org/T232140) (owner: 10Jdlrobson) [12:21:38] (03CR) 10Muehlenhoff: "Actually -1ing, role(spare) would introduce base::firewall, while LVSes don't use it and it's non-trivial to get rid of. 
I'll add a role w" [puppet] - 10https://gerrit.wikimedia.org/r/570289 (owner: 10Muehlenhoff) [12:21:43] (03CR) 10Muehlenhoff: [C: 04-1] Add lvs2009 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/570289 (owner: 10Muehlenhoff) [12:22:06] Jdlrobson: ping me once you're done, please [12:22:59] (03PS2) 10Jdlrobson: Prepare tests for logo config change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570304 (https://phabricator.wikimedia.org/T232140) [12:23:22] (03PS1) 10Giuseppe Lavagetto: profile::services_proxy: add temporarily entries for k8s services [puppet] - 10https://gerrit.wikimedia.org/r/570306 [12:23:45] (03PS7) 10Jdlrobson: wgLogoHD and $wgVectorPrintLogo is replaced with wgLogos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570186 (https://phabricator.wikimedia.org/T232140) [12:24:30] Is https://gerrit.wikimedia.org/r/570304 + https://gerrit.wikimedia.org/r/570186 what you meant? (former updates the tests to pass in both cases). If so I think I'm ready [12:25:03] Jdlrobson: i need one commit to change IS.php, and second one CS.php. [12:25:17] but it's just a comment? [12:25:59] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/570186 seems to both change something outside a comment? 
[12:26:11] IS.php: line 1316, wgLogoHD => wgLogos [12:26:16] CS.php: Adds some stuff for back-compat [12:26:17] ohhhhh [12:26:19] (03PS1) 10Vgutierrez: requests: Use POST-as-GET to fetch the issued certificate [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/570307 (https://phabricator.wikimedia.org/T244236) [12:26:21] (03PS1) 10Vgutierrez: Release 0.23 [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/570308 (https://phabricator.wikimedia.org/T244236) [12:26:22] you are talking about those files not tests [12:26:23] (03PS1) 10Vgutierrez: debian: Add release 0.23 to changelog [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/570309 (https://phabricator.wikimedia.org/T244236) [12:26:27] Jdlrobson: yes [12:26:34] oh my misunderstanding [12:26:34] sorry for the confusion [12:26:36] okay that makes more sense [12:26:55] awight: I guess you can go ahead as I wait for Jdlrobson [12:27:16] +1 Urbanecm thanks, my patch is merged so here goes! [12:27:24] cool [12:28:18] 10Operations, 10Discovery, 10Traffic, 10Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (10JeanFred) [12:29:15] ok Urbanecm should be 3rd time lucky [12:29:55] (03PS8) 10Jdlrobson: wgLogoHD and $wgVectorPrintLogo is replaced with wgLogos (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570186 (https://phabricator.wikimedia.org/T232140) [12:29:57] (03PS3) 10Jdlrobson: wgLogoHD and $wgVectorPrintLogo is replaced with wgLogos (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570304 (https://phabricator.wikimedia.org/T232140) [12:30:18] (03PS4) 10Jdlrobson: wgLogoHD and $wgVectorPrintLogo is replaced with wgLogos (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570304 (https://phabricator.wikimedia.org/T232140) [12:30:24] ok Urbanecm ready when you are [12:30:55] (03PS1) 10Vgutierrez: install_server: Reimage cp5006 as 
buster [puppet] - 10https://gerrit.wikimedia.org/r/570310 (https://phabricator.wikimedia.org/T242093) [12:31:13] Jdlrobson: cool. Just want to make sure: I hope stuff won't break when servers won't see any wgLogoHD for some time? [12:31:23] my patch works on mwdebug1001, syncing now. [12:31:32] wgLogoHD should be optional so this shouldn't be a problem [12:31:44] Ok, makes sense [12:32:00] and temporary removal of wgLogoHD now beats removal of all logos in next weeks deploy :) [12:32:10] true [12:32:42] !log awight@deploy1001 Synchronized php-1.35.0-wmf.18/extensions/Cite: SWAT: [[gerrit:570285|Revert follow standardization (T240858)]] (duration: 01m 13s) [12:32:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:46] T240858: Clean up implementation for "follow" cases - https://phabricator.wikimedia.org/T240858 [12:33:03] Urbanecm: Jdlrobson: I'm all done, thanks! [12:33:09] thanks! [12:33:16] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570186 (https://phabricator.wikimedia.org/T232140) (owner: 10Jdlrobson) [12:33:22] (03PS1) 10Ema: Revert "ATS: temporarily leave AE untouched" [puppet] - 10https://gerrit.wikimedia.org/r/570311 (https://phabricator.wikimedia.org/T242478) [12:33:24] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570304 (https://phabricator.wikimedia.org/T232140) (owner: 10Jdlrobson) [12:33:28] (03CR) 10Vgutierrez: [C: 03+2] requests: Use POST-as-GET to fetch the issued certificate [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/570307 (https://phabricator.wikimedia.org/T244236) (owner: 10Vgutierrez) [12:33:36] (03CR) 10Vgutierrez: [C: 03+2] Release 0.23 [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/570308 (https://phabricator.wikimedia.org/T244236) (owner: 10Vgutierrez) [12:34:16] (03Merged) 10jenkins-bot: wgLogoHD and $wgVectorPrintLogo is replaced with wgLogos (1/2) [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/570186 (https://phabricator.wikimedia.org/T232140) (owner: 10Jdlrobson) [12:34:20] (03Merged) 10jenkins-bot: wgLogoHD and $wgVectorPrintLogo is replaced with wgLogos (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570304 (https://phabricator.wikimedia.org/T232140) (owner: 10Jdlrobson) [12:34:44] (03CR) 10Ema: [C: 03+1] "\o/" [puppet] - 10https://gerrit.wikimedia.org/r/570310 (https://phabricator.wikimedia.org/T242093) (owner: 10Vgutierrez) [12:34:47] Jdlrobson: pulled both onto mwdebug1001 in case you want to test it there [12:35:03] yes please [12:35:10] lmk if it works correctly [12:35:46] (03PS1) 10Arturo Borrero Gonzalez: hiera: cloud: tools: drop hiera keys migrated to horizon [puppet] - 10https://gerrit.wikimedia.org/r/570313 (https://phabricator.wikimedia.org/T244222) [12:36:07] (03Merged) 10jenkins-bot: requests: Use POST-as-GET to fetch the issued certificate [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/570307 (https://phabricator.wikimedia.org/T244236) (owner: 10Vgutierrez) [12:36:38] (03Merged) 10jenkins-bot: Release 0.23 [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/570308 (https://phabricator.wikimedia.org/T244236) (owner: 10Vgutierrez) [12:36:49] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] hiera: cloud: tools: drop hiera keys migrated to horizon [puppet] - 10https://gerrit.wikimedia.org/r/570313 (https://phabricator.wikimedia.org/T244222) (owner: 10Arturo Borrero Gonzalez) [12:36:52] (03PS2) 10Vgutierrez: debian: Add release 0.23 to changelog [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/570309 (https://phabricator.wikimedia.org/T244236) [12:37:35] Jdlrobson: I see a lot of memcached errors, https://logstash.wikimedia.org/goto/dced2f987d8d2ff9a77fc4948b465114 [12:37:58] still looking [12:38:27] sure [12:39:06] 10Operations, 10Puppet, 10Patch-For-Review, 10User-jbond: Investigate using the rich_data opsion to support Binary and binary_file for 
binary data - https://phabricator.wikimedia.org/T236481 (10jbond) for some reason this change is not attached to the ticket https://gerrit.wikimedia.org/r/c/operations/soft... [12:39:28] Urbanecm: are you seeing memcached issues relating to logos? [12:40:36] Jdlrobson: no, but I see some related to resourceloader, which is mentioned in your commits [12:41:10] (03PS1) 10Arturo Borrero Gonzalez: hiera: cloud: tools: drop data for tools-bastion-03 [puppet] - 10https://gerrit.wikimedia.org/r/570314 (https://phabricator.wikimedia.org/T244222) [12:41:35] both changes are on there or just one? [12:41:39] both [12:41:44] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] hiera: cloud: tools: drop data for tools-bastion-03 [puppet] - 10https://gerrit.wikimedia.org/r/570314 (https://phabricator.wikimedia.org/T244222) (owner: 10Arturo Borrero Gonzalez) [12:44:44] change LGTM [12:47:04] ok [12:48:54] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 5cc2b70: wgLogoHD and $wgVectorPrintLogo is replaced with wgLogos (T232140) (duration: 01m 07s) [12:48:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:58] T232140: Separate out logo handling into square image logos and long text/wordmark banner logos - https://phabricator.wikimedia.org/T232140 [12:48:59] Jdlrobson: first one synced [12:49:29] syncing the second one [12:50:05] Urbanecm: all good so far.. [12:50:32] !log urbanecm@deploy1001 Synchronized wmf-config/CommonSettings.php: SWAT: d450288: wgLogoHD and $wgVectorPrintLogo is replaced with wgLogos (T232140) (duration: 01m 07s) [12:50:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:43] Jdlrobson: and second one is out too [12:50:46] thanks for your patience [12:51:20] Amir1: air is clear [12:51:35] cool, I'm already testing it in mwdebug1001 [12:51:54] Urbanecm: thank you ! [12:51:58] Happy to help! 
[12:55:36] !log disable transit/peering BGP sessions on cr2-eqdfw [12:55:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:51] (03CR) 10Vgutierrez: "recheck" [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/570309 (https://phabricator.wikimedia.org/T244236) (owner: 10Vgutierrez) [12:58:40] (03PS1) 10Muehlenhoff: Switch cescout* to standard Partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/570316 (https://phabricator.wikimedia.org/T156955) [13:00:45] Urbanecm: looks like there may be a problem? https://logstash.wikimedia.org/app/kibana#/discover?_g=h@c8f79bd&_a=h@3dd85cf [13:00:57] !log SWAT needs more time [13:00:59] PHP Notice: Undefined variable: wgLogos in /srv/mediawiki/wmf-config/CommonSettings.php on line 857 [13:00:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:17] !log reboot cr2-eqdfw for software upgrade [13:01:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:19] UhOh, the cache issue... [13:01:30] Amir1: you fine with me resyncing that? [13:01:35] IS.php? 
[13:01:37] yup [13:01:51] okay [13:02:02] submitted [13:03:06] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 5cc2b70: wgLogoHD and $wgVectorPrintLogo is replaced with wgLogos (T232140) (duration: 01m 06s) [13:03:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:10] T232140: Separate out logo handling into square image logos and long text/wordmark banner logos - https://phabricator.wikimedia.org/T232140 [13:03:21] Jdlrobson: should be fine now [13:03:39] (03PS3) 10Vgutierrez: debian: Add release 0.23 to changelog [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/570309 (https://phabricator.wikimedia.org/T244236) [13:04:17] (03Abandoned) 10Filippo Giunchedi: cassandra: use wmflib::secret for binary files [puppet] - 10https://gerrit.wikimedia.org/r/569570 (https://phabricator.wikimedia.org/T242585) (owner: 10Filippo Giunchedi) [13:04:58] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:05:32] (03CR) 10Filippo Giunchedi: [C: 03+1] Switch cescout* to standard Partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/570316 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [13:05:48] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:06:03] (03CR) 10Vgutierrez: [C: 03+2] install_server: Reimage cp5006 as buster [puppet] - 10https://gerrit.wikimedia.org/r/570310 (https://phabricator.wikimedia.org/T242093) (owner: 10Vgutierrez) [13:06:28] PROBLEM - OSPF status on cr3-knams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:06:59] all of those are expected ^ [13:08:00] (03CR) 10Vgutierrez: [C: 03+2] debian: Add release 0.23 to changelog [software/acme-chief] (debian) - 
10https://gerrit.wikimedia.org/r/570309 (https://phabricator.wikimedia.org/T244236) (owner: 10Vgutierrez) [13:08:48] !log depooling & reimaging cp5006 as buster - T242093 [13:08:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:52] T242093: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 [13:08:56] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:09:25] (03CR) 10Gilles: "I've created a docker image available on docker hub with all you need, that should be helpful:" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/568646 (https://phabricator.wikimedia.org/T228467) (owner: 10Brion VIBBER) [13:09:38] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:10:37] !log rollback: disable transit/peering BGP sessions on cr2-eqdfw [13:10:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:06] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp5006.eqsin.wmnet ` The log can be found in `/var/log/wmf-auto-reima... [13:14:00] (03CR) 10Gilles: [C: 04-1] "3 tests are failing with this patch applied, due to visual dissimilarity. 
For some tests the difference is huge, which might suggest that " [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/568646 (https://phabricator.wikimedia.org/T228467) (owner: 10Brion VIBBER) [13:15:03] !log ladsgroup@deploy1001 Synchronized php-1.35.0-wmf.16/extensions/Wikibase/lib/includes/Store/CachingPropertyInfoLookup.php: SWAT: [[gerrit:570301|Cache PropertyInfoLookup internally]] (T243955) (duration: 01m 07s) [13:15:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:06] T243955: CachingPropertyInfoLookup doesn't cache lookups internally - https://phabricator.wikimedia.org/T243955 [13:15:37] !log disable transit/peering BGP sessions on cr2-eqord [13:15:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:57] (03CR) 10Gilles: "As pointed out in the other patch, the docker image I've just made should help you write tests easily: https://wikitech.wikimedia.org/wiki" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/569341 (https://phabricator.wikimedia.org/T166024) (owner: 10Brion VIBBER) [13:16:12] !log upload acme-chief 0.23 to apt.wm.o (buster) - T244236 [13:16:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:17] T244236: acme-chief is unable to renew certificates against LE staging environment - https://phabricator.wikimedia.org/T244236 [13:17:46] !log increase ospf cost for cr2-eqord links [13:17:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:43] (03Restored) 10Filippo Giunchedi: cassandra: use wmflib::secret for binary files [puppet] - 10https://gerrit.wikimedia.org/r/569570 (https://phabricator.wikimedia.org/T242585) (owner: 10Filippo Giunchedi) [13:22:20] (03PS3) 10Filippo Giunchedi: cassandra: use wmflib::secret for binary files [puppet] - 10https://gerrit.wikimedia.org/r/569570 (https://phabricator.wikimedia.org/T242585) [13:22:22] (03PS4) 10Filippo Giunchedi: wip: cassandra logs to 
logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/569564 (https://phabricator.wikimedia.org/T242585) [13:24:02] (03CR) 10Filippo Giunchedi: [C: 03+1] "Added a disclaimer pointing to the ticket after chatting with John, should be good enough to make PCC available again and DTRT in producti" [puppet] - 10https://gerrit.wikimedia.org/r/569570 (https://phabricator.wikimedia.org/T242585) (owner: 10Filippo Giunchedi) [13:24:46] !log ladsgroup@deploy1001 Synchronized php-1.35.0-wmf.18/extensions/Wikibase/lib/includes/Store/CachingPropertyInfoLookup.php: SWAT: [[gerrit:570301|Cache PropertyInfoLookup internally]] (T243955) (duration: 01m 07s) [13:24:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:50] T243955: CachingPropertyInfoLookup doesn't cache lookups internally - https://phabricator.wikimedia.org/T243955 [13:25:16] !log reboot cr2-eqord for software upgrade - yaaaaa [13:25:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:33] (03CR) 10Alexandros Kosiaris: [C: 04-1] "minor nit, otherwise LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/563374 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [13:28:57] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 74, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:29:54] !log manually set 10.2.1.42 eventgate-analytics.discovery.wmnet in /etc/hosts for mw1331, mw1348. 
Verify hypothesis that this should cause increased latency [13:29:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:01] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:31:53] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 4/6 UP : OSPFv3: 4/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:32:25] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:32:38] (03PS1) 10Jdlrobson: Restore wgLogoHD to wikis without a MinervaCustomLogos defined [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570326 (https://phabricator.wikimedia.org/T232140) [13:33:01] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 76, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:33:13] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [13:33:39] annnnndd it's back! 
[13:34:25] (03PS2) 10Ema: Revert "ATS: temporarily leave AE untouched" [puppet] - 10https://gerrit.wikimedia.org/r/570311 (https://phabricator.wikimedia.org/T242478) [13:35:06] !log rollback traffic steering off cr2-eqord [13:35:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:20] (03PS17) 10ArielGlenn: write out and reuse pagerange info for big page content jobs [dumps] - 10https://gerrit.wikimedia.org/r/566580 (https://phabricator.wikimedia.org/T243434) [13:37:27] (03CR) 10Vgutierrez: [C: 03+1] Revert "ATS: temporarily leave AE untouched" [puppet] - 10https://gerrit.wikimedia.org/r/570311 (https://phabricator.wikimedia.org/T242478) (owner: 10Ema) [13:37:44] (03CR) 10jerkins-bot: [V: 04-1] write out and reuse pagerange info for big page content jobs [dumps] - 10https://gerrit.wikimedia.org/r/566580 (https://phabricator.wikimedia.org/T243434) (owner: 10ArielGlenn) [13:38:25] !log cp: disable puppet and merge https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/570311/ T242478 [13:38:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:28] T242478: Production load.php spends ~ 10% time doing output compression within PHP - https://phabricator.wikimedia.org/T242478 [13:39:17] !log EU SWAT is done [13:39:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:38] (03CR) 10Ema: [C: 03+2] Revert "ATS: temporarily leave AE untouched" [puppet] - 10https://gerrit.wikimedia.org/r/570311 (https://phabricator.wikimedia.org/T242478) (owner: 10Ema) [13:40:27] Amir1: looks like I broke higher dpi logos on quite a few projects with my change. It's late where I am (am on UTC+8) but I've asked someone in my team to deploy https://gerrit.wikimedia.org/r/570326 in a later swat window. Letting you know in case anyone raises the issue here. 
[13:41:28] !log cp1075: unset Accept-Encoding on origin server requests T242478 [13:41:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:00] Jdlrobson: sure, right now we have another problem though (increased latency) [13:42:03] !log undo the manually set 10.2.1.42 eventgate-analytics.discovery.wmnet in /etc/hosts for mw1331, mw1348. Verify hypothesis that this should cause increased latency. Restart php-fpm [13:42:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:10] Amir1: wait out a bit [13:42:13] I may be to blame [13:42:36] * Amir1 loves blaming [13:42:48] (lol) [13:43:07] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [13:43:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:10] I may also not be to blame ofc. The jump at the memcached gets probably exhonerates me however [13:43:31] yeah, doesn't look like it's my change [13:45:29] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:45:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:51] !log Decrease buffer pool size on db1107 for testing - T242702 [13:46:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:54] T242702: Test MariaDB 10.4 in production - https://phabricator.wikimedia.org/T242702 [13:48:27] (03PS18) 10ArielGlenn: write out and reuse pagerange info for big page content jobs [dumps] - 10https://gerrit.wikimedia.org/r/566580 (https://phabricator.wikimedia.org/T243434) [13:50:58] (03PS2) 10Muehlenhoff: Switch ORES to standard partman recipes [puppet] - 10https://gerrit.wikimedia.org/r/563374 (https://phabricator.wikimedia.org/T156955) [13:51:23] (03CR) 10jerkins-bot: [V: 04-1] Switch ORES to standard partman recipes [puppet] - 10https://gerrit.wikimedia.org/r/563374 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [13:51:30] (03CR) 10Muehlenhoff: Switch ORES 
to standard partman recipes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/563374 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [13:54:47] (03PS1) 10Jbond: wmflib::end_with: create String.end_with function [puppet] - 10https://gerrit.wikimedia.org/r/570330 (https://phabricator.wikimedia.org/T244222) [13:54:49] (03PS1) 10Jbond: realm global: make the realm variable a global in labs [puppet] - 10https://gerrit.wikimedia.org/r/570331 (https://phabricator.wikimedia.org/T244222) [13:55:02] (03PS3) 10Muehlenhoff: Switch ORES to standard partman recipes [puppet] - 10https://gerrit.wikimedia.org/r/563374 (https://phabricator.wikimedia.org/T156955) [13:55:33] (03PS1) 10Vgutierrez: requests: Fix content-type on fetch_certificate [software/acme-chief] - 10https://gerrit.wikimedia.org/r/570332 (https://phabricator.wikimedia.org/T244236) [13:56:39] (03PS4) 10Muehlenhoff: Switch ORES to standard partman recipes [puppet] - 10https://gerrit.wikimedia.org/r/563374 (https://phabricator.wikimedia.org/T156955) [13:58:15] (03CR) 10Muehlenhoff: [C: 03+2] Switch ORES to standard partman recipes [puppet] - 10https://gerrit.wikimedia.org/r/563374 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [13:58:43] (03CR) 10Ottomata: [C: 03+1] "Sounds good to me! Is there a reason not to use the PuppetCA? Totally fine with self signed, just curious." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/570248 (owner: 10Elukey) [13:59:22] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 3/5 UP : OSPFv3: 3/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:00:20] PROBLEM - BFD status on cr2-eqdfw is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:00:57] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp5006.eqsin.wmnet'] ` and were **ALL** successful. [14:01:31] (03PS2) 10Muehlenhoff: Switch authdns* to standard Partman recipes [puppet] - 10https://gerrit.wikimedia.org/r/566476 (https://phabricator.wikimedia.org/T156955) [14:01:59] (03CR) 10Ema: [C: 03+1] requests: Fix content-type on fetch_certificate [software/acme-chief] - 10https://gerrit.wikimedia.org/r/570332 (https://phabricator.wikimedia.org/T244236) (owner: 10Vgutierrez) [14:02:43] 10Operations, 10observability: Provision grafana VM in codfw - https://phabricator.wikimedia.org/T244357 (10fgiunchedi) [14:02:51] i leave the wikis in your more than capable hands awight Amir1 :) [14:02:51] (03CR) 10Vgutierrez: [C: 03+2] requests: Fix content-type on fetch_certificate [software/acme-chief] - 10https://gerrit.wikimedia.org/r/570332 (https://phabricator.wikimedia.org/T244236) (owner: 10Vgutierrez) [14:03:41] (03CR) 10Jdlrobson: [C: 03+1] "please swat asap to restore HD logos to a bunch of wikis." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570326 (https://phabricator.wikimedia.org/T232140) (owner: 10Jdlrobson) [14:03:44] Amir1: I'm also finished, so it's all yours! [14:04:23] I have nothing to do right now, do you want me to deploy things? 
[14:04:27] I'm slightly confused now [14:05:47] (03CR) 10Elukey: ">" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/570248 (owner: 10Elukey) [14:06:14] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [14:06:46] Amir1: No, but I heard a rumor earlier that you had your own backports :-) [14:07:21] awight: I backported them one hour and five minutes ago :P [14:08:12] bahaha. Until next time, then! [14:09:40] (03PS1) 10Ema: ATS: temporarily leave AE untouched [puppet] - 10https://gerrit.wikimedia.org/r/570336 (https://phabricator.wikimedia.org/T242478) [14:12:36] (03CR) 10Ema: [C: 03+2] ATS: temporarily leave AE untouched [puppet] - 10https://gerrit.wikimedia.org/r/570336 (https://phabricator.wikimedia.org/T242478) (owner: 10Ema) [14:13:51] !log cp1075: back to leaving Accept-Encoding as it is due to unrelated applayer issues T242478 [14:13:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:54] T242478: Production load.php spends ~ 10% time doing output compression within PHP - https://phabricator.wikimedia.org/T242478 [14:14:11] (03CR) 10Vgutierrez: [C: 03+2] Release 0.24 [software/acme-chief] - 10https://gerrit.wikimedia.org/r/570338 (https://phabricator.wikimedia.org/T244236) (owner: 10Vgutierrez) [14:14:38] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [14:16:42] (03CR) 10Ottomata: [C: 03+1] "Makes sense proceed! TY!" 
[puppet] - 10https://gerrit.wikimedia.org/r/570248 (owner: 10Elukey) [14:17:11] (03Merged) 10jenkins-bot: Release 0.24 [software/acme-chief] - 10https://gerrit.wikimedia.org/r/570338 (https://phabricator.wikimedia.org/T244236) (owner: 10Vgutierrez) [14:17:14] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:17:14] RECOVERY - OSPF status on cr3-knams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:18:18] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:19:33] ACKNOWLEDGEMENT - BFD status on cr2-eqdfw is CRITICAL: CRIT: Down: 1 Ayounsi T/S with Juniper https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:21:17] (03PS1) 10Vgutierrez: requests: Fix content-type on fetch_certificate [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/570340 (https://phabricator.wikimedia.org/T244236) [14:21:19] (03PS1) 10Vgutierrez: Release 0.24 [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/570341 (https://phabricator.wikimedia.org/T244236) [14:21:21] (03PS1) 10Vgutierrez: debian: Add release 0.24 to changelog [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/570342 (https://phabricator.wikimedia.org/T244236) [14:23:02] !log pooling cp5006 - T242093 [14:23:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:05] T242093: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 [14:24:16] (03CR) 10Vgutierrez: [C: 03+2] requests: Fix content-type on fetch_certificate [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/570340 (https://phabricator.wikimedia.org/T244236) (owner: 10Vgutierrez) [14:24:28] (03CR) 10Vgutierrez: [C: 03+2] Release 0.24 [software/acme-chief] (debian) - 
10https://gerrit.wikimedia.org/r/570341 (https://phabricator.wikimedia.org/T244236) (owner: 10Vgutierrez) [14:26:25] (03PS1) 10Jbond: wmflib::require_domains: add new domain to to replace require_realm [puppet] - 10https://gerrit.wikimedia.org/r/570343 [14:26:37] !log push inital flowspec config to all routers [14:26:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:46] (03CR) 10Vgutierrez: [C: 03+2] debian: Add release 0.24 to changelog [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/570342 (https://phabricator.wikimedia.org/T244236) (owner: 10Vgutierrez) [14:29:44] (03PS2) 10Jbond: wmflib::require_domains: add new function to to replace require_realm [puppet] - 10https://gerrit.wikimedia.org/r/570343 [14:29:52] (03PS2) 10Jbond: realm global: make the realm variable a global in labs [puppet] - 10https://gerrit.wikimedia.org/r/570331 (https://phabricator.wikimedia.org/T244222) [14:30:19] (03PS3) 10Jbond: wmflib::require_domains: add new function to to replace require_realm [puppet] - 10https://gerrit.wikimedia.org/r/570343 [14:30:30] !log upload acme-chief 0.24 to apt.wm.o (buster) - T244236 [14:30:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:33] T244236: acme-chief is unable to renew certificates against LE staging environment - https://phabricator.wikimedia.org/T244236 [14:30:37] (03PS1) 10Giuseppe Lavagetto: mcrouter: run at nice -19 as php-fpm [puppet] - 10https://gerrit.wikimedia.org/r/570346 [14:32:14] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [14:32:32] <_joe_> !log restarting 
mcrouter at nice -19 on mw1331 for testing effects of that change [14:32:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:05] (03CR) 10Elukey: [C: 03+1] mcrouter: run at nice -19 as php-fpm [puppet] - 10https://gerrit.wikimedia.org/r/570346 (owner: 10Giuseppe Lavagetto) [14:34:06] (03CR) 10jerkins-bot: [V: 04-1] wmflib::require_domains: add new function to to replace require_realm [puppet] - 10https://gerrit.wikimedia.org/r/570343 (owner: 10Jbond) [14:34:59] !log updating acme-chief to version 0.24 - T244236 [14:35:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:24] 10Operations, 10SRE-tools: Homer: commit> no causes stacktrace - https://phabricator.wikimedia.org/T244362 (10ayounsi) p:05Triage→03Low [14:35:52] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mcrouter: run at nice -19 as php-fpm [puppet] - 10https://gerrit.wikimedia.org/r/570346 (owner: 10Giuseppe Lavagetto) [14:36:13] <_joe_> sigh there is an error though [14:36:35] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [14:36:58] (03PS2) 10Giuseppe Lavagetto: mcrouter: run at nice -19 as php-fpm [puppet] - 10https://gerrit.wikimedia.org/r/570346 [14:37:29] 10Operations, 10Acme-chief, 10Traffic, 10Patch-For-Review: acme-chief is unable to renew certificates against LE staging environment - https://phabricator.wikimedia.org/T244236 (10Vgutierrez) 05Open→03Resolved Fixed by backporting https://github.com/certbot/certbot/commit/0b5468e992ab57fa028ddf33ca2351... 
[14:37:59] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mcrouter: run at nice -19 as php-fpm [puppet] - 10https://gerrit.wikimedia.org/r/570346 (owner: 10Giuseppe Lavagetto) [14:38:44] (03PS1) 10Vgutierrez: install_server: Reimage cp5012 as buster [puppet] - 10https://gerrit.wikimedia.org/r/570347 (https://phabricator.wikimedia.org/T242093) [14:39:19] (03PS1) 10Jbond: wmflib::require_domains: use require_domains instead of require_realm [puppet] - 10https://gerrit.wikimedia.org/r/570348 (https://phabricator.wikimedia.org/T244222) [14:39:37] (03CR) 10Jbond: "LGTM thanks" [puppet] - 10https://gerrit.wikimedia.org/r/569570 (https://phabricator.wikimedia.org/T242585) (owner: 10Filippo Giunchedi) [14:39:43] (03CR) 10Jbond: [C: 03+1] cassandra: use wmflib::secret for binary files [puppet] - 10https://gerrit.wikimedia.org/r/569570 (https://phabricator.wikimedia.org/T242585) (owner: 10Filippo Giunchedi) [14:39:56] 10Operations, 10SRE-tools: Homer: commit timeout on MX104 and SRXs - https://phabricator.wikimedia.org/T244363 (10ayounsi) p:05Triage→03Normal [14:40:33] (03PS1) 10Elukey: profile::memcached::instance: add the theads parameter [puppet] - 10https://gerrit.wikimedia.org/r/570349 [14:42:36] (03PS1) 10Jbond: realm: remove realm global variable [puppet] - 10https://gerrit.wikimedia.org/r/570350 (https://phabricator.wikimedia.org/T244222) [14:43:32] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/20619/ - noop as expected" [puppet] - 10https://gerrit.wikimedia.org/r/570349 (owner: 10Elukey) [14:46:14] (03CR) 10jerkins-bot: [V: 04-1] realm: remove realm global variable [puppet] - 10https://gerrit.wikimedia.org/r/570350 (https://phabricator.wikimedia.org/T244222) (owner: 10Jbond) [14:48:20] PROBLEM - mediawiki originals uploads -hourly- for codfw on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005:9112 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To 
https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [14:48:34] PROBLEM - mediawiki originals uploads -hourly- for eqiad on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005:9112 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [14:49:07] <_joe_> jbond42: removing ::realm? [14:49:18] <_joe_> that's... quite complex, why would you want to? [14:50:58] (03CR) 10Filippo Giunchedi: [C: 03+2] cassandra: use wmflib::secret for binary files [puppet] - 10https://gerrit.wikimedia.org/r/569570 (https://phabricator.wikimedia.org/T242585) (owner: 10Filippo Giunchedi) [14:52:24] 10Operations, 10Performance-Team, 10Traffic, 10Patch-For-Review: Production load.php spends ~ 10% time doing output compression within PHP - https://phabricator.wikimedia.org/T242478 (10ema) I have applied the change to cp1075 for some minutes, and the effect on network transfer is [[https://grafana.wikime... [14:53:34] (03PS6) 10Elukey: presto: add kerberos and tls support [puppet] - 10https://gerrit.wikimedia.org/r/570248 [14:53:59] _joe_: i fell down a rabbit hole, i started moving it from hiera to a global variable so that wmcs could have a more complex hiera hierarchy and then thought that perhaps this may be a simpler alternative [14:54:06] (03PS4) 10Jbond: wmflib::require_domains: add new function to to replace require_realm [puppet] - 10https://gerrit.wikimedia.org/r/570343 [14:54:40] _joe_: also it looks like the actual realm variable is not really used any more apart from in the require_realm function [14:55:12] <_joe_> jbond42: uh?
[14:55:31] <_joe_> ~/Code/WMF/operations/puppet (production=)$ git grep ::realm | wc -l [14:55:32] <_joe_> 75 [14:57:01] (03CR) 10Effie Mouzeli: [C: 03+1] profile::memcached::instance: add the theads parameter [puppet] - 10https://gerrit.wikimedia.org/r/570349 (owner: 10Elukey) [14:57:26] (03CR) 10jerkins-bot: [V: 04-1] wmflib::require_domains: add new function to to replace require_realm [puppet] - 10https://gerrit.wikimedia.org/r/570343 (owner: 10Jbond) [14:57:36] _joe_: thanks, i forgot to qualify it while checking; i'll drop the last two CRs in that set [14:58:57] (03PS1) 10Ema: Revert "ATS: temporarily leave AE untouched" [puppet] - 10https://gerrit.wikimedia.org/r/570352 (https://phabricator.wikimedia.org/T242478) [14:59:06] (03Abandoned) 10Jbond: realm: remove realm global variable [puppet] - 10https://gerrit.wikimedia.org/r/570350 (https://phabricator.wikimedia.org/T244222) (owner: 10Jbond) [14:59:14] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 8429 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:59:30] <_joe_> this is expected more or less ^^ [14:59:44] <_joe_> puppet is running and restarting mcrouter across the fleet [15:01:03] (03PS1) 10Filippo Giunchedi: Revert "cassandra: use wmflib::secret for binary files" [puppet] - 10https://gerrit.wikimedia.org/r/570353 [15:01:06] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 511 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:01:27] (03CR) 10Filippo Giunchedi: [C: 03+2] Revert "cassandra: use wmflib::secret for binary files" [puppet] - 10https://gerrit.wikimedia.org/r/570353 (owner: 10Filippo Giunchedi) [15:04:49] (03CR) 10Ema: [C: 03+1] install_server: Reimage cp5012 as buster [puppet] -
10https://gerrit.wikimedia.org/r/570347 (https://phabricator.wikimedia.org/T242093) (owner: 10Vgutierrez) [15:05:50] (03PS5) 10Jbond: wmflib::require_domains: add new function to to replace require_realm [puppet] - 10https://gerrit.wikimedia.org/r/570343 [15:05:52] (03CR) 10Vgutierrez: [C: 03+1] Revert "ATS: temporarily leave AE untouched" [puppet] - 10https://gerrit.wikimedia.org/r/570352 (https://phabricator.wikimedia.org/T242478) (owner: 10Ema) [15:06:27] (03PS2) 10Jbond: wmflib::require_domains: use require_domains instead of require_realm [puppet] - 10https://gerrit.wikimedia.org/r/570348 (https://phabricator.wikimedia.org/T244222) [15:06:29] (03CR) 10Elukey: [C: 03+2] presto: add kerberos and tls support [puppet] - 10https://gerrit.wikimedia.org/r/570248 (owner: 10Elukey) [15:08:33] (03PS5) 10Filippo Giunchedi: cassandra: restbase-dev logs to logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/569564 (https://phabricator.wikimedia.org/T242585) [15:08:40] (03CR) 10Ema: [C: 03+2] Revert "ATS: temporarily leave AE untouched" [puppet] - 10https://gerrit.wikimedia.org/r/570352 (https://phabricator.wikimedia.org/T242478) (owner: 10Ema) [15:09:02] (03CR) 10jerkins-bot: [V: 04-1] wmflib::require_domains: add new function to to replace require_realm [puppet] - 10https://gerrit.wikimedia.org/r/570343 (owner: 10Jbond) [15:12:00] !log cp: unset Accept-Encoding from ats-be requests to applayer T242478 [15:12:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:04] T242478: Production load.php spends ~ 10% time doing output compression within PHP - https://phabricator.wikimedia.org/T242478 [15:12:59] (03CR) 10Vgutierrez: [C: 03+2] install_server: Reimage cp5012 as buster [puppet] - 10https://gerrit.wikimedia.org/r/570347 (https://phabricator.wikimedia.org/T242093) (owner: 10Vgutierrez) [15:15:19] !log depooling & reimaging cp5012 as buster - T242093 [15:15:20] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:21] T242093: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 [15:17:12] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [15:17:46] (03PS6) 10Jbond: wmflib::require_domains: add new function to to replace require_realm [puppet] - 10https://gerrit.wikimedia.org/r/570343 [15:17:57] (03PS3) 10Jbond: wmflib::require_domains: use require_domains instead of require_realm [puppet] - 10https://gerrit.wikimedia.org/r/570348 (https://phabricator.wikimedia.org/T244222) [15:18:28] 10Operations, 10Traffic: traffic_server crash upon Lua reload: attempt to concatenate a table value - https://phabricator.wikimedia.org/T242952 (10ema) This just happened on cp1087: ` Feb 05 15:14:05 cp1087 systemd[1]: Reloaded Apache Traffic Server is a fast, scalable and extensible caching proxy server.. Fe... 
[15:19:23] ACKNOWLEDGEMENT - traffic_server backend process restarted on cp1087 is CRITICAL: 2 ge 2 Ema Known issue: https://phabricator.wikimedia.org/T242952 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=eqiad+prometheus/ops&var-instance=cp1087&var-layer=backend [15:19:23] ACKNOWLEDGEMENT - traffic_server backend process restarted on cp5010 is CRITICAL: 2 ge 2 Ema Known issue: https://phabricator.wikimedia.org/T242952 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=eqsin+prometheus/ops&var-instance=cp5010&var-layer=backend [15:21:35] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp5012.eqsin.wmnet ` The log can be found in `/var/log/wmf-auto-reima... 
[15:24:25] !log Rollout php-apcu_5.1.17+4.0.11-1+0~20190217111312.9+stretch~1.gbp192528+wmf2 to api, app and jobrunner canaries - T236800 [15:24:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:33] T236800: Ensure apcu incr/decr are atomic (Upgrade php-apcu) - https://phabricator.wikimedia.org/T236800 [15:25:20] (03PS1) 10Jhedden: openstack: update cloudvirt101[56] pool status [puppet] - 10https://gerrit.wikimedia.org/r/570358 (https://phabricator.wikimedia.org/T243327) [15:26:27] 10Operations, 10ops-codfw: (No Need By Date Provided) codfw: rack/setup/install elastic20{55,56,57,58,59,60}.wikimedia.org - https://phabricator.wikimedia.org/T241337 (10Gehel) [15:27:29] (03CR) 10Jhedden: [C: 03+2] openstack: update cloudvirt101[56] pool status [puppet] - 10https://gerrit.wikimedia.org/r/570358 (https://phabricator.wikimedia.org/T243327) (owner: 10Jhedden) [15:27:44] (03PS7) 10Jbond: wmflib::require_domains: add new function to to replace require_realm [puppet] - 10https://gerrit.wikimedia.org/r/570343 [15:28:03] (03PS4) 10Jbond: wmflib::require_domains: use require_domains instead of require_realm [puppet] - 10https://gerrit.wikimedia.org/r/570348 (https://phabricator.wikimedia.org/T244222) [15:29:42] !log restart php-fpm on canaries - T236800 [15:29:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:45] T236800: Ensure apcu incr/decr are atomic (Upgrade php-apcu) - https://phabricator.wikimedia.org/T236800 [15:31:51] (03PS1) 10Ottomata: Temporarily allow hadoop test cluster workers to talk to JupyterHub on analytics1030 [puppet] - 10https://gerrit.wikimedia.org/r/570362 (https://phabricator.wikimedia.org/T224658) [15:35:18] (03CR) 10Elukey: [C: 03+2] profile::memcached::instance: add the theads parameter [puppet] - 10https://gerrit.wikimedia.org/r/570349 (owner: 10Elukey) [15:36:14] (03CR) 10Ottomata: [V: 03+2 C: 03+2] 
"https://puppet-compiler.wmflabs.org/compiler1002/20620/analytics1030.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/570362 (https://phabricator.wikimedia.org/T224658) (owner: 10Ottomata) [15:37:12] <_joe_> we're having anotrher problem [15:38:12] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [15:39:01] (03PS1) 10Jhedden: openstack: switch cloudvirt101[56] to ceph storage [puppet] - 10https://gerrit.wikimedia.org/r/570363 (https://phabricator.wikimedia.org/T243327) [15:39:52] (03PS2) 10Jhedden: openstack: switch cloudvirt101[56] to ceph storage [puppet] - 10https://gerrit.wikimedia.org/r/570363 (https://phabricator.wikimedia.org/T243327) [15:41:26] (03PS1) 10Elukey: role::mediawiki::memcached:gutter: set threads to 16 [puppet] - 10https://gerrit.wikimedia.org/r/570364 (https://phabricator.wikimedia.org/T240684) [15:41:35] (03CR) 10Andrew Bogott: [C: 03+1] "Assuming this works and gets set in the right order for VMs, looks good to me!" [puppet] - 10https://gerrit.wikimedia.org/r/570331 (https://phabricator.wikimedia.org/T244222) (owner: 10Jbond) [15:42:46] (03CR) 10Jforrester: "This isn't deploy-safe. You've just made $wgLogos non-false, but not set the ['1x'] value yet. This blew up Beta Cluster (T244370) and wil" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570304 (https://phabricator.wikimedia.org/T232140) (owner: 10Jdlrobson) [15:43:44] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [15:43:47] (03CR) 10Arturo Borrero Gonzalez: "LGTM." (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/570343 (owner: 10Jbond) [15:43:49] (03CR) 10CDanis: [C: 03+1] Explicitly add theemin to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/570288 (owner: 10Muehlenhoff) [15:46:14] (03CR) 10Jforrester: "> Patch Set 4:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570304 (https://phabricator.wikimedia.org/T232140) (owner: 10Jdlrobson) [15:47:48] (03CR) 10Elukey: [C: 03+2] role::mediawiki::memcached:gutter: set threads to 16 [puppet] - 10https://gerrit.wikimedia.org/r/570364 (https://phabricator.wikimedia.org/T240684) (owner: 10Elukey) [15:50:46] 10Operations, 10ops-eqiad, 10serviceops: (No Need By Date Provided) rack/setup/install mw[1385-1413].eqiad.wmnet - https://phabricator.wikimedia.org/T241849 (10jijiki) @wiki_willy @Jclark-ctr I understand that eqiad is overloaded, but is there a chance we can raise the priority of this? We have been sufferin... [15:51:02] (03PS8) 10Jbond: wmflib::require_domains: add new function to to replace require_realm [puppet] - 10https://gerrit.wikimedia.org/r/570343 [15:51:38] 10Operations, 10SRE-Access-Requests: Requesting access to deployment for niedzielski - https://phabricator.wikimedia.org/T243924 (10MarkTraceur) Approved as manager! [15:51:52] 10Operations, 10SRE-Access-Requests: Requesting access to deployment for niedzielski - https://phabricator.wikimedia.org/T243924 (10MarkTraceur) [15:52:31] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [15:52:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:49] I'm going to do a quick deploy or two. 
[15:52:54] (03CR) 10Jforrester: [C: 03+2] Restore wgLogoHD to wikis without a MinervaCustomLogos defined [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570326 (https://phabricator.wikimedia.org/T232140) (owner: 10Jdlrobson) [15:52:56] (03PS3) 10Jbond: realm global: make the realm variable a global in labs [puppet] - 10https://gerrit.wikimedia.org/r/570331 (https://phabricator.wikimedia.org/T244222) [15:54:14] (03PS9) 10Jbond: wmflib::require_domains: add new function to to replace require_realm [puppet] - 10https://gerrit.wikimedia.org/r/570343 [15:54:24] (03PS5) 10Jbond: wmflib::require_domains: use require_domains instead of require_realm [puppet] - 10https://gerrit.wikimedia.org/r/570348 (https://phabricator.wikimedia.org/T244222) [15:54:50] (03CR) 10Jbond: "Thanks updated" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/570343 (owner: 10Jbond) [15:54:50] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:54:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:26] 10Operations, 10ops-codfw: (No Need By Date Provided) codfw: rack/setup/install elastic20{55,56,57,58,59,60}.wikimedia.org - https://phabricator.wikimedia.org/T241337 (10Papaul) Talked to @Gehel on IRC those servers will be in the private VLAN and not in the public VLAN with Stretch as OS. [15:59:23] (03PS1) 10Arturo Borrero Gonzalez: realm: make the realm variable a global in labs [puppet] - 10https://gerrit.wikimedia.org/r/570369 (https://phabricator.wikimedia.org/T244222) [16:00:04] anomie and urandom: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Sessionstore deployment (mediawiki-config) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200205T1600). [16:00:04] urandom: A patch you scheduled for Sessionstore deployment (mediawiki-config) is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:02:05] (03PS1) 10Elukey: Raise memcached threads to 8 (was: 4) on mc1025 [puppet] - 10https://gerrit.wikimedia.org/r/570370 [16:05:42] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp5012.eqsin.wmnet'] ` Of which those **FAILED**: ` ['cp5012.eqsin.wmnet'] ` [16:07:04] (03PS5) 10Eevans: Configure group0 & group1 for kask-transition (multi-write kask/redis) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569678 (https://phabricator.wikimedia.org/T243106) [16:07:33] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10Papaul) All the servers on the table above are running Buster I had a chat with @MoritzMuehlenhoff and he mentioned that we need to install Stretch on those servers so I have to upd... 
[16:07:57] !log update puppet compiler's facts [16:07:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:16] (03CR) 10Anomie: [C: 03+2] Configure group0 & group1 for kask-transition (multi-write kask/redis) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569678 (https://phabricator.wikimedia.org/T243106) (owner: 10Eevans) [16:09:36] (03PS1) 10CDanis: librenms API scrape alert: make critical [puppet] - 10https://gerrit.wikimedia.org/r/570371 (https://phabricator.wikimedia.org/T224888) [16:12:38] (03PS2) 10CDanis: librenms API scrape alert: make critical & change name [puppet] - 10https://gerrit.wikimedia.org/r/570371 (https://phabricator.wikimedia.org/T224888) [16:14:05] (03PS19) 10ArielGlenn: write out and reuse pagerange info for big page content jobs [dumps] - 10https://gerrit.wikimedia.org/r/566580 (https://phabricator.wikimedia.org/T243434) [16:15:32] (03CR) 10Jbond: "In comparing this with https://gerrit.wikimedia.org/r/c/operations/puppet/+/570331 im not sure if they really differ on a functional level" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/570369 (https://phabricator.wikimedia.org/T244222) (owner: 10Arturo Borrero Gonzalez) [16:16:16] (03CR) 10CDanis: "PCC looks correct https://puppet-compiler.wmflabs.org/compiler1001/20630/icinga1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/570371 (https://phabricator.wikimedia.org/T224888) (owner: 10CDanis) [16:18:25] (03CR) 10Alexandros Kosiaris: [C: 03+1] Raise memcached threads to 8 (was: 4) on mc1025 [puppet] - 10https://gerrit.wikimedia.org/r/570370 (owner: 10Elukey) [16:18:27] (03PS1) 10Elukey: presto: set server parameter in local presto exec script [puppet] - 10https://gerrit.wikimedia.org/r/570372 [16:18:31] (03CR) 10Alexandros Kosiaris: [C: 03+1] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/570370 (owner: 10Elukey) [16:20:16] (03PS3) 10Muehlenhoff: Switch authdns* to standard Partman recipes [puppet] - 
10https://gerrit.wikimedia.org/r/566476 (https://phabricator.wikimedia.org/T156955) [16:20:48] (03CR) 10Ayounsi: librenms API scrape alert: make critical & change name (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/570371 (https://phabricator.wikimedia.org/T224888) (owner: 10CDanis) [16:21:20] RECOVERY - mediawiki originals uploads -hourly- for eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [16:21:56] (03CR) 10CDanis: librenms API scrape alert: make critical & change name (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/570371 (https://phabricator.wikimedia.org/T224888) (owner: 10CDanis) [16:22:24] RECOVERY - mediawiki originals uploads -hourly- for codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [16:22:53] urandom: BTW, I still have the conch. As soon as I can finish, it's yours. [16:23:11] (03CR) 10Ayounsi: [C: 03+1] librenms API scrape alert: make critical & change name [puppet] - 10https://gerrit.wikimedia.org/r/570371 (https://phabricator.wikimedia.org/T224888) (owner: 10CDanis) [16:23:12] conch? [16:23:22] I'm still mid-deploy. [16:23:31] Or, rather, I would be, if CI was working. [16:23:34] (03Merged) 10jenkins-bot: Restore wgLogoHD to wikis without a MinervaCustomLogos defined [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570326 (https://phabricator.wikimedia.org/T232140) (owner: 10Jdlrobson) [16:23:40] Finally. 
[16:23:57] urandom: https://en.wikipedia.org/wiki/Conch#Literature_and_the_oral_tradition [16:24:12] (the second bullet point about Lord of the Flies) [16:24:52] (03PS2) 10Elukey: presto: set server parameter in local presto exec script [puppet] - 10https://gerrit.wikimedia.org/r/570372 [16:25:22] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: T232140 Restore wgLogoHD to wikis without a MinervaCustomLogos defined (duration: 01m 09s) [16:25:23] it's like a virtual human-level mutex lock. whomever holds the conch is who is acting/speaking at the time. [16:25:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:26] T232140: Separate out logo handling into square image logos and long text/wordmark banner logos - https://phabricator.wikimedia.org/T232140 [16:25:37] urandom: OK, over to you. [16:25:38] 10Operations, 10Puppet, 10Patch-For-Review, 10User-jbond: Investigate using the rich_data opsion to support Binary and binary_file for binary data - https://phabricator.wikimedia.org/T236481 (10jbond) [16:25:45] I actually think I knew this, but I don't know what it means in the context here [16:26:03] 10Operations, 10puppet-compiler: puppet-compiler fails to compile production catalog for restbase2014 - https://phabricator.wikimedia.org/T238053 (10jbond) [16:26:07] (03Merged) 10jenkins-bot: Configure group0 & group1 for kask-transition (multi-write kask/redis) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569678 (https://phabricator.wikimedia.org/T243106) (owner: 10Eevans) [16:26:11] in this context, it's the human currently deploying holds the conch, so we don't have two deployers stepping on each other. [16:26:17] 10Operations, 10puppet-compiler: puppet-compiler fails to compile production catalog for restbase2014 - https://phabricator.wikimedia.org/T238053 (10jbond) [16:26:43] urandom: As in I said in here half an hour ago that I was deploying [16:26:57] (03CR) 10Cwhite: "Nonblocking reply." 
[puppet] - 10https://gerrit.wikimedia.org/r/570330 (https://phabricator.wikimedia.org/T244222) (owner: 10Jbond) [16:27:03] bblack: gotcha, I didn't realize the previous window was waiting [16:27:12] 10Operations, 10puppet-compiler: puppet-compiler fails to compile production catalog for restbase2014 - https://phabricator.wikimedia.org/T238053 (10jbond) [16:27:22] James_F: yeah, I didn't go back that far in the backscroll...my bad [16:27:26] there seems to be a big queue for jenkins https://integration.wikimedia.org/zuul/ [16:27:29] sigh [16:29:59] 10Operations, 10puppet-compiler: puppet-compiler fails to compile production catalog for restbase2014 - https://phabricator.wikimedia.org/T238053 (10jbond) [16:30:13] 10Operations, 10puppet-compiler: puppet-compiler fails to compile production catalog for restbase2014 - https://phabricator.wikimedia.org/T238053 (10jbond) [16:30:46] elukey: We're on it. [16:31:04] elukey: known, waiting for g&s to clear and then Taking Action™ [16:32:01] (03CR) 10Muehlenhoff: [C: 03+2] Switch authdns* to standard Partman recipes [puppet] - 10https://gerrit.wikimedia.org/r/566476 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [16:33:41] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Raise memcached threads to 8 (was: 4) on mc1025 [puppet] - 10https://gerrit.wikimedia.org/r/570370 (owner: 10Elukey) [16:35:32] (03PS2) 10Muehlenhoff: Switch cescout* to standard Partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/570316 (https://phabricator.wikimedia.org/T156955) [16:37:03] !log ppchelko@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:569678]] Config: Enable sessionstore on group0 and 1 T243106 (duration: 01m 08s) [16:37:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:07] T243106: Phased rollout of sessionstore to production fleet - https://phabricator.wikimedia.org/T243106 [16:38:09] (03PS1) 10Filippo Giunchedi: WIP: elasticsearch cirrus logs to logging pipeline 
[puppet] - 10https://gerrit.wikimedia.org/r/570374 [16:40:46] jouncebot: now [16:40:46] For the next 0 hour(s) and 19 minute(s): Sessionstore deployment (mediawiki-config) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200205T1600) [16:40:52] jouncebot: next [16:40:52] In 2 hour(s) and 19 minute(s): Morning SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200205T1900) [16:41:49] (03CR) 10Thcipriani: "recheck" [dumps] - 10https://gerrit.wikimedia.org/r/566580 (https://phabricator.wikimedia.org/T243434) (owner: 10ArielGlenn) [16:42:04] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [16:43:09] addshore: Prod is not clear. [16:43:25] James_F: indeed :) I was just looking around! [16:44:57] messaging jouncebot -- the universal "everyone stand back" signal [16:51:24] thcipriani: hold my beer [16:51:57] hahaha [16:52:41] (03PS2) 10Filippo Giunchedi: WIP: elasticsearch cirrus logs to logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/570374 [16:53:36] !log Sessionstore deployment (mediawiki-config) is done [16:53:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:02] (03PS2) 10Alexandros Kosiaris: standard: Add linux-perf to standard packages [puppet] - 10https://gerrit.wikimedia.org/r/570254 [16:59:12] (03CR) 10Paladox: "recheck" [software/gerrit] (wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/566890 (owner: 10Paladox) [17:00:22] (03CR) 10jerkins-bot: [V: 04-1] standard: Add linux-perf to standard packages [puppet] - 10https://gerrit.wikimedia.org/r/570254 (owner: 10Alexandros Kosiaris) [17:00:25] (03CR) 10Muehlenhoff: [C: 03+1] standard: Add linux-perf to standard packages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/570254 (owner: 
10Alexandros Kosiaris) [17:00:49] (03CR) 10Muehlenhoff: [C: 03+1] standard: Add linux-perf to standard packages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/570254 (owner: 10Alexandros Kosiaris) [17:02:10] (03PS4) 10Dzahn: add IP addresses for new install servers on buster [dns] - 10https://gerrit.wikimedia.org/r/569679 (https://phabricator.wikimedia.org/T224576) [17:02:32] 10Operations, 10observability: Upgrade Grafana to 6.6 - https://phabricator.wikimedia.org/T244208 (10fgiunchedi) [17:05:31] (03PS3) 10Paladox: Merge branch 'stable-2.16' into wmf/stable-2.16 [software/gerrit] (wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/562363 [17:05:50] (03PS1) 10Jforrester: Set $wgLogos['1x'] (new style access) to $wgLogo (old style access) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570378 (https://phabricator.wikimedia.org/T232140) [17:05:53] (03PS1) 10Jforrester: Merge $wgLogo into $wgLogos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570379 (https://phabricator.wikimedia.org/T232140) [17:05:57] (03CR) 10Elukey: [C: 03+2] presto: set server parameter in local presto exec script [puppet] - 10https://gerrit.wikimedia.org/r/570372 (owner: 10Elukey) [17:06:34] (03CR) 10Jforrester: [C: 04-2] "Not until wmf.19 is everywhere and won't regress. 
Also note that this is a deploy-trap; sync CommonSettings ahead of IS or it'll break the" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570379 (https://phabricator.wikimedia.org/T232140) (owner: 10Jforrester) [17:07:17] (03CR) 10jerkins-bot: [V: 04-1] Merge $wgLogo into $wgLogos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570379 (https://phabricator.wikimedia.org/T232140) (owner: 10Jforrester) [17:07:26] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [17:08:48] (03Abandoned) 10Paladox: Bump Bazel version to 2.0.0 [software/gerrit] (wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/566890 (owner: 10Paladox) [17:08:56] 10Operations, 10ops-eqiad, 10serviceops: (No Need By Date Provided) rack/setup/install mw[1385-1413].eqiad.wmnet - https://phabricator.wikimedia.org/T241849 (10wiki_willy) Hi @jijiki - @Cmjohnson is currently working on finishing up T236437, which also had a previous need by date of a month ago. Would the c... 
[17:12:42] (03CR) 10Elukey: [C: 03+2] Raise memcached threads to 8 (was: 4) on mc1025 [puppet] - 10https://gerrit.wikimedia.org/r/570370 (owner: 10Elukey) [17:25:58] (03PS1) 10Sbisson: Enable InukaPageView logging on production Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570381 (https://phabricator.wikimedia.org/T238029) [17:26:47] (03CR) 10Muehlenhoff: [C: 03+2] Switch cescout* to standard Partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/570316 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [17:27:09] (03CR) 10Alexandros Kosiaris: standard: Add linux-perf to standard packages (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/570254 (owner: 10Alexandros Kosiaris) [17:27:22] (03PS3) 10Alexandros Kosiaris: standard: Add linux-perf to standard packages [puppet] - 10https://gerrit.wikimedia.org/r/570254 [17:28:38] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [17:28:49] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/570371 (https://phabricator.wikimedia.org/T224888) (owner: 10CDanis) [17:29:37] (03CR) 10Alexandros Kosiaris: "fixed. thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/570254 (owner: 10Alexandros Kosiaris) [17:30:03] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/570254 (owner: 10Alexandros Kosiaris) [17:30:56] (03CR) 10Alexandros Kosiaris: [C: 03+1] librenms API scrape alert: make critical & change name [puppet] - 10https://gerrit.wikimedia.org/r/570371 (https://phabricator.wikimedia.org/T224888) (owner: 10CDanis) [17:31:34] (03CR) 10CDanis: [C: 03+2] librenms API scrape alert: make critical & change name [puppet] - 10https://gerrit.wikimedia.org/r/570371 (https://phabricator.wikimedia.org/T224888) (owner: 10CDanis) [17:33:04] !log reedy@deploy1001 Synchronized php-1.35.0-wmf.18/includes/: T244300 (duration: 01m 14s) [17:33:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:07] T244300: Argument 1 passed to Title::getLanguageConverter() must be an instance of Language, instance of StubUserLang given, called in /srv/mediawiki/php-1.35.0-wmf.18/includes/Title.php on line 207 - https://phabricator.wikimedia.org/T244300 [17:34:31] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/570159 (https://phabricator.wikimedia.org/T243935) (owner: 10Volans) [17:34:35] !log reedy@deploy1001 Synchronized php-1.35.0-wmf.18/languages/: T244300 (duration: 01m 13s) [17:34:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:44] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:37:17] winning [17:43:28] (03CR) 10Dzahn: [C: 03+2] add IP addresses for new install servers on buster [dns] - 10https://gerrit.wikimedia.org/r/569679 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [17:44:08] PROBLEM - parsoid on scandium is CRITICAL: connect to address 10.64.48.94 and port 8142: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/parsoid [17:44:27] ^ eh.. scandium is a test server [17:44:35] expired long downtime [17:44:48] not critical at all. downtiming it again [17:45:47] ACKNOWLEDGEMENT - parsoid on scandium is CRITICAL: connect to address 10.64.48.94 and port 8142: Connection refused daniel_zahn test server https://wikitech.wikimedia.org/wiki/Services/Monitoring/parsoid [17:47:33] re-enabling notifications that should not be disabled anymore (for other stuff). often those are forgotten because unlike downtimes they never expire by themselves [17:48:48] 10Operations, 10SRE-Access-Requests: Requesting access to Deployment for Clarakosi - https://phabricator.wikimedia.org/T244381 (10Clarakosi) [17:51:28] !log ganeti1017 - rebooting (not in use yet) [17:51:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:58] PROBLEM - Host ganeti1017 is DOWN: PING CRITICAL - Packet loss = 100% [17:53:06] 10Operations, 10ops-eqsin, 10Traffic: rack/setup/install ps[12]-60[34]-eqsin - https://phabricator.wikimedia.org/T242250 (10RobH) Update: I've coordinated with Jin via Google Hangout Messages and he has reviewed the rack and ensured he has all the cabled needed. I sent in this email to him, but since then... 
[17:53:15] RECOVERY - Host ganeti1017 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms [17:53:30] 10Operations, 10ops-eqsin, 10Traffic: rack/setup/install ps[12]-60[34]-eqsin - https://phabricator.wikimedia.org/T242250 (10RobH) [17:54:09] RECOVERY - Check whether microcode mitigations for CPU vulnerabilities are applied on ganeti1017 is OK: OK - All expected CPU flags found https://wikitech.wikimedia.org/wiki/Microcode [17:54:51] PROBLEM - Logs skipped by trafficserver-tls on cp5012 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.132.0.112: Connection reset by peer https://wikitech.wikimedia.org/wiki/ATS [17:54:51] PROBLEM - check_trafficserver_log_fifo_tls_tls on cp5012 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.132.0.112: Connection reset by peer https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [17:54:53] PROBLEM - Varnish HTTP text-frontend - port 3121 on cp5012 is CRITICAL: connect to address 10.132.0.112 and port 3121: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [17:54:55] PROBLEM - check_trafficserver_backend_config_status on cp5012 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.132.0.112: Connection reset by peer https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [17:55:03] PROBLEM - traffic-pool service on cp5012 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.132.0.112: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:55:03] PROBLEM - TLS Lua configuration file on cp5012 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.132.0.112: Connection reset by peer https://wikitech.wikimedia.org/wiki/ATS [17:55:03] PROBLEM - Default ATS Lua configuration file on cp5012 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.132.0.112: Connection reset by peer https://wikitech.wikimedia.org/wiki/ATS [17:55:09] PROBLEM - check_trafficserver_log_fifo_notpurge_backend on cp5012 is CRITICAL: CHECK_NRPE: Error - Could not connect to 
10.132.0.112: Connection reset by peer https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [17:55:09] PROBLEM - Confd vcl based reload on cp5012 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.132.0.112: Connection reset by peer https://wikitech.wikimedia.org/wiki/Varnish [17:55:09] PROBLEM - Freshness of OCSP Stapling files -ATS-TLS acme-chief- on cp5012 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.132.0.112: Connection reset by peer https://wikitech.wikimedia.org/wiki/HTTPS/Unified_Certificates [17:55:15] PROBLEM - Ensure trafficserver_exporter is running for instance backend on cp5012 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.132.0.112: Connection reset by peer https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [17:55:15] PROBLEM - check_trafficserver_log_fifo_purge_backend on cp5012 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.132.0.112: Connection reset by peer https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [17:55:15] PROBLEM - Webrequests Varnishkafka log producer on cp5012 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.132.0.112: Connection reset by peer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [17:55:15] PROBLEM - Freshness of OCSP Stapling files -ATS-TLS- on cp5012 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.132.0.112: Connection reset by peer https://wikitech.wikimedia.org/wiki/HTTPS/Unified_Certificates [17:55:21] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp5012 is CRITICAL: connect to address 10.132.0.112 and port 3124: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [17:55:28] 10Operations, 10SRE-Access-Requests: Requesting access to Deployment for Clarakosi - https://phabricator.wikimedia.org/T244381 (10WDoranWMF) As @Clarakosi direct manager I approve this request as it is necessary for her to be able to deploy as part of her work on the Core Platform Team. [17:55:38] this is only cp5012 right? 
[17:55:43] PROBLEM - Check systemd state on cp5012 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.132.0.112: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:55:43] PROBLEM - Ensure traffic_manager is running for instance backend on cp5012 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.132.0.112: Connection reset by peer https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [17:55:49] PROBLEM - Ensure traffic_manager is running for instance tls on cp5012 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.132.0.112: Connection reset by peer https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [17:55:49] PROBLEM - Varnish HTTP text-frontend - port 3127 on cp5012 is CRITICAL: connect to address 10.132.0.112 and port 3127: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [17:55:51] PROBLEM - Logs skipped by trafficserver on cp5012 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.132.0.112: Connection reset by peer https://wikitech.wikimedia.org/wiki/ATS [17:55:51] PROBLEM - configured eth on cp5012 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.132.0.112: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [17:55:53] PROBLEM - Ensure traffic_server is running for instance tls on cp5012 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.132.0.112: Connection reset by peer https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [17:55:57] PROBLEM - Ensure traffic_server is running for instance backend on cp5012 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.132.0.112: Connection reset by peer https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [17:56:01] PROBLEM - Ensure trafficserver_exporter is running for instance tls on cp5012 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.132.0.112: Connection reset by peer https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [17:56:01] PROBLEM - confd 
service on cp5012 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.132.0.112: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:56:11] Cc: vgutierrez, ema, bblack --^ [17:56:17] PROBLEM - IPMI Sensor Status on cp5012 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.132.0.112: Connection reset by peer https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [17:56:23] is anybody of you working on cp5012? [17:56:44] that looks so bad but it's only a single host at least [17:56:55] RECOVERY - parsoid on scandium is OK: HTTP OK: HTTP/1.1 200 OK - 1535 bytes in 0.053 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/parsoid [17:57:02] yes [17:57:09] new install? [17:57:10] cp5012 is depooled [17:57:11] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp5012 is CRITICAL: connect to address 10.132.0.112 and port 3125: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [17:57:11] PROBLEM - Varnish HTTP text-frontend - port 3122 on cp5012 is CRITICAL: connect to address 10.132.0.112 and port 3122: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [17:57:14] alright [17:57:14] new install [17:57:18] ah okok [17:57:20] hmm let me check it [17:57:21] PROBLEM - Varnish HTTP text-frontend - port 80 on cp5012 is CRITICAL: connect to address 10.132.0.112 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [17:58:03] PROBLEM - MD RAID on cp5012 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.132.0.112: Connection reset by peer https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [17:58:03] PROBLEM - dhclient process on cp5012 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.132.0.112: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [17:58:04] probably icinga just added 
these alerts a couple seconds ago? [17:58:15] PROBLEM - puppet last run on cp5012 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.132.0.112: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [17:58:21] want me to silence them or see them recover? [17:58:28] silence it please [17:59:22] done. for 4 hours or so [17:59:26] can do longer [17:59:56] thx [18:00:01] 10Operations, 10Traffic, 10netops, 10observability, 10Patch-For-Review: Network port utilization alerts should be paging - https://phabricator.wikimedia.org/T224888 (10CDanis) 05Open→03Resolved [18:00:07] 10Operations, 10ops-eqiad, 10serviceops: (No Need By Date Provided) rack/setup/install mw[1385-1413].eqiad.wmnet - https://phabricator.wikimedia.org/T241849 (10jijiki) @wiki_willy hopefully it will help, but we generally believe that we will not be able to cope well again when we have sudden request spikes.... [18:02:13] 10Operations, 10Citoid, 10Core Platform Team Workboards (Clinic Duty Team): Citoid is logging all request / response headers as separate fields - https://phabricator.wikimedia.org/T239713 (10jijiki) p:05Triage→03Normal [18:02:40] 10Operations, 10Citoid, 10serviceops, 10Core Platform Team Workboards (Clinic Duty Team): Citoid is logging all request / response headers as separate fields - https://phabricator.wikimedia.org/T239713 (10jijiki) [18:08:32] 10Operations, 10ops-eqiad, 10serviceops: (No Need By Date Provided) rack/setup/install mw[1385-1413].eqiad.wmnet - https://phabricator.wikimedia.org/T241849 (10wiki_willy) @jijiki - I'll talk to @Jclark-ctr and see if there's someway to expedite these. One of the current bottlenecks is getting rid of some o... [18:09:18] 10Operations, 10vm-requests: rack/setup/install ganeti10([09]|1[0-8]).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (10Dzahn) Thanks @Cmjohnson ! 
I rebooted ganeti1017 one more time because that fixes the [[ https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=ganeti1017&service=Check... [18:12:05] RECOVERY - Varnish HTTP text-frontend - port 3121 on cp5012 is OK: HTTP OK: HTTP/1.1 200 OK - 540 bytes in 0.471 second response time https://wikitech.wikimedia.org/wiki/Varnish [18:12:39] RECOVERY - Varnish HTTP text-frontend - port 3124 on cp5012 is OK: HTTP OK: HTTP/1.1 200 OK - 540 bytes in 0.470 second response time https://wikitech.wikimedia.org/wiki/Varnish [18:13:11] RECOVERY - configured eth on cp5012 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [18:13:11] RECOVERY - Logs skipped by trafficserver on cp5012 is OK: OK: no matches found in journal for unit trafficserver https://wikitech.wikimedia.org/wiki/ATS [18:13:13] RECOVERY - Varnish HTTP text-frontend - port 3127 on cp5012 is OK: HTTP OK: HTTP/1.1 200 OK - 540 bytes in 0.511 second response time https://wikitech.wikimedia.org/wiki/Varnish [18:13:15] RECOVERY - Ensure traffic_server is running for instance tls on cp5012 is OK: PROCS OK: 1 process with args /srv/trafficserver/tls/bin/traffic_server -M --run-root=/srv/trafficserver/tls/runroot.yaml --httpport 443 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [18:13:17] RECOVERY - Ensure traffic_server is running for instance backend on cp5012 is OK: PROCS OK: 1 process with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [18:13:23] RECOVERY - confd service on cp5012 is OK: OK - confd is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:13:25] RECOVERY - Ensure trafficserver_exporter is running for instance tls on cp5012 is OK: PROCS OK: 1 process with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint https://127.0.0.1:443/_stats --port 9322 
https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [18:13:25] RECOVERY - Logs skipped by trafficserver-tls on cp5012 is OK: OK: no matches found in journal for unit trafficserver-tls https://wikitech.wikimedia.org/wiki/ATS [18:13:25] RECOVERY - check_trafficserver_log_fifo_tls_tls on cp5012 is OK: OK: TS_MAIN writing to and fifo-log-demux reading from /srv/trafficserver/tls/var/log/tls.pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [18:13:31] RECOVERY - check_trafficserver_backend_config_status on cp5012 is OK: OK: configuration is current https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [18:13:33] sigh... sorry about the noise [18:13:45] RECOVERY - Default ATS Lua configuration file on cp5012 is OK: OK https://wikitech.wikimedia.org/wiki/ATS [18:13:45] RECOVERY - TLS Lua configuration file on cp5012 is OK: OK https://wikitech.wikimedia.org/wiki/ATS [18:13:53] RECOVERY - Freshness of OCSP Stapling files -ATS-TLS acme-chief- on cp5012 is OK: OK https://wikitech.wikimedia.org/wiki/HTTPS/Unified_Certificates [18:13:53] RECOVERY - Confd vcl based reload on cp5012 is OK: reload-vcl has not been executed yet. 
https://wikitech.wikimedia.org/wiki/Varnish [18:13:53] RECOVERY - check_trafficserver_log_fifo_notpurge_backend on cp5012 is OK: OK: TS_MAIN writing to and fifo-log-demux reading from /var/log/trafficserver/notpurge.pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [18:14:01] RECOVERY - Ensure trafficserver_exporter is running for instance backend on cp5012 is OK: PROCS OK: 1 process with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint http://127.0.0.1:3128/_stats --port 9122 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [18:14:01] RECOVERY - Webrequests Varnishkafka log producer on cp5012 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [18:14:01] RECOVERY - Freshness of OCSP Stapling files -ATS-TLS- on cp5012 is OK: OK https://wikitech.wikimedia.org/wiki/HTTPS/Unified_Certificates [18:14:01] RECOVERY - check_trafficserver_log_fifo_purge_backend on cp5012 is OK: OK: TS_MAIN writing to and fifo-log-demux reading from /var/log/trafficserver/purge.pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [18:14:31] RECOVERY - MD RAID on cp5012 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [18:14:31] RECOVERY - dhclient process on cp5012 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [18:14:41] RECOVERY - Ensure traffic_manager is running for instance backend on cp5012 is OK: PROCS OK: 1 process with args /usr/bin/traffic_manager --nosyslog https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [18:14:43] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp5012 is OK: HTTP OK: HTTP/1.1 200 OK - 540 bytes in 0.470 second response time 
https://wikitech.wikimedia.org/wiki/Varnish [18:14:43] RECOVERY - Varnish HTTP text-frontend - port 3122 on cp5012 is OK: HTTP OK: HTTP/1.1 200 OK - 540 bytes in 0.470 second response time https://wikitech.wikimedia.org/wiki/Varnish [18:14:47] RECOVERY - Ensure traffic_manager is running for instance tls on cp5012 is OK: PROCS OK: 1 process with args /usr/bin/traffic_manager --run-root=/srv/trafficserver/tls --nosyslog https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [18:14:55] RECOVERY - Varnish HTTP text-frontend - port 80 on cp5012 is OK: HTTP OK: HTTP/1.1 200 OK - 540 bytes in 0.520 second response time https://wikitech.wikimedia.org/wiki/Varnish [18:14:55] RECOVERY - puppet last run on cp5012 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [18:16:12] (03PS1) 10Papaul: DHCP: Update DHCP file so the new mw servers can use Stretch and not Buster [puppet] - 10https://gerrit.wikimedia.org/r/570388 (https://phabricator.wikimedia.org/T241852) [18:20:19] (03CR) 10Effie Mouzeli: [C: 03+1] sre.switchdc.mediawiki: adapt to current status [cookbooks] - 10https://gerrit.wikimedia.org/r/570131 (https://phabricator.wikimedia.org/T243316) (owner: 10Volans) [18:21:14] (03CR) 10Effie Mouzeli: [C: 03+1] mediawiki: use cumin alias instead of role query [software/spicerack] - 10https://gerrit.wikimedia.org/r/570159 (https://phabricator.wikimedia.org/T243935) (owner: 10Volans) [18:21:18] 10Operations, 10Performance-Team, 10Traffic: Production load.php spends ~ 10% time doing output compression within PHP - https://phabricator.wikimedia.org/T242478 (10Krinkle) ###### Network | [Dashboard: Cluster overview (eqiad appservers)](https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&... 
[18:21:20] !log restart memcached on mc1025 with 8 threads (rollback - revert https://gerrit.wikimedia.org/r/#/c/570370/, run puppet, restart memcached) [18:21:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:27] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 159703264 and 3 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [18:23:10] !log rebooting cp5012 - T242093 [18:23:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:13] T242093: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 [18:23:19] PROBLEM - Host cp5012 is DOWN: PING CRITICAL - Packet loss = 100% [18:23:21] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [18:24:35] RECOVERY - Host cp5012 is UP: PING OK - Packet loss = 0%, RTA = 235.03 ms [18:25:41] RECOVERY - traffic-pool service on cp5012 is OK: OK - traffic-pool is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:25:52] (03PS1) 10Dzahn: site: add new ganeti hosts for refresh/expansion with spare role [puppet] - 10https://gerrit.wikimedia.org/r/570390 (https://phabricator.wikimedia.org/T228924) [18:26:07] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 44 probes of 525 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [18:26:55] (03CR) 10jerkins-bot: [V: 04-1] site: add new ganeti hosts for refresh/expansion with spare role [puppet] - 10https://gerrit.wikimedia.org/r/570390 (https://phabricator.wikimedia.org/T228924) (owner: 10Dzahn) [18:27:00] disabled notifications for cp5012 (will need manual re-enable 
though) [18:27:11] mutante: I'll reenable it soon [18:27:42] vgutierrez: ACK, cool [18:27:43] (03CR) 10Krinkle: Merge $wgLogo into $wgLogos (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570379 (https://phabricator.wikimedia.org/T232140) (owner: 10Jforrester) [18:28:33] (03PS2) 10Dzahn: site: add new ganeti hosts for refresh/expansion with spare role [puppet] - 10https://gerrit.wikimedia.org/r/570390 (https://phabricator.wikimedia.org/T228924) [18:29:55] (03CR) 10Reedy: [C: 03+1] Set $wgLogos['1x'] (new style access) to $wgLogo (old style access) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570378 (https://phabricator.wikimedia.org/T232140) (owner: 10Jforrester) [18:31:32] (03CR) 10Papaul: [C: 03+2] DHCP: Update DHCP file so the new mw servers can use Stretch and not Buster [puppet] - 10https://gerrit.wikimedia.org/r/570388 (https://phabricator.wikimedia.org/T241852) (owner: 10Papaul) [18:31:57] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 33 probes of 525 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [18:32:28] (03PS1) 10Dzahn: install_server: add install[12]003 to partman recipe regex [puppet] - 10https://gerrit.wikimedia.org/r/570392 (https://phabricator.wikimedia.org/T224576) [18:32:38] (03PS2) 10Dzahn: install_server: add install[12]003 to partman recipe regex [puppet] - 10https://gerrit.wikimedia.org/r/570392 (https://phabricator.wikimedia.org/T224576) [18:33:43] (03PS1) 10Ppchelko: Session Strore: Switch group0 and group1 to kask-session [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570393 (https://phabricator.wikimedia.org/T243106) [18:34:37] (03CR) 10Ppchelko: "to be rolled out on 02/06/2020" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570393 (https://phabricator.wikimedia.org/T243106) (owner: 10Ppchelko) [18:35:27] (03CR) 10Dzahn: [C: 
03+2] install_server: add install[12]003 to partman recipe regex [puppet] - 10https://gerrit.wikimedia.org/r/570392 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [18:35:57] !log pooling cp5012 - T242093 [18:35:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:59] T242093: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 [18:36:46] (03CR) 10RLazarus: [C: 03+2] dnsdisc: fix typo in docstring [software/spicerack] - 10https://gerrit.wikimedia.org/r/570160 (owner: 10Volans) [18:39:30] 10Operations, 10ops-codfw, 10serviceops, 10Patch-For-Review: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2310.codfw.wmnet ` The log can be found in `/var/log... [18:39:53] 10Operations, 10Release-Engineering-Team-TODO, 10Security-Team, 10Release-Engineering-Team (Deployment services), 10User-greg: Determine a core set or a checklist of permissions for deployment purpose - https://phabricator.wikimedia.org/T140270 (10greg) 05Open→03Resolved a:03greg I think the combin... 
[18:41:52] (03CR) 10Jforrester: [C: 04-2] "a" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570379 (https://phabricator.wikimedia.org/T232140) (owner: 10Jforrester) [18:42:00] (03PS1) 10Ppchelko: Session Store: Switch group2 to kask-transition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570395 (https://phabricator.wikimedia.org/T243106) [18:42:02] (03PS1) 10Ppchelko: Session Store: Switch everything to kask-session [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570396 (https://phabricator.wikimedia.org/T243106) [18:42:13] 10Operations, 10Research, 10serviceops: Request for a in-memory caching data set for caching research - https://phabricator.wikimedia.org/T240503 (10jijiki) p:05Triage→03Low [18:42:55] (03CR) 10Ppchelko: "To be deployed on 02/10/2020" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570395 (https://phabricator.wikimedia.org/T243106) (owner: 10Ppchelko) [18:43:04] 10Operations: Remove mobrovac@wikimedia.org from techcom@wikimedia.org - https://phabricator.wikimedia.org/T244146 (10jijiki) p:05Triage→03Normal [18:43:43] 10Operations: Remove mobrovac@wikimedia.org from techcom@wikimedia.org - https://phabricator.wikimedia.org/T244146 (10Dzahn) 05Open→03Resolved a:03Dzahn done! cc: @mobrovac [18:43:49] PROBLEM - etherpad_up reduced availability on icinga1001 is CRITICAL: 0 le 0.8 https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_exporters_%22up%22_metrics_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:43:55] (03CR) 10RLazarus: [C: 03+1] sre.switchdc.mediawiki: adapt to current status [cookbooks] - 10https://gerrit.wikimedia.org/r/570131 (https://phabricator.wikimedia.org/T243316) (owner: 10Volans) [18:44:16] 10Operations: Remove mobrovac@wikimedia.org from techcom@wikimedia.org - https://phabricator.wikimedia.org/T244146 (10jijiki) @Joe is this list on our end or OIT's ? 
[18:44:21] (CR) Krinkle: Restore wgLogoHD to wikis without a MinervaCustomLogos defined (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/570326 (https://phabricator.wikimedia.org/T232140) (owner: Jdlrobson)
[18:44:28] (CR) Ppchelko: "To be deployed on 02/11 if everything is good." [mediawiki-config] - https://gerrit.wikimedia.org/r/570396 (https://phabricator.wikimedia.org/T243106) (owner: Ppchelko)
[18:44:53] Operations, LDAP-Access-Requests, WMF-Legal: Add Itamar Givon to the ldap/wmde group - https://phabricator.wikimedia.org/T244148 (jijiki) p:Triage→Normal a:jijiki
[18:45:32] Operations, LDAP-Access-Requests: Request LDAP access to the WMF group for Edna M - https://phabricator.wikimedia.org/T244176 (jijiki) p:Triage→Normal
[18:45:41] RECOVERY - etherpad_up reduced availability on icinga1001 is OK: (C)0.8 le (W)0.9 le 1 https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_exporters_%22up%22_metrics_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:46:45] Operations, WMF-Blog-Social-Team, WMF-Communications, Wikimedia-Mailing-lists: Delete mailing list "worldcup2018" - https://phabricator.wikimedia.org/T244316 (Dzahn) Open→Resolved a:Dzahn done! [fermium:~] $ sudo rmlist worldcup2018
[18:46:55] Operations, LDAP-Access-Requests: Request LDAP access to the WMF group for Edna M - https://phabricator.wikimedia.org/T244176 (jijiki) @Edna We will need your Wikitech username in order to be able to add you to the WMF group, as well as an approval from your manager. Thank you!
[18:47:24] Operations, observability, vm-requests: Provision grafana VM in codfw - https://phabricator.wikimedia.org/T244357 (Dzahn)
[18:48:03] Operations, Performance-Team, SRE-Access-Requests: Requesting access to deployment for dpifke - https://phabricator.wikimedia.org/T244183 (jijiki) p:Triage→Normal a:jijiki
[18:48:20] Operations, ops-codfw, netops: codfw: Delete cloud interface-range - https://phabricator.wikimedia.org/T244196 (jijiki) p:Triage→Normal
[18:48:39] Operations, observability, vm-requests: Provision grafana VM in codfw - https://phabricator.wikimedia.org/T244357 (Dzahn)
[18:49:06] Operations, observability, vm-requests: Provision grafana VM in codfw - https://phabricator.wikimedia.org/T244357 (Dzahn) added vm-requests tag and pasted vm-request form. please add the missing data above.
[18:49:45] (PS2) Sbisson: Enable InukaPageView logging on production Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/570381 (https://phabricator.wikimedia.org/T238029)
[18:50:24] Operations, Wikimedia-Mailing-lists: Allow list admins to train spam filters - https://phabricator.wikimedia.org/T244241 (jijiki) @Aklapper we will have to dig into this a bit, thank you!
[18:50:53] Operations, Wikimedia-Mailing-lists, serviceops: Allow list admins to train spam filters - https://phabricator.wikimedia.org/T244241 (jijiki) p:Triage→Normal
[18:51:02] Operations, serviceops: Reduce read pressure on memcached servers by adding a machine-local Memcache instance - https://phabricator.wikimedia.org/T244340 (jijiki) p:Triage→Normal
[18:51:48] Operations, Patch-For-Review: Upgrade install servers to Buster - https://phabricator.wikimedia.org/T224576 (Dzahn)
[18:53:39] Operations, SRE-Access-Requests: Requesting access to Deployment for Clarakosi - https://phabricator.wikimedia.org/T244381 (Dzahn) a:Dzahn
[18:53:47] Operations, Wikimedia-Mailing-lists, serviceops: Allow list admins to train spam filters - https://phabricator.wikimedia.org/T244241 (Reedy) https://blogs.gnome.org/ovitters/2008/06/07/using-moderated-messages-to-train-the-bayes-classifier/ >I’ve added a patch to Mailman
[18:55:54] (PS5) Dzahn: define 2 API appservers per row in codfw as canary API appservers [puppet] - https://gerrit.wikimedia.org/r/564175 (https://phabricator.wikimedia.org/T242606)
[18:56:57] (CR) Dzahn: [C: +2] define 2 API appservers per row in codfw as canary API appservers [puppet] - https://gerrit.wikimedia.org/r/564175 (https://phabricator.wikimedia.org/T242606) (owner: Dzahn)
[18:57:39] Operations, Gerrit-Privilege-Requests, SRE-Access-Requests: Request for +2 access to mediawiki-config - https://phabricator.wikimedia.org/T244389 (MarcoAurelio) Pinging SRE people as this is normally done during SRE onboarding when getting `deployment` or higher.
[18:58:42] (CR) RLazarus: [C: +1] mediawiki: use cumin alias instead of role query [software/spicerack] - https://gerrit.wikimedia.org/r/570159 (https://phabricator.wikimedia.org/T243935) (owner: Volans)
[19:00:04] RoanKattouw, Niharika, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do Morning SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200205T1900).
[19:00:04] Ammarpad and niedzielski: A patch you scheduled for Morning SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[19:00:55] (CR) Dzahn: "no functional change on the hosts or in puppet by changing this role, it includes the same things. noop" [puppet] - https://gerrit.wikimedia.org/r/564175 (https://phabricator.wikimedia.org/T242606) (owner: Dzahn)
[19:01:48] (PS2) Jforrester: Merge $wgLogo into $wgLogos [mediawiki-config] - https://gerrit.wikimedia.org/r/570379 (https://phabricator.wikimedia.org/T232140)
[19:01:53] o/ It looks like James_F has already merged my patch! Thanks James_F!
[19:03:17] niedzielski: Happy to help. Jon said it was urgent. :-)
[19:04:09] Ammarpad isn't around, it seems?
[19:04:15] 💯👍
[19:04:15] stephanebisson: You here?
[19:04:22] (CR) Jforrester: [C: +2] Set $wgLogos['1x'] (new style access) to $wgLogo (old style access) [mediawiki-config] - https://gerrit.wikimedia.org/r/570378 (https://phabricator.wikimedia.org/T232140) (owner: Jforrester)
[19:04:27] James_F: yep
[19:04:37] Cool, will deploy yours.
[19:04:47] James_F: But let's wait for my patch, I need to recheck something first
[19:05:14] (CR) Jforrester: [C: +2] Enable InukaPageView logging on production Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/570381 (https://phabricator.wikimedia.org/T238029) (owner: Sbisson)
[19:05:16] (Merged) jenkins-bot: Set $wgLogos['1x'] (new style access) to $wgLogo (old style access) [mediawiki-config] - https://gerrit.wikimedia.org/r/570378 (https://phabricator.wikimedia.org/T232140) (owner: Jforrester)
[19:05:29] Ok, stopping.
[19:05:36] (CR) Jforrester: [C: +1] "Not yet." [mediawiki-config] - https://gerrit.wikimedia.org/r/570381 (https://phabricator.wikimedia.org/T238029) (owner: Sbisson)
[19:06:59] Operations, serviceops, Patch-For-Review: No mw canary servers in codfw - https://phabricator.wikimedia.org/T242606 (Dzahn) The following are now declared canary API appservers in site.pp: mw2215, mw2216 (rack A3) mw2244, mw2245 (rack A4)
[19:08:44] James_F: please go ahead
[19:09:41] (PS3) Jforrester: Enable InukaPageView logging on production Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/570381 (https://phabricator.wikimedia.org/T238029) (owner: Sbisson)
[19:09:45] (CR) Jforrester: [C: +2] Enable InukaPageView logging on production Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/570381 (https://phabricator.wikimedia.org/T238029) (owner: Sbisson)
[19:10:16] !log jforrester@deploy1001 scap failed: average error rate on 4/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/db09a36be5ed3e81155041f7d46ad040 for details)
[19:10:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:10:33] Hmm. Not good.
[19:10:45] (Merged) jenkins-bot: Enable InukaPageView logging on production Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/570381 (https://phabricator.wikimedia.org/T238029) (owner: Sbisson)
[19:10:50] Hmm.
[19:12:35] (PS1) Jforrester: Revert "Set $wgLogos['1x'] (new style access) to $wgLogo (old style access)" [mediawiki-config] - https://gerrit.wikimedia.org/r/570402
[19:12:45] (CR) Jforrester: [C: +2] Revert "Set $wgLogos['1x'] (new style access) to $wgLogo (old style access)" [mediawiki-config] - https://gerrit.wikimedia.org/r/570402 (owner: Jforrester)
[19:12:50] * James_F sighs.
[19:13:08] stephanebisson: Sorry, one moment.
[19:13:46] (Merged) jenkins-bot: Revert "Set $wgLogos['1x'] (new style access) to $wgLogo (old style access)" [mediawiki-config] - https://gerrit.wikimedia.org/r/570402 (owner: Jforrester)
[19:14:34] stephanebisson: OK, live on mwdebug1001 – can you test?
[19:15:21] Operations, serviceops: Reduce read pressure on memcached servers by adding a machine-local Memcache instance - https://phabricator.wikimedia.org/T244340 (jijiki) The idea is obviously sensible. I do have some concerns about how this will perform with our loaded mwservers. We could wait to test this afte...
[19:15:28] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: Sync back revert of 975b4bbb9 (duration: 01m 06s)
[19:15:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:16:10] Operations, observability, vm-requests: Provision grafana VM in codfw - https://phabricator.wikimedia.org/T244357 (jijiki) p:Triage→High
[19:16:12] Operations, observability, serviceops, vm-requests: Provision grafana VM in codfw - https://phabricator.wikimedia.org/T244357 (jijiki)
[19:17:09] James_F: on it
[19:18:09] James_F: all good
[19:18:17] OK.
[19:19:41] (PS3) Jforrester: Merge $wgLogo into $wgLogos [mediawiki-config] - https://gerrit.wikimedia.org/r/570379 (https://phabricator.wikimedia.org/T232140)
[19:19:42] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T238029 Enable InukaPageView logging on production Wikipedias (duration: 01m 07s)
[19:19:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:19:45] T238029: Code for InukaPageView instrumentation - https://phabricator.wikimedia.org/T238029
[19:19:59] Operations, Core Platform Team, MediaWiki-Parser, serviceops, and 2 others: API action=parse should be poolcounter-limited if a re-parse is necessary - https://phabricator.wikimedia.org/T243803 (daniel)
[19:20:23] Operations, MediaWiki-Parser, serviceops, Core Platform Team Workboards (Clinic Duty Team), Wikimedia-Incident: API action=parse should be poolcounter-limited if a re-parse is necessary - https://phabricator.wikimedia.org/T243803 (daniel)
[19:20:26] OK, SWAT done.
[19:21:01] James_F: thanks
[19:22:25] Operations, ops-codfw, serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2310.codfw.wmnet'] ` Of which those **FAILED**: ` ['mw2310.codfw.wmnet'] `
[19:23:35] (PS2) Bartosz Dziewoński: Clean up VisualEditor settings [mediawiki-config] - https://gerrit.wikimedia.org/r/550535
[19:23:38] Operations, Traffic, Inuka-Team (Kanban), MW-1.35-notes (1.35.0-wmf.16; 2020-01-21), and 2 others: Code for InukaPageView instrumentation - https://phabricator.wikimedia.org/T238029 (SBisson) This was enabled in production just now.
[19:23:50] James_F:
[19:23:53] bah.
[19:23:56] James_F:
[19:24:01] (PS3) Bartosz Dziewoński: Clean up VisualEditor settings [mediawiki-config] - https://gerrit.wikimedia.org/r/550535
[19:24:15] James_F: while you're there, want to do that patch too?
^ i removed the problematic part, i'll do it separately
[19:24:35] Operations, Wikimedia-Incident: Investigate whether we can automatically share incident status docs with WMDE - https://phabricator.wikimedia.org/T244395 (RLazarus)
[19:24:55] (PS1) Dzahn: site: define 2 codfw appservers as canary_appservers [puppet] - https://gerrit.wikimedia.org/r/570405 (https://phabricator.wikimedia.org/T242606)
[19:25:41] Operations, Wikimedia-Incident: Investigate whether we can automatically share incident status docs with WMDE - https://phabricator.wikimedia.org/T244395 (RLazarus) p:Triage→Normal
[19:25:48] (CR) Dzahn: "2 ... or should i do 4 more? That would be 6 in total since we already have 2 (mwdebug2*) and you said "at least 4"." [puppet] - https://gerrit.wikimedia.org/r/570405 (https://phabricator.wikimedia.org/T242606) (owner: Dzahn)
[19:27:52] Operations, serviceops, Patch-For-Review: No mw canary servers in codfw - https://phabricator.wikimedia.org/T242606 (Dzahn)
[19:29:45] (PS1) Bartosz Dziewoński: Fix incorrect spellings of "RESTBase" in config variables (1/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/570409
[19:29:47] (PS1) Bartosz Dziewoński: Fix incorrect spellings of "RESTBase" in config variables (2/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/570410
[19:30:11] (CR) Bartosz Dziewoński: "@James I removed the problematic part from this commit, doing it separately in https://gerrit.wikimedia.org/r/570409 + https://gerrit.wiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/550535 (owner: Bartosz Dziewoński)
[19:32:12] (CR) Dzahn: [C: +2] "indeed does not seem to be used anywhere, also checked openstackbrowser" [puppet] - https://gerrit.wikimedia.org/r/570169 (https://phabricator.wikimedia.org/T173478) (owner: Legoktm)
[19:33:02] MatmaRex: Oh, hey. Sorry. Looking now.
[19:33:17] it's not important if you're working on something else now
[19:34:39] (CR) Jforrester: [C: +2] Clean up VisualEditor settings [mediawiki-config] - https://gerrit.wikimedia.org/r/550535 (owner: Bartosz Dziewoński)
[19:34:54] MatmaRex: Just the OOUI release and some UBN follow-ups. :-)
[19:35:06] MatmaRex: Trade for C+2 on https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/570412 ? ;-)
[19:35:11] (PS1) Dzahn: wmcs::toolsdb_secondary: fix a comment about what this class does [puppet] - https://gerrit.wikimedia.org/r/570414
[19:35:26] (PS2) Ammarpad: Add assigment of 'mover' group to bureaucrats on itwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/566925 (https://phabricator.wikimedia.org/T243503)
[19:35:34] (CR) Brion VIBBER: "Thanks Gilles!" [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/568646 (https://phabricator.wikimedia.org/T228467) (owner: Brion VIBBER)
[19:35:40] (Merged) jenkins-bot: Clean up VisualEditor settings [mediawiki-config] - https://gerrit.wikimedia.org/r/550535 (owner: Bartosz Dziewoński)
[19:36:31] (PS2) Dzahn: wmcs::toolsdb_secondary: fix a comment about what this class does [puppet] - https://gerrit.wikimedia.org/r/570414
[19:37:15] (CR) Dzahn: [C: +2] "comments only" [puppet] - https://gerrit.wikimedia.org/r/570414 (owner: Dzahn)
[19:38:04] Operations, vm-requests: VM requests for install_server replacements - https://phabricator.wikimedia.org/T244390 (MoritzMuehlenhoff)
[19:38:42] !log restart mjolnir-kafka-bulk-daemon across eqiad, daemons appear stuck and not reading new messages
[19:38:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:38:44] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Clean up VisualEditor settings (duration: 01m 07s)
[19:38:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:39:06] (PS3) Ammarpad: Add assignment of 'mover' group to
bureaucrats on itwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/566925 (https://phabricator.wikimedia.org/T243503)
[19:39:29] Operations, vm-requests: VM requests for install_server replacements - https://phabricator.wikimedia.org/T244390 (Dzahn) Hmm.. fair enough. That means my DNS change was not correct though, it defined public IPs as before.
[19:39:33] (CR) Muehlenhoff: "I think 4 is fine, in eqiad we currently have 1261-1265 e.g." [puppet] - https://gerrit.wikimedia.org/r/570405 (https://phabricator.wikimedia.org/T242606) (owner: Dzahn)
[19:39:52] Operations, Beta-Cluster-Infrastructure: Upgrade puppet in deployment-prep (Puppet agent broken in Beta Cluster) - https://phabricator.wikimedia.org/T243226 (jbond) @Krenair i think the package needed is `puppet-terminus-puppetdb` which is provided by the `puppetdb` source package. I have looked at buil...
[19:41:11] Operations, vm-requests: VM requests for install_server replacements - https://phabricator.wikimedia.org/T244390 (Dzahn)
[19:43:13] (CR) Mholloway: [C: +2] Remove handler deleted from the MachineVision extension on beta [mediawiki-config] - https://gerrit.wikimedia.org/r/566859 (https://phabricator.wikimedia.org/T241242) (owner: Matthias Mullie)
[19:43:35] (CR) Dzahn: [C: +2] admins: add Sakti Pramudya to ldap_only_admins (wmf) [puppet] - https://gerrit.wikimedia.org/r/567563 (https://phabricator.wikimedia.org/T243802) (owner: Dzahn)
[19:43:44] (PS2) Dzahn: admins: add Sakti Pramudya to ldap_only_admins (wmf) [puppet] - https://gerrit.wikimedia.org/r/567563 (https://phabricator.wikimedia.org/T243802)
[19:44:38] !log LDAP - added spramduya to wmf group (T243802)
[19:44:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:44:41] T243802: Request for LDAP access to the WMF group for Sakti Pramudya - https://phabricator.wikimedia.org/T243802
[19:45:42] (PS6) Ammarpad: Enable lead paragraph in user namespace on
nlwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/562486 (https://phabricator.wikimedia.org/T242030)
[19:46:07] Operations, LDAP-Access-Requests, Patch-For-Review: Request for LDAP access to the WMF group for Sakti Pramudya - https://phabricator.wikimedia.org/T243802 (Dzahn) Open→Resolved @SpramudyaDev You have been added to the "wmf" group. You should now be able to login with the same credentials use...
[19:46:35] (CR) Mholloway: [C: +2] "Oh, this is being held up on the wmf.18 backport, but shouldn't be since it's for labs. I'll force-merge." [mediawiki-config] - https://gerrit.wikimedia.org/r/566859 (https://phabricator.wikimedia.org/T241242) (owner: Matthias Mullie)
[19:48:27] Operations, SRE-Access-Requests, serviceops-radar, Core Platform Team Workboards (Clinic Duty Team): Onboarding Hugh Nowlan - https://phabricator.wikimedia.org/T242309 (Dzahn)
[19:49:52] Operations: Remove mobrovac@wikimedia.org from techcom@wikimedia.org - https://phabricator.wikimedia.org/T244146 (Dzahn) @jijiki On our end in private repo.
puppetmaster1001:/srv/private/modules/privateexim/files (already done though)
[19:50:22] (PS3) Ammarpad: Enable new user message for auto-created accounts on zh_classical wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/567306 (https://phabricator.wikimedia.org/T243509)
[19:55:52] (PS1) Jforrester: [nlwiki] Enable VisualEditor in the Project namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/570419 (https://phabricator.wikimedia.org/T159711)
[19:56:53] (CR) Muehlenhoff: [C: -1] "The email address is incomplete" [puppet] - https://gerrit.wikimedia.org/r/567563 (https://phabricator.wikimedia.org/T243802) (owner: Dzahn)
[19:58:47] Operations, Gerrit, vm-requests: Gerrit VM to test data migration - https://phabricator.wikimedia.org/T239151 (Dzahn)
[19:58:50] Operations, Gerrit: gerrit1002 running out of space - https://phabricator.wikimedia.org/T243808 (Dzahn)
[19:59:53] Operations, Gerrit: gerrit1002 running out of space - https://phabricator.wikimedia.org/T243808 (Dzahn) See T243983. I added a second disk to this VM, it's an additional 10GB and mounted on /srv/dbdump. Hope that does it.
[20:00:05] twentyafterfour and marxarelli: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Mediawiki train - American Version . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200205T2000).
[20:00:49] !log installing unzip security updates
[20:00:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:01:08] (PS3) Dzahn: admins: add Sakti Pramudya to ldap_only_admins (wmf) [puppet] - https://gerrit.wikimedia.org/r/567563 (https://phabricator.wikimedia.org/T243802)
[20:01:26] (CR) Dzahn: "thanks for the catch! fixed."
[puppet] - https://gerrit.wikimedia.org/r/567563 (https://phabricator.wikimedia.org/T243802) (owner: Dzahn)
[20:07:20] (PS1) Dzahn: Revert "add IP addresses for new install servers on buster" [dns] - https://gerrit.wikimedia.org/r/570423
[20:09:43] !log installing git security updates for jessie
[20:09:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:09:49] !log Preparing to deploy wmf/1.35.0-wmf.18 to group1 wikis refs T233866
[20:09:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:09:52] T233866: 1.35.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T233866
[20:14:41] (CR) Muehlenhoff: [C: +1] admins: add Sakti Pramudya to ldap_only_admins (wmf) [puppet] - https://gerrit.wikimedia.org/r/567563 (https://phabricator.wikimedia.org/T243802) (owner: Dzahn)
[20:18:24] PROBLEM - Host mw2311 is DOWN: PING CRITICAL - Packet loss = 100%
[20:21:05] !log joal@deploy1001 Started deploy [analytics/hdfs-tools/deploy@714e2d0]: Deploy bug fix version
[20:21:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:21:13] !log joal@deploy1001 Finished deploy [analytics/hdfs-tools/deploy@714e2d0]: Deploy bug fix version (duration: 00m 08s)
[20:21:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:25:17] !log mw1267 restarting php7.2-fpm
[20:25:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:25:52] !log dzahn@cumin1001 conftool action : set/weight=25; selector: name=mw1267.eqiad.wmnet
[20:25:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:29:08] ok I guess it's safe to go ahead with train for group1?
[20:29:46] Operations, Core Platform Team, Goal: Decommission the "session redis" cluster - https://phabricator.wikimedia.org/T243520 (daniel) If this is what MediaWiki's MainStash is using, then this is also used by chronology protector.
We'd have to move it to something else. Pinging @aaron for that.
[20:30:09] twentyafterfour: if that was for me, i think so, we just had issues with a single server
[20:30:56] Operations, ops-eqiad: Heating alerts for mw servers in eqiad - https://phabricator.wikimedia.org/T149287 (Dzahn) mw1267 was showing temperature issues today: https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&from=now-2d&to=now&fullscreen&panelId=25&var-server=mw1267&var-datasou...
[20:32:15] mutante: thanks, yeah everything seems stable going ahead with the train
[20:32:59] (PS1) 20after4: group1 wikis to 1.35.0-wmf.18 refs T233866 [mediawiki-config] - https://gerrit.wikimedia.org/r/570429
[20:33:03] (CR) 20after4: [C: +2] group1 wikis to 1.35.0-wmf.18 refs T233866 [mediawiki-config] - https://gerrit.wikimedia.org/r/570429 (owner: 20after4)
[20:34:09] (Merged) jenkins-bot: group1 wikis to 1.35.0-wmf.18 refs T233866 [mediawiki-config] - https://gerrit.wikimedia.org/r/570429 (owner: 20after4)
[20:34:51] !log dzahn@cumin1001 conftool action : set/weight=25; selector: name=mw1269.eqiad.wmnet
[20:34:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:37:35] !log joal@deploy1001 Started deploy [analytics/refinery@a47f0d5]: Analytics regular weekly deploy
[20:37:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:37:52] (CR) Volans: [C: +2] mediawiki: use cumin alias instead of role query [software/spicerack] - https://gerrit.wikimedia.org/r/570159 (https://phabricator.wikimedia.org/T243935) (owner: Volans)
[20:42:23] (Merged) jenkins-bot: mediawiki: use cumin alias instead of role query [software/spicerack] - https://gerrit.wikimedia.org/r/570159 (https://phabricator.wikimedia.org/T243935) (owner: Volans)
[20:42:25] (Merged) jenkins-bot: dnsdisc: fix typo in docstring [software/spicerack] - https://gerrit.wikimedia.org/r/570160 (owner: Volans)
[20:44:12]
!log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.35.0-wmf.18 refs T233866
[20:44:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:44:15] T233866: 1.35.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T233866
[20:45:19] !log twentyafterfour@deploy1001 Synchronized php: group1 wikis to 1.35.0-wmf.18 refs T233866 (duration: 01m 07s)
[20:45:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:46:46] (CR) Dzahn: [C: +2] admins: add Sakti Pramudya to ldap_only_admins (wmf) [puppet] - https://gerrit.wikimedia.org/r/567563 (https://phabricator.wikimedia.org/T243802) (owner: Dzahn)
[20:48:32] (CR) BryanDavis: [C: +1] dynamicproxy: urlproxy: introduce support for domain-based routing [puppet] - https://gerrit.wikimedia.org/r/565556 (https://phabricator.wikimedia.org/T234617) (owner: Arturo Borrero Gonzalez)
[20:48:53] PROBLEM - Check systemd state on ores1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:50:25] !log ores1004 - systemctl start celery-ores-worker
[20:50:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:50:44] RECOVERY - Check systemd state on ores1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:51:03] !log joal@deploy1001 Finished deploy [analytics/refinery@a47f0d5]: Analytics regular weekly deploy (duration: 13m 28s)
[20:51:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:51:18] !log joal@deploy1001 Started deploy [analytics/refinery@a47f0d5] (thin): Analytics regular weekly deploy
[20:51:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:51:25] !log joal@deploy1001 Finished deploy [analytics/refinery@a47f0d5] (thin): Analytics regular weekly deploy (duration: 00m 07s)
[20:51:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:52:03] RECOVERY - Disk space on notebook1004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=notebook1004&var-datasource=eqiad+prometheus/ops
[20:58:55] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[21:00:04] cscott, arlolra, subbu, halfak, and accraze: Dear deployers, time to do the Services – Graphoid / Parsoid / Citoid / ORES deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200205T2100).
[21:03:07] Operations, SRE-Access-Requests: Requesting access to sites from Google Search Console - https://phabricator.wikimedia.org/T244407 (CGlenn)
[21:03:07] Operations, SRE-Access-Requests: Requesting access to sites from Google Search Console - https://phabricator.wikimedia.org/T244407 (CGlenn)
[21:04:27] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[21:08:46] Operations, LDAP-Access-Requests: LDAP access to the wmf group for CherRaye Glenn - https://phabricator.wikimedia.org/T244410 (CGlenn)
[21:08:47] Operations, LDAP-Access-Requests: LDAP access to the wmf group for CherRaye Glenn - https://phabricator.wikimedia.org/T244410 (CGlenn)
[21:12:43] Operations: Remove mobrovac@wikimedia.org from techcom@wikimedia.org - https://phabricator.wikimedia.org/T244146 (jijiki) Thank you Daniel!
[21:12:43] Operations: Remove mobrovac@wikimedia.org from techcom@wikimedia.org - https://phabricator.wikimedia.org/T244146 (jijiki) Thank you Daniel!
[21:25:31] hmm
[21:25:44] is wikibugs reporting twice now?
[21:26:32] apparently so
[21:31:05] !log killing and restarting wikibugs, it was reporting each update twice
[21:31:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:33:46] mutante robbed my idea :P
[21:33:55] !log arlolra@deploy1001 Started deploy [parsoid/deploy@01d9d3d]: Updating Parsoid to 74730a3
[21:33:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:34:26] hauskatze: except it does not come back and is worse now ?:(
[21:34:42] mutante: yes it comes, when there's something to report
[21:35:05] wikibugs joins -cloud by default, and others when there's activity to report - and then stays
[21:35:32] and given that I don't have bash installed in this PC, I thank you for restarting the thing
[21:37:01] !log arlolra@deploy1001 Finished deploy [parsoid/deploy@01d9d3d]: Updating Parsoid to 74730a3 (duration: 03m 07s)
[21:37:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:37:11] it is working mutante
[21:38:11] Operations, Shinken: Make the Shinken IRC alert and icinga-wm bots use colors - https://phabricator.wikimedia.org/T113785 (Dzahn) test update
[21:38:19] hauskatze: ^ confirmed :)
[21:38:26] I told you :)
[21:39:47] Operations, Cloud-VPS (Debian Jessie Deprecation), cloud-services-team (Kanban): Migrate labmon* to Buster - https://phabricator.wikimedia.org/T224585 (bd808) Open→Resolved
[21:39:50] Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (bd808)
[21:40:27] Operations, Performance-Team, serviceops, Wikimedia-production-error: Wiki diffs take over 15s to load - https://phabricator.wikimedia.org/T244058 (jijiki) >>! In T244058#5851362, @Joe wrote: > Instead of caching, we should just rate-limit parsing of old revisions to N concurrent revisions per us...
[21:40:49] (CR) Cwhite: [C: +1] "I don't see anything particularly problematic.
LGTM" (1 comment) [software/puppet-compiler] - https://gerrit.wikimedia.org/r/557050 (owner: Jbond)
[21:45:45] Operations, Performance-Team, serviceops, Wikimedia-production-error: Wiki diffs take over 15s to load - https://phabricator.wikimedia.org/T244058 (jijiki)
[21:47:06] (CR) Ottomata: [C: +2] Update labstore mediawiki-history readme file [puppet] - https://gerrit.wikimedia.org/r/566822 (https://phabricator.wikimedia.org/T243426) (owner: Joal)
[21:48:02] Operations, ops-eqiad, DC-Ops, decommission: Decommission db1061.eqiad.wmnet - https://phabricator.wikimedia.org/T238624 (Jclark-ctr)
[21:48:07] (CR) Eevans: "We should probably wait until we better understand the 404 rate (https://grafana.wikimedia.org/d/000001590/sessionstore?orgId=1&from=15809" [mediawiki-config] - https://gerrit.wikimedia.org/r/570393 (https://phabricator.wikimedia.org/T243106) (owner: Ppchelko)
[21:50:55] Operations, ops-eqiad, DC-Ops, decommission: Decommission db1062.eqiad.wmnet - https://phabricator.wikimedia.org/T239188 (Jclark-ctr)
[21:56:54] (PS9) Joal: Add profile::analytics::refinery::job::import_wikidata_entites_dumps [puppet] - https://gerrit.wikimedia.org/r/567954 (https://phabricator.wikimedia.org/T209655)
[21:57:18] (CR) BryanDavis: [C: +2] Report error messages on stderr [software/tools-webservice] - https://gerrit.wikimedia.org/r/496565 (owner: BryanDavis)
[21:57:32] (CR) BryanDavis: [C: +2] Remove lighttpd-precise handling [software/tools-webservice] - https://gerrit.wikimedia.org/r/496566 (owner: BryanDavis)
[21:57:46] (CR) Joal: "@elukey: This is ready except for the need to choose whether to use 1st block syntax (variables) or 2nd block syntax (single-line)."
[puppet] - 10https://gerrit.wikimedia.org/r/567954 (https://phabricator.wikimedia.org/T209655) (owner: 10Joal)
[21:57:49] (03CR) 10BryanDavis: [C: 03+2] Improve support for extra_args [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/496567 (owner: 10BryanDavis)
[21:57:59] (03Merged) 10jenkins-bot: Report error messages on stderr [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/496565 (owner: 10BryanDavis)
[21:58:11] (03Merged) 10jenkins-bot: Remove lighttpd-precise handling [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/496566 (owner: 10BryanDavis)
[21:58:21] (03CR) 10BryanDavis: [C: 03+2] Rename internal "toollabs" package to "toolforge" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/563605 (owner: 10BryanDavis)
[21:58:30] (03Merged) 10jenkins-bot: Improve support for extra_args [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/496567 (owner: 10BryanDavis)
[21:58:34] (03CR) 10jerkins-bot: [V: 04-1] Rename internal "toollabs" package to "toolforge" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/563605 (owner: 10BryanDavis)
[22:01:22] (03CR) 10BryanDavis: [C: 03+2] Deprecate Jessie based Kubernetes types [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/565807 (owner: 10BryanDavis)
[22:01:22] (03CR) 10jerkins-bot: [V: 04-1] Rename internal "toollabs" package to "toolforge" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/563605 (owner: 10BryanDavis)
[22:01:24] (03CR) 10jerkins-bot: [V: 04-1] Deprecate Jessie based Kubernetes types [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/565807 (owner: 10BryanDavis)
[22:02:32] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: Decommission db1066.eqiad.wmnet - https://phabricator.wikimedia.org/T233071 (10Jclark-ctr)
[22:04:55] 10Operations, 10Gerrit-Privilege-Requests, 10SRE-Access-Requests: Request for +2 access to mediawiki-config - https://phabricator.wikimedia.org/T244389 (10Dzahn) a:03Dzahn
[22:07:26] !log Gerrit - added ppchelko to 'wmf-deployment' Gerrit group (he is already in deployment admin group) (T244389)
[22:07:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:07:29] T244389: Request for +2 access to mediawiki-config - https://phabricator.wikimedia.org/T244389
[22:10:31] 10Operations, 10Gerrit-Privilege-Requests, 10SRE-Access-Requests: Request for +2 access to mediawiki-config - https://phabricator.wikimedia.org/T244389 (10Dzahn) As @MarcoAurelio points out this normally goes together with getting the deployment admin group. Petr is already a member of that (and various oth...
[22:10:58] 10Operations, 10Gerrit-Privilege-Requests, 10SRE-Access-Requests: Request for +2 access to mediawiki-config - https://phabricator.wikimedia.org/T244389 (10Dzahn) 05Open→03Resolved @Pchelolo This should work now.
[22:18:12] (03CR) 10RLazarus: [C: 03+1] profile::mediawiki::php: raise number of workers on the canaries [puppet] - 10https://gerrit.wikimedia.org/r/570255 (owner: 10Giuseppe Lavagetto)
[22:24:32] (03CR) 10Papaul: [C: 03+1] Revert "add IP addresses for new install servers on buster" [dns] - 10https://gerrit.wikimedia.org/r/570423 (owner: 10Dzahn)
[22:26:05] (03CR) 10Dzahn: [C: 03+2] Revert "add IP addresses for new install servers on buster" [dns] - 10https://gerrit.wikimedia.org/r/570423 (owner: 10Dzahn)
[22:32:04] 10Operations, 10Patch-For-Review, 10Wikimedia-Incident: https://www.youtube.com/watch?v=_R47Cnv_cPs - https://phabricator.wikimedia.org/T244278 (10elhistorial) a:05CDanis→03RuyP
[22:33:25] ^ who has the tools to clean up phab vandalism?
[22:33:50] rlazarus: me
[22:34:11] blocked
[22:34:16] hero <3
[22:34:22] user already disabled
[22:34:55] 10Operations, 10Patch-For-Review, 10Wikimedia-Incident: Tracking task: 2020-02-04 kartotherian outage - https://phabricator.wikimedia.org/T244278 (10Reedy) a:05RuyP→03CDanis
[22:35:05] hauskatze: but we are not talking about an "undo" button ?
[22:35:26] mutante: Phabricator is not that fancy
[22:35:39] (03PS1) 10CDanis: add cdanis as super-user, also add 'next UID' tracker comment [homer/public] - 10https://gerrit.wikimedia.org/r/570437
[22:35:48] you'll need to do that by hand
[22:35:52] 10Operations, 10Patch-For-Review, 10Wikimedia-Incident: Tracking task: 2020-02-04 kartotherian outage - https://phabricator.wikimedia.org/T244278 (10Reedy)
[22:35:56] oh sorry, if I knew we were reverting by hand I'd've just done it
[22:36:00] thanks Reedy
[22:36:29] Phab spam is on the rise again :/
[22:37:00] hauskatze: there is more
[22:37:01] Did you reverted already? I was tuning spotify to do it :P
[22:39:21] (03CR) 10Ayounsi: [C: 03+1] add cdanis as super-user, also add 'next UID' tracker comment [homer/public] - 10https://gerrit.wikimedia.org/r/570437 (owner: 10CDanis)
[22:43:21] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission db1067.eqiad.wmnet - https://phabricator.wikimedia.org/T238297 (10Jclark-ctr)
[22:43:25] (03PS2) 10Clarakosi: Add restbase202[123] to hiera [puppet] - 10https://gerrit.wikimedia.org/r/570094 (https://phabricator.wikimedia.org/T244178)
[22:50:00] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission auth1001 - https://phabricator.wikimedia.org/T234909 (10Jclark-ctr)
[22:57:19] !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@a51f927]: Update mobileapps to a7928fa
[22:57:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:58:51] (03CR) 10Urbanecm: "Good point, didn't realize that" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/567306 (https://phabricator.wikimedia.org/T243509) (owner: 10Ammarpad)
[23:00:14] (03CR) 10Urbanecm: [C: 03+1] Enable new user message for auto-created accounts on zh_classical wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/567306 (https://phabricator.wikimedia.org/T243509) (owner: 10Ammarpad)
[23:03:22] (03PS7) 10BryanDavis: Rename internal "toollabs" package to "toolforge" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/563605
[23:03:24] (03PS4) 10BryanDavis: Deprecate Jessie based Kubernetes types [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/565807
[23:04:37] (03CR) 10BryanDavis: [C: 03+2] Rename internal "toollabs" package to "toolforge" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/563605 (owner: 10BryanDavis)
[23:05:20] (03Merged) 10jenkins-bot: Rename internal "toollabs" package to "toolforge" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/563605 (owner: 10BryanDavis)
[23:05:22] (03Merged) 10jenkins-bot: Deprecate Jessie based Kubernetes types [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/565807 (owner: 10BryanDavis)
[23:08:06] !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@a51f927]: Update mobileapps to a7928fa (duration: 10m 48s)
[23:08:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:08:32] 10Operations, 10Beta-Cluster-Infrastructure: Upgrade puppet in deployment-prep (Puppet agent broken in Beta Cluster) - https://phabricator.wikimedia.org/T243226 (10Krenair) ugh, ok
[23:15:05] (03PS1) 10Papaul: DHCP: Add wdqs200[7-8] to netboot.cfg and MAC address [puppet] - 10https://gerrit.wikimedia.org/r/570465 (https://phabricator.wikimedia.org/T242301)
[23:15:16] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission cp1008, cp1071, cp1072, cp1073, cp1074, cp1099 - https://phabricator.wikimedia.org/T229586 (10Jclark-ctr)
[23:18:12] (03PS1) 10Dzahn: add private IPs for new install servers [dns] - 10https://gerrit.wikimedia.org/r/570468 (https://phabricator.wikimedia.org/T224576)
[23:18:35] (03CR) 10jerkins-bot: [V: 04-1] add private IPs for new install servers [dns] - 10https://gerrit.wikimedia.org/r/570468 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn)
[23:19:42] (03CR) 10Dzahn: "why not buster?"
[puppet] - 10https://gerrit.wikimedia.org/r/570465 (https://phabricator.wikimedia.org/T242301) (owner: 10Papaul)
[23:23:00] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 10fundraising-tech-ops: decommission frav1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T222109 (10Jclark-ctr)
[23:27:39] 10Operations, 10ops-eqiad, 10decommission: Decommission neodymium - https://phabricator.wikimedia.org/T220503 (10Jclark-ctr)
[23:30:22] !log delete search indices duplicated on multiple clusters for: hywwiki, chrwiktionary, gcrwiki, mnwwiki, noboard_chapterswikimedia nqowiki nrmwiki outreachwiki and srnwiki
[23:30:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:31:52] 10Operations, 10Beta-Cluster-Infrastructure: Upgrade puppet in deployment-prep (Puppet agent broken in Beta Cluster) - https://phabricator.wikimedia.org/T243226 (10Krenair) Have put puppetmaster03 back on the old version and created puppetmaster04
[23:32:47] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 10Patch-For-Review: Decommission dbproxy1006.eqiad.wmnet - https://phabricator.wikimedia.org/T233207 (10Jclark-ctr)
[23:40:17] 10Operations, 10Discovery, 10Traffic, 10Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (10Dvorapa)
[23:41:06] 10Operations, 10Discovery, 10Traffic, 10Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (10Dvorapa)
[23:48:49] 10Operations, 10ops-eqiad, 10Dumps-Generation: (Need By Jan 25) rack/setup/install snapshot1010.eqiad.wmnet - https://phabricator.wikimedia.org/T241794 (10Jclark-ctr)
[23:49:41] 10Operations, 10Traffic, 10Inuka-Team (Kanban), 10MW-1.35-notes (1.35.0-wmf.16; 2020-01-21), 10Performance-Team (Radar): Code for InukaPageView instrumentation - https://phabricator.wikimedia.org/T238029
(10nshahquinn-wmf) 05Open→03Resolved I'm seeing events flowing into the production database, so I...