[00:00:05] RoanKattouw, Niharika, and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Evening SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200212T0000). [00:00:05] No GERRIT patches in the queue for this window AFAICS. [00:00:06] I've personally seen 2-3 OTRS tickets about it, plus the Discourse thread [00:00:12] AntiComposite: while I regret there's no real content there right now, soon https://phabricator.wikimedia.org/T244278 will be populated with stuff [00:00:26] There's been someone on IRC with poor English skills who is struggling to understand why it's broken and what is next [00:00:41] James_F: I'd rebase KartographerIconServer removal onto HEAD and get it merged [00:00:55] yeah, that's what I've been referring to, unfortunately it requires both technical knowledge and WM system knowledge to know what it means [00:01:00] Reedy: Psh, just merge my patches, coward. ;-) [00:01:02] indeed [00:01:59] (PS1) Reedy: Drop $wgKartographerIconServer [mediawiki-config] - https://gerrit.wikimedia.org/r/571600 [00:02:23] someone's filed an issue on the kartotherian repo on GitHub as well [00:02:45] The abandonware repo? [00:03:18] heh heh [00:03:29] well, not sure what yuri's doing with it, but the one we're no longer involved with ;) [00:03:30] https://github.com/kartotherian/kartotherian [00:03:36] Oh, the pre-fork one. [00:04:08] yep. i've been waiting for a bit more clarity before responding. [00:05:36] https://github.com/kartotherian/kartotherian/issues/121 [00:07:23] "Please share an update, as we need to be aware whether the project is still active or not." 
[00:07:33] Which is completely irrelevant as to whether a specific tileserver is accessible or not [00:07:59] https://wiki.openstreetmap.org/wiki/Tile_servers probably wants updating too [00:09:00] I can do that if there's a public update [00:13:42] (CR) Eric Gardner: [C: +1] "Thanks for posting this so quickly. Adding "localhost" URLs to the whitelist would be very helpful on my team at the moment." [puppet] - https://gerrit.wikimedia.org/r/571596 (https://phabricator.wikimedia.org/T244278) (owner: CDanis) [00:17:02] (CR) Reedy: [C: +2] Drop $wgKartographerIconServer [mediawiki-config] - https://gerrit.wikimedia.org/r/571600 (owner: Reedy) [00:17:10] (CR) Jforrester: [C: +1] Drop $wgKartographerIconServer [mediawiki-config] - https://gerrit.wikimedia.org/r/571600 (owner: Reedy) [00:18:22] (Merged) jenkins-bot: Drop $wgKartographerIconServer [mediawiki-config] - https://gerrit.wikimedia.org/r/571600 (owner: Reedy) [00:18:45] why is it always writing the tests... 
[00:20:07] !log reedy@deploy1001 Synchronized wmf-config/CommonSettings-labs.php: rm wgKartographerIconServer (duration: 01m 03s) [00:20:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:21:22] !log reedy@deploy1001 Synchronized wmf-config/CommonSettings.php: rm wgKartographerIconServer (duration: 01m 02s) [00:21:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:29:33] (PS4) CDanis: maps 429ing: allow localhost URLs, to unblock devel [puppet] - https://gerrit.wikimedia.org/r/571596 (https://phabricator.wikimedia.org/T244278) [00:32:03] (CR) CDanis: [C: +2] "0 tests failed, 0 tests skipped, 16 tests passed" [puppet] - https://gerrit.wikimedia.org/r/571596 (https://phabricator.wikimedia.org/T244278) (owner: CDanis) [00:33:15] !log commit full on cr1-eqsin - T243080 [00:33:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:33:19] T243080: Upgrade routers - https://phabricator.wikimedia.org/T243080 [00:34:32] (PS1) Ayounsi: Revert "Depool eqsin for router upgrade" [dns] - https://gerrit.wikimedia.org/r/571607 [00:38:10] !log reboot cr1-eqsin [00:38:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:38:17] (PS3) Jforrester: [doc-only] Clarify meaings of Wikibase Terms migration settings [mediawiki-config] - https://gerrit.wikimedia.org/r/571467 (owner: Tarrow) [00:38:23] (CR) Jforrester: [C: +2] [doc-only] Clarify meaings of Wikibase Terms migration settings [mediawiki-config] - https://gerrit.wikimedia.org/r/571467 (owner: Tarrow) [00:39:15] (Merged) jenkins-bot: [doc-only] Clarify meaings of Wikibase Terms migration settings [mediawiki-config] - https://gerrit.wikimedia.org/r/571467 (owner: Tarrow) [00:39:25] (CR) Jforrester: [C: +1] "Now that we're on 1.35.0-wmf.18 and look like we aren't going to roll back, this is safe to deploy." 
[mediawiki-config] - https://gerrit.wikimedia.org/r/567193 (https://phabricator.wikimedia.org/T242556) (owner: Ammarpad) [00:42:39] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 84, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:43:21] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:43:39] PROBLEM - OSPF status on mr1-eqsin is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:44:30] device is up but still initializing its interfaces [00:44:38] interfaces coming up [00:44:43] !log jforrester@deploy1001 Synchronized php-1.35.0-wmf.19/includes/page/ImageHistoryPseudoPager.php: T244937 ImageHistoryPseudoPager: Update doQuery() for IndexPager changes (duration: 01m 03s) [00:44:44] all interfaces up [00:44:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:45:04] T244937: [Regression 1.35.0-wmf.19] i/p/IndexPager.php:* PHP Warning: implode(): Invalid arguments passed - https://phabricator.wikimedia.org/T244937 [00:45:19] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:45:37] RECOVERY - OSPF status on mr1-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:46:04] snmp seems much happier [00:46:31] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 86, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:46:34] load avg of 2 or 3, but seems to be calming down [00:49:00] BGP is still converging, yeah [00:50:38] BGP: 520113 routes [00:50:44] BGP: 548732 routes 
[00:50:46] for v4 [00:51:30] cdanis@re0.cr1-eqsin> show route summary [00:51:32] error: the routing subsystem is not running [00:51:34] ??? [00:52:02] uh, I was running cr1-eqsin> show route summary | match BGP [00:52:05] so it's new [00:52:19] error: the routing subsystem is not responding to management requests [00:52:32] eqsin is still DNS depooled right? lol [00:52:36] yeah [00:53:23] BGP: 14377 routes, 13344 active [00:53:37] seems like rpd crashed and restarted [00:53:41] wtf [00:54:15] weird, is a full routing table too chonky for cr1 after the upgrade? [00:54:49] that's the symptoms at least [00:57:02] I disabled its local transit [00:57:11] but it will get a full view anyway from cr2 [00:58:12] RE crashed again [00:58:30] highest I saw for ipv4 routes while watching was BGP: 541475 routes, 250056 active [00:58:32] Feb 12 00:57:45 re0.cr1-eqsin rpd[2947]: JTASK_ASSERT: Assertion failed rpd[2947]: file "../../../../../../../../src/junos/usr.sbin/rpd/bgp/bgp_io.c", line 2854: "attr_len == apart->apfmt_len" [00:58:32] Feb 12 00:57:45 re0.cr1-eqsin rpd[2947]: JTASK_ABORT: abort rpd[2947] version 17.3R3-S6.3 built by builder on 2019-09-19 20:09:29 UTC: Invalid argument [00:58:34] this? [00:59:22] that sure looks like a crash [00:59:35] ayounsi@re0.cr1-eqsin> show system core-dumps [00:59:37] they're all there [00:59:53] heh [01:00:18] disabling transit6 as well so we don't spam the world [01:00:28] I have to go get dinner [01:00:36] if it's not already too late [01:00:53] that's not how I planned on spending my evening [01:00:54] ... 
[01:00:57] :( [01:01:14] I am going to get dinner as well, but I'll be around later, let me know if I can help [01:02:04] opening a JTAC case first [01:05:05] eqsin still depooled I assume [01:05:09] yes [01:06:40] yeah I wouldn't worry too much about it then, it's unfortunate but not an emergency [01:06:46] we can keep eqsin depooled and sort this out in the morning [01:07:58] filed a tracking task https://phabricator.wikimedia.org/T244944 [01:08:00] and now I'm really off [01:08:40] thx [01:10:35] attaching the core dump and going for dinner [01:11:12] will attach RSI later as I'm sure they will refuse to look at the coredump without an RSI... [01:48:25] !log disabling peering session on cr1-eqsin (they're flapping otherwise) [01:48:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:11:17] (PS1) Herron: logstash: add ES 7 compatible logstash template [puppet] - https://gerrit.wikimedia.org/r/571622 (https://phabricator.wikimedia.org/T234854) [04:05:11] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method= [04:06:37] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [04:07:47] PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [04:11:37] RECOVERY - High average POST latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [04:14:17] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [04:14:47] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [06:16:47] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission [06:16:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:16:55] !log marostegui@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) [06:16:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:17:31] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission [06:17:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:18:01] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [06:18:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:18:54] (PS1) Marostegui: site.pp: Remove dbproxy1001 puppet references [puppet] - https://gerrit.wikimedia.org/r/571628 (https://phabricator.wikimedia.org/T244463) [06:20:20] (CR) Marostegui: [C: +2] site.pp: Remove dbproxy1001 puppet references [puppet] - https://gerrit.wikimedia.org/r/571628 (https://phabricator.wikimedia.org/T244463) (owner: Marostegui) [06:22:15] (PS1) Marostegui: wmnet: Remove dbproxy1001 production DNS [dns] - https://gerrit.wikimedia.org/r/571629 (https://phabricator.wikimedia.org/T244463) [06:22:55] !log depool cp30[53-54] and reimage as buster - T242093 [06:22:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:22:59] T242093: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 [06:23:06] (CR) Marostegui: [C: +2] wmnet: Remove dbproxy1001 production DNS [dns] - https://gerrit.wikimedia.org/r/571629 (https://phabricator.wikimedia.org/T244463) (owner: Marostegui) [06:29:25] PROBLEM - ores uWSGI web app on 
ores1002 is CRITICAL: connect to address 10.64.0.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores [06:29:35] PROBLEM - ores uWSGI web app on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores [06:29:35] PROBLEM - MD RAID on ores1008 is CRITICAL: connect to address 10.64.48.27 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:29:39] PROBLEM - Check systemd state on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:29:39] PROBLEM - MD RAID on ores2008 is CRITICAL: connect to address 10.192.48.89 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:29:45] PROBLEM - MD RAID on ores1002 is CRITICAL: connect to address 10.64.0.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:29:45] PROBLEM - Check size of conntrack table on ores1008 is CRITICAL: connect to address 10.64.48.27 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:29:45] PROBLEM - Check size of conntrack table on ores1002 is CRITICAL: connect to address 10.64.0.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:29:53] PROBLEM - configured eth on ores2008 is CRITICAL: connect to address 10.192.48.89 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:29:55] PROBLEM - Disk space on ores1008 is CRITICAL: connect to address 10.64.48.27 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1008&var-datasource=eqiad+prometheus/ops [06:29:59] PROBLEM - Check systemd state on ores1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:30:01] PROBLEM - MD RAID on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:30:03] PROBLEM - Disk space on ores1002 is CRITICAL: connect to address 10.64.0.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1002&var-datasource=eqiad+prometheus/ops [06:30:05] PROBLEM - Check whether ferm is active by checking the default input chain on ores2008 is CRITICAL: connect to address 10.192.48.89 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:30:15] PROBLEM - Disk space on ores2008 is CRITICAL: connect to address 10.192.48.89 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2008&var-datasource=codfw+prometheus/ops [06:30:15] PROBLEM - configured eth on ores1002 is CRITICAL: connect to address 10.64.0.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:30:17] PROBLEM - Check size of conntrack table on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:30:23] PROBLEM - DPKG on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:30:25] PROBLEM - Check systemd state on ores1002 is CRITICAL: connect to address 
10.64.0.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:30:25] PROBLEM - puppet last run on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:30:31] PROBLEM - Disk space on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1005&var-datasource=eqiad+prometheus/ops [06:30:31] PROBLEM - dhclient process on ores1002 is CRITICAL: connect to address 10.64.0.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:30:35] PROBLEM - dhclient process on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:30:41] PROBLEM - DPKG on ores2008 is CRITICAL: connect to address 10.192.48.89 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:30:43] PROBLEM - Check size of conntrack table on ores2008 is CRITICAL: connect to address 10.192.48.89 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:30:45] PROBLEM - Check whether ferm is active by checking the default input chain on ores1002 is CRITICAL: connect to address 10.64.0.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:30:47] PROBLEM - dhclient process on ores2008 is CRITICAL: connect to address 10.192.48.89 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:30:49] PROBLEM - ores uWSGI web app on ores2008 is CRITICAL: connect to address 10.192.48.89 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores [06:30:51] PROBLEM - Check whether ferm is active by checking the 
default input chain on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:30:53] PROBLEM - DPKG on ores1002 is CRITICAL: connect to address 10.64.0.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:30:59] PROBLEM - Check systemd state on ores2008 is CRITICAL: connect to address 10.192.48.89 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:31:07] PROBLEM - Check systemd state on ores1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:31:09] PROBLEM - configured eth on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:31:31] RECOVERY - MD RAID on ores1008 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:31:41] RECOVERY - Check size of conntrack table on ores1008 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:31:49] RECOVERY - Disk space on ores1008 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1008&var-datasource=eqiad+prometheus/ops [06:33:41] RECOVERY - configured eth on ores2008 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:33:55] RECOVERY - Check whether ferm is active by checking the default input chain on ores2008 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:34:05] RECOVERY - Disk space on ores2008 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores2008&var-datasource=codfw+prometheus/ops [06:34:31] RECOVERY - DPKG on ores2008 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:34:33] RECOVERY - Check size of conntrack table on ores2008 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:34:37] RECOVERY - dhclient process on ores2008 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:34:41] PROBLEM - puppet last run on ores1002 is CRITICAL: connect to address 10.64.0.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:34:49] RECOVERY - Check systemd state on ores2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:35:25] RECOVERY - MD RAID on ores2008 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:37:29] PROBLEM - Check the NTP synchronisation status of timesyncd on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/NTP [06:39:17] RECOVERY - MD RAID on ores1002 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:39:19] RECOVERY - Check size of conntrack table on ores1002 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:39:35] RECOVERY - Disk space on ores1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1002&var-datasource=eqiad+prometheus/ops [06:39:49] RECOVERY - configured eth on ores1002 
is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:39:59] RECOVERY - Check systemd state on ores1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:40:05] RECOVERY - dhclient process on ores1002 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:40:19] RECOVERY - Check whether ferm is active by checking the default input chain on ores1002 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:40:27] RECOVERY - DPKG on ores1002 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:40:35] RECOVERY - puppet last run on ores1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:40:41] RECOVERY - Check systemd state on ores1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:43:01] PROBLEM - ores_workers_running on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/ORES [06:43:21] RECOVERY - Check systemd state on ores1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:46:11] !log Redact ngwikimedia on db1124:3313 and db2094:3313 T240772 [06:46:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:46:15] T240772: Prepare and check storage layer for ngwikimedia - https://phabricator.wikimedia.org/T240772 [06:46:50] RECOVERY - Check size of conntrack table on ores1005 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:46:54] RECOVERY - DPKG on ores1005 is OK: All packages OK 
https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:47:04] RECOVERY - Disk space on ores1005 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1005&var-datasource=eqiad+prometheus/ops [06:47:08] RECOVERY - dhclient process on ores1005 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:47:28] RECOVERY - Check whether ferm is active by checking the default input chain on ores1005 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:47:34] RECOVERY - puppet last run on ores1005 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:49:57] sigh [06:50:06] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [06:50:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:51:16] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [06:51:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:52:24] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [06:52:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:53:43] RECOVERY - Check systemd state on ores1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:54:12] (PS1) Elukey: profile::analytics::refinery::job::data_purge: add kerberos support [puppet] - https://gerrit.wikimedia.org/r/571634 [06:54:22] running puppet on ores to be sure that all is up [06:54:37] but it is the logrotate again [06:54:47] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [06:54:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:56:02] RECOVERY - MD RAID on 
ores1005 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:56:51] (CR) Elukey: [C: +2] profile::analytics::refinery::job::data_purge: add kerberos support [puppet] - https://gerrit.wikimedia.org/r/571634 (owner: Elukey) [07:00:08] RECOVERY - configured eth on ores1005 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [07:00:34] (PS1) Marostegui: report_users.sh: Remove dbproxy1001 IP [software] - https://gerrit.wikimedia.org/r/571637 [07:01:53] (CR) Marostegui: [C: +2] report_users.sh: Remove dbproxy1001 IP [software] - https://gerrit.wikimedia.org/r/571637 (owner: Marostegui) [07:02:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1107 with weight 20 for 10.4 testing - T242702', diff saved to https://phabricator.wikimedia.org/P10391 and previous config saved to /var/cache/conftool/dbconfig/20200212-070250-marostegui.json [07:02:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:02:55] T242702: Test MariaDB 10.4 in production - https://phabricator.wikimedia.org/T242702 [07:06:04] PROBLEM - Check systemd state on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:06:12] PROBLEM - configured eth on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [07:06:30] PROBLEM - DPKG on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [07:06:44] apergos: this one is for you, OOMed too [07:06:50] RECOVERY - ores_workers_running on ores1005 is OK: PROCS OK: 91 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [07:06:56] PROBLEM - Check 
whether ferm is active by checking the default input chain on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [07:06:57] Out of memory: Kill process 7819 (python3) [07:07:08] PROBLEM - dhclient process on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [07:07:10] I'll check in a second [07:07:23] somebody running a big jupyter kernel probably [07:07:26] PROBLEM - puppet last run on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [07:07:30] ah ha [07:07:52] RECOVERY - Check the NTP synchronisation status of timesyncd on ores1005 is OK: OK: synced at Wed 2020-02-12 07:07:51 UTC. https://wikitech.wikimedia.org/wiki/NTP [07:07:55] jupyterhub cannot allocate memory [07:08:07] eh? [07:08:12] PROBLEM - Disk space on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=notebook1004&var-datasource=eqiad+prometheus/ops [07:08:46] sorry for the wrong ping a.pergos ;) [07:08:49] the notebooks can when running kernels like python [07:08:57] it's fine :-) [07:09:08] elukey: the full error is [07:09:09] Feb 12 07:03:19 notebook1004 jupyterhub[26708]: [E 2020-02-12 07:03:19.662 JupyterHub ioloop:792] Exception in callback functools.partial(.null_wrapper at 0x7fc5587ad378>, ) [07:09:24] with a stacktrace that reports OSError: [Errno 12] Cannot allocate memory [07:09:26] RECOVERY - DPKG on notebook1004 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [07:09:49] yes yes will check in a second, people not respecting memory limits [07:09:52] RECOVERY - Check whether ferm is active by checking the 
default input chain on notebook1004 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [07:09:54] :D [07:10:06] RECOVERY - dhclient process on notebook1004 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [07:10:17] I added systemd slices for the users, but jupyter notebooks are created as systemd units under the system slice [07:10:22] so I suppose they escape [07:10:31] neverending game [07:10:36] (PS2) Giuseppe Lavagetto: profile::mediawiki::php: raise number of workers on the canaries [puppet] - https://gerrit.wikimedia.org/r/570255 [07:10:42] RECOVERY - configured eth on notebook1004 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [07:11:42] (CR) Giuseppe Lavagetto: [C: +2] profile::mediawiki::php: raise number of workers on the canaries [puppet] - https://gerrit.wikimedia.org/r/570255 (owner: Giuseppe Lavagetto) [07:12:19] (PS1) Elukey: role::analytics_cluster::coordinator: use kerberos for data_purge [puppet] - https://gerrit.wikimedia.org/r/571641 [07:12:59] (CR) Elukey: [C: +2] role::analytics_cluster::coordinator: use kerberos for data_purge [puppet] - https://gerrit.wikimedia.org/r/571641 (owner: Elukey) [07:13:04] RECOVERY - puppet last run on notebook1004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [07:16:45] !log oblivian@puppetmaster1001 conftool action : set/weight=20; selector: dc=eqiad,pool=appserver,service=nginx,name=mw12[3-5].* [07:16:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:18:45] !log oblivian@puppetmaster1001 conftool action : set/weight=30; selector: dc=eqiad,pool=appserver,name=mw132[3-4].* [07:18:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:18:49] RECOVERY - Check systemd 
state on notebook1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:28:18] !log pool cp30[53-54] running buster - T242093 [07:28:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:22] T242093: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 [07:40:29] (03PS1) 10Marostegui: site.pp: Productionize dbproxy1015 [puppet] - 10https://gerrit.wikimedia.org/r/571659 (https://phabricator.wikimedia.org/T202367) [07:41:45] (03PS2) 10Marostegui: site.pp: Productionize dbproxy1015 [puppet] - 10https://gerrit.wikimedia.org/r/571659 (https://phabricator.wikimedia.org/T202367) [07:46:29] PROBLEM - Host db1095 is DOWN: PING CRITICAL - Packet loss = 100% [07:46:43] uh [07:47:14] vgutierrez: part of any reimage in progress? [07:47:27] :? [07:47:32] I don't reimage db hosts [07:47:43] * volans wake up1 [07:47:53] marostegui: ^^ ? [07:47:54] sorry, apparently my brain is fried [07:48:05] it's an s2 slave [07:48:11] let me depool it [07:49:35] ok, it's not yet in MW config [07:50:01] s/yet// [07:50:34] so nothing to depool [07:50:50] checking [07:50:56] it is a backup host [07:50:59] a backup source [07:51:11] vgutierrez: how dare you reimaging databases [07:51:14] :D [07:52:01] (03PS1) 10Vgutierrez: Release 8.0.6-rc0-1wm1 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/571660 [07:52:13] (03CR) 10jerkins-bot: [V: 04-1] Release 8.0.6-rc0-1wm1 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/571660 (owner: 10Vgutierrez) [07:52:23] (03CR) 10Volans: [C: 03+2] sre.switchdc.mediawiki: adapt to current status [cookbooks] - 10https://gerrit.wikimedia.org/r/570131 (https://phabricator.wikimedia.org/T243316) (owner: 10Volans) [07:52:38] https://phabricator.wikimedia.org/T244958 [07:53:16] thx [07:53:35] 2/2 bad pings for me this morning, I should probably reboot myself [07:54:35] (03Merged) 10jenkins-bot: sre.switchdc.mediawiki: adapt to 
current status [cookbooks] - 10https://gerrit.wikimedia.org/r/570131 (https://phabricator.wikimedia.org/T243316) (owner: 10Volans) [07:55:47] (03PS1) 10Marostegui: db1095: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/571661 (https://phabricator.wikimedia.org/T244958) [07:56:50] volans: hope you don't crash-loop after restart like those eqsin Junipers :-) [07:56:52] (03CR) 10Marostegui: [C: 03+2] db1095: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/571661 (https://phabricator.wikimedia.org/T244958) (owner: 10Marostegui) [07:57:01] RECOVERY - Host db1095 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [07:57:07] moritzm: lol [07:58:01] (03CR) 10Volans: [C: 03+2] wmf-auto-reimage: update Phabricator API call [puppet] - 10https://gerrit.wikimedia.org/r/566876 (https://phabricator.wikimedia.org/T159045) (owner: 10Volans) [07:59:08] (03CR) 10Volans: [C: 03+2] "Merging and testing it" [puppet] - 10https://gerrit.wikimedia.org/r/566872 (https://phabricator.wikimedia.org/T159045) (owner: 10Volans) [08:00:26] (03PS1) 10Vgutierrez: install_server: Reimage {upload,text}@codfw as buster [puppet] - 10https://gerrit.wikimedia.org/r/571662 (https://phabricator.wikimedia.org/T242093) [08:02:27] (03PS2) 10Muehlenhoff: netmon: Switch to standard Partman recipes [puppet] - 10https://gerrit.wikimedia.org/r/571537 (https://phabricator.wikimedia.org/T156955) [08:11:44] (03CR) 10Muehlenhoff: [C: 03+2] netmon: Switch to standard Partman recipes [puppet] - 10https://gerrit.wikimedia.org/r/571537 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [08:15:51] (03CR) 10Ema: [C: 03+1] "Nice!" 
[debs/trafficserver] - 10https://gerrit.wikimedia.org/r/571660 (owner: 10Vgutierrez) [08:19:13] (03PS1) 10Fdans: dumps:Add license footer to all analytics dumps pages [puppet] - 10https://gerrit.wikimedia.org/r/571672 (https://phabricator.wikimedia.org/T244685) [08:20:13] (03CR) 10jerkins-bot: [V: 04-1] dumps:Add license footer to all analytics dumps pages [puppet] - 10https://gerrit.wikimedia.org/r/571672 (https://phabricator.wikimedia.org/T244685) (owner: 10Fdans) [08:21:17] !log installing mesa security updates [08:21:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:44] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 36 probes of 524 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [08:22:05] unexpected blank line 😑 [08:22:42] (03CR) 10Ema: [C: 03+1] install_server: Reimage {upload,text}@codfw as buster [puppet] - 10https://gerrit.wikimedia.org/r/571662 (https://phabricator.wikimedia.org/T242093) (owner: 10Vgutierrez) [08:24:50] (03PS2) 10Fdans: dumps:Add license footer to all analytics dumps pages [puppet] - 10https://gerrit.wikimedia.org/r/571672 (https://phabricator.wikimedia.org/T244685) [08:24:58] (03CR) 10Effie Mouzeli: [C: 03+1] "LGTM! 
one last tiny comment: In the subject of comments we usually add the puppet module or software this commit is about, so here we coul" [puppet] - 10https://gerrit.wikimedia.org/r/570094 (https://phabricator.wikimedia.org/T244178) (owner: 10Clarakosi) [08:30:52] (03PS3) 10DannyS712: Remove "Create a book" link on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/561403 (https://phabricator.wikimedia.org/T241683) [08:32:22] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 35 probes of 524 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [08:35:06] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 36 probes of 524 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [08:35:22] RECOVERY - Disk space on notebook1004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=notebook1004&var-datasource=eqiad+prometheus/ops [08:38:19] !log Restart wikibugs as it doesn't show phab comments on irc - T241109 [08:38:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:23] T241109: wikibugs needs restart almost everyday - https://phabricator.wikimedia.org/T241109 [08:38:26] 10Operations, 10ORES, 10Scoring-platform-team (Current): Ores celery OOM events in all hosts - https://phabricator.wikimedia.org/T242705 (10akosiaris) Seems like the deploy did not fix it after all. Most (if not all) hosts alerted this morning. It's evident in the graphs as well https://grafana.wikimedia.or... 
[08:38:55] 10Operations, 10Wikibugs: wikibugs needs restart almost everyday - https://phabricator.wikimedia.org/T241109 (10Marostegui) I have given this another restart and it started now to show comments again [08:41:13] (03CR) 10Alexandros Kosiaris: [C: 03+1] "LGTM. No no, no need to go with the star name, irc2001 is fine." [dns] - 10https://gerrit.wikimedia.org/r/571563 (https://phabricator.wikimedia.org/T244719) (owner: 10Dzahn) [08:41:36] 10Operations, 10Wikibugs: wikibugs needs restart almost everyday - https://phabricator.wikimedia.org/T241109 (10ema) 05Resolved→03Open >>! In T241109#5875447, @Marostegui wrote: > I have given this another restart and it started now to show comments again Reopening given that manual restarts are still nee... [08:44:37] 10Operations, 10Gerrit, 10Patch-For-Review: gerritro user getting access denied from gerrit1002 - https://phabricator.wikimedia.org/T243800 (10Marostegui) >>! In T243800#5874562, @Dzahn wrote: > @Marostegui I made some changes to make the db_user and db_pass configurable for gerrit. Thing is just i don't kn... 
[08:44:55] (03PS3) 10Elukey: add irc2001.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/571563 (https://phabricator.wikimedia.org/T244719) (owner: 10Dzahn) [08:46:39] !log phedenskog@deploy1001 Started deploy [performance/navtiming@9bbbb58]: (no justification provided) [08:46:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:44] !log phedenskog@deploy1001 Finished deploy [performance/navtiming@9bbbb58]: (no justification provided) (duration: 00m 05s) [08:46:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:12] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal: Add Itamar Givon to the ldap/wmde group - https://phabricator.wikimedia.org/T244148 (10ItamarWMDE) Thank you @Dzahn [08:48:21] (03CR) 10Elukey: [C: 03+2] add irc2001.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/571563 (https://phabricator.wikimedia.org/T244719) (owner: 10Dzahn) [08:51:12] 10Operations, 10Analytics, 10serviceops, 10vm-requests, and 2 others: Create a replacement for kraz.wikimedia.org - https://phabricator.wikimedia.org/T244719 (10elukey) ` elukey@ganeti2001:~$ sudo gnt-group list Group Nodes Instances AllocPolicy NDParams row_A 4 34 preferred ovs=False, ssh_po... 
[08:52:32] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 35 probes of 524 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [08:55:01] (03PS1) 10Muehlenhoff: Add library hint for mesa [puppet] - 10https://gerrit.wikimedia.org/r/571676 [08:55:22] !log elukey@cumin1001 START - Cookbook sre.ganeti.makevm [08:55:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:23] (03PS3) 10Marostegui: site.pp: Productionize dbproxy1015 [puppet] - 10https://gerrit.wikimedia.org/r/571659 (https://phabricator.wikimedia.org/T202367) [08:58:42] 10Operations, 10Analytics, 10serviceops, 10vm-requests, and 2 others: Create a replacement for kraz.wikimedia.org - https://phabricator.wikimedia.org/T244719 (10MoritzMuehlenhoff) Does this really need 8 GB RAM and 8 CPUs? The machine that this will replace (kraz) uses a single CPU (and hardly uses it) and... [08:58:43] (03CR) 10Marostegui: [C: 03+2] site.pp: Productionize dbproxy1015 [puppet] - 10https://gerrit.wikimedia.org/r/571659 (https://phabricator.wikimedia.org/T202367) (owner: 10Marostegui) [08:58:54] (03CR) 10Muehlenhoff: [C: 03+2] Add library hint for mesa [puppet] - 10https://gerrit.wikimedia.org/r/571676 (owner: 10Muehlenhoff) [08:59:41] (03PS1) 10Ema: ATS: add profile::trafficserver::tls::http_settings to labs hieradata [puppet] - 10https://gerrit.wikimedia.org/r/571678 [09:00:56] 10Operations, 10Analytics, 10serviceops, 10vm-requests, and 2 others: Create a replacement for kraz.wikimedia.org - https://phabricator.wikimedia.org/T244719 (10elukey) >>! In T244719#5875487, @MoritzMuehlenhoff wrote: > Does this really need 8 GB RAM and 8 CPUs? The machine that this will replace (kraz) u... 
[09:01:05] 10Operations, 10Analytics, 10serviceops, 10vm-requests, and 2 others: Create a replacement for kraz.wikimedia.org - https://phabricator.wikimedia.org/T244719 (10elukey) [09:02:18] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 38 probes of 524 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [09:08:06] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 35 probes of 524 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [09:11:00] (03CR) 10Effie Mouzeli: [C: 03+2] mediawiki: Add reqId/file/line to php7-fatal-error.php's 'message' field [puppet] - 10https://gerrit.wikimedia.org/r/554599 (owner: 10Krinkle) [09:11:31] !log Upgrade and reboot dbproxy1013 before making it master - T202367 [09:11:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:36] T202367: Productionize dbproxy101[2-7].eqiad.wmnet and dbproxy200[1-4] - https://phabricator.wikimedia.org/T202367 [09:13:45] !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [09:13:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:03] (03PS1) 10Marostegui: wmnet: Promote dbproxy1013 as m2-master [dns] - 10https://gerrit.wikimedia.org/r/571679 (https://phabricator.wikimedia.org/T202367) [09:16:13] (03PS12) 10Ema: ATS: add webrequest logging for atskafka [puppet] - 10https://gerrit.wikimedia.org/r/562535 (https://phabricator.wikimedia.org/T237993) [09:17:35] (03CR) 10Marostegui: [C: 03+2] wmnet: Promote dbproxy1013 as m2-master [dns] - 10https://gerrit.wikimedia.org/r/571679 (https://phabricator.wikimedia.org/T202367) (owner: 10Marostegui) [09:17:57] !log Failover m2 master dbproxy 
from dbproxy1007 to dbproxy1013 - T202367 [09:18:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:01] T202367: Productionize dbproxy101[2-7].eqiad.wmnet and dbproxy200[1-4] - https://phabricator.wikimedia.org/T202367 [09:18:01] 10Operations, 10Analytics, 10serviceops, 10vm-requests, 10User-Elukey: Create a replacement for kraz.wikimedia.org - https://phabricator.wikimedia.org/T244719 (10elukey) ` elukey@cumin1001:~$ sudo cookbook sre.ganeti.makevm codfw_B --link public --memory 8 --disk 40 --vcpus 4 irc2001.wikimedia.org START... [09:18:19] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: add ES 7 compatible logstash template [puppet] - 10https://gerrit.wikimedia.org/r/571622 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [09:18:35] (03CR) 10Ema: [C: 03+2] ATS: add webrequest logging for atskafka [puppet] - 10https://gerrit.wikimedia.org/r/562535 (https://phabricator.wikimedia.org/T237993) (owner: 10Ema) [09:22:45] (03CR) 10Filippo Giunchedi: "LGTM, see nit inline, also would be nice to get a PCC run" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/571548 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [09:25:08] (03CR) 10Filippo Giunchedi: "LGTM overall, see inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/571554 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [09:25:30] !log rolling restart of cassandra on restbase-dev to pick up Java security updates [09:25:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:05] !log cp4027: ats-tls-restart to enable analytics logging to pipe T237993 [09:26:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:11] T237993: Create replacement for Varnishkafka - https://phabricator.wikimedia.org/T237993 [09:27:39] (03PS1) 10Elukey: Introduce irc2001.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/571680 (https://phabricator.wikimedia.org/T244719) 
[09:28:48] (03CR) 10Elukey: [C: 03+2] Introduce irc2001.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/571680 (https://phabricator.wikimedia.org/T244719) (owner: 10Elukey) [09:29:29] ah snap there is a mistake [09:30:19] (03PS1) 10Giuseppe Lavagetto: profile::services_proxy: allow using envoy [puppet] - 10https://gerrit.wikimedia.org/r/571682 [09:31:19] (03CR) 10jerkins-bot: [V: 04-1] profile::services_proxy: allow using envoy [puppet] - 10https://gerrit.wikimedia.org/r/571682 (owner: 10Giuseppe Lavagetto) [09:31:20] (03PS1) 10Elukey: install_server: fix dhcp config for irc2001 [puppet] - 10https://gerrit.wikimedia.org/r/571683 (https://phabricator.wikimedia.org/T244719) [09:31:47] _joe_ your patch is great, I'll use it for the analytics hosts when ready [09:32:28] (03CR) 10Elukey: [C: 03+2] install_server: fix dhcp config for irc2001 [puppet] - 10https://gerrit.wikimedia.org/r/571683 (https://phabricator.wikimedia.org/T244719) (owner: 10Elukey) [09:33:30] <_joe_> elukey: which patch? 
[09:33:35] * _joe_ is confused [09:34:40] _joe_ the one for profile::services_proxy, but it might not be the profile that I use [09:34:54] <_joe_> I hope you don't :D [09:35:06] <_joe_> also the patch is completely wrong, that's why it's marked wip [09:35:23] <_joe_> it's just a leftover from when I first tried to fix services proxy [09:35:54] it is profile::tlsproxy::service, I have a task to move to envoy [09:35:55] :) [09:36:05] confused the names for a second [09:38:37] !log cp: rolling ats-tls-restart to enable analytics logging T237993 [09:38:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:40] T237993: Create replacement for Varnishkafka - https://phabricator.wikimedia.org/T237993 [09:39:51] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10Volans) [09:43:12] (03Abandoned) 10Ema: ATS: add profile::trafficserver::tls::http_settings to labs hieradata [puppet] - 10https://gerrit.wikimedia.org/r/571678 (owner: 10Ema) [09:43:49] (03PS1) 10Elukey: install_server: add partition config for irc2001 [puppet] - 10https://gerrit.wikimedia.org/r/571685 (https://phabricator.wikimedia.org/T244719) [09:44:11] 10Operations, 10serviceops, 10Patch-For-Review: Test gutter pool failover in production and memcached 1.5.x - https://phabricator.wikimedia.org/T240684 (10jijiki) [09:46:10] 10Operations, 10serviceops, 10Patch-For-Review: Test gutter pool failover in production and memcached 1.5.x - https://phabricator.wikimedia.org/T240684 (10jijiki) [09:46:31] 10Operations, 10ops-eqiad: Degraded RAID on db1095 - https://phabricator.wikimedia.org/T244972 (10ops-monitoring-bot) [09:46:46] this was me testing the latest changes to the raid_handler ^^^ [09:46:49] haha [09:46:52] I was like: what? 
[09:46:55] And then I saw the IGNORE [09:46:55] XD [09:46:58] (03PS2) 10Elukey: install_server: add partition config for irc2001 [puppet] - 10https://gerrit.wikimedia.org/r/571685 (https://phabricator.wikimedia.org/T244719) [09:47:42] 10Operations, 10ops-eqiad: Degraded RAID on db1095 - https://phabricator.wikimedia.org/T244972 (10Volans) 05Open→03Invalid This was a test, sorry for the noise. [09:47:50] marostegui: I think that this deserves a cumin cumin [09:47:52] ok all seems to work fine, task opened fine [09:47:57] hahahaha [09:47:58] 10Operations, 10serviceops, 10Patch-For-Review: Test gutter pool failover in production and memcached 1.5.x - https://phabricator.wikimedia.org/T240684 (10jijiki) @elukey thank you for unblocking this !!! [09:48:19] (03CR) 10Elukey: [C: 03+2] install_server: add partition config for irc2001 [puppet] - 10https://gerrit.wikimedia.org/r/571685 (https://phabricator.wikimedia.org/T244719) (owner: 10Elukey) [09:55:49] (03CR) 10Effie Mouzeli: [C: 03+2] hieradata: fix new lines in monitoring.yaml [puppet] - 10https://gerrit.wikimedia.org/r/570062 (owner: 10Effie Mouzeli) [09:56:04] (03PS2) 10Effie Mouzeli: hieradata: fix new lines in monitoring.yaml [puppet] - 10https://gerrit.wikimedia.org/r/570062 [10:01:14] !log depool cp30[51-52] and reimage as buster - T242093 [10:01:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:30] T242093: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 [10:02:08] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp3051.esams.wmnet ` The log can be found in `/var/log/wmf-auto-reima... 
[10:02:49] (03CR) 10Vgutierrez: [C: 03+2] install_server: Reimage {upload,text}@codfw as buster [puppet] - 10https://gerrit.wikimedia.org/r/571662 (https://phabricator.wikimedia.org/T242093) (owner: 10Vgutierrez) [10:05:51] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp3052.esams.wmnet ` The log can be found in `/var/log/wmf-auto-reima... [10:06:21] !log depool cp20[23,26] and reimage as buster - T242093 [10:06:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:58] (03PS1) 10Elukey: admin: add kerberos flag for user foks [puppet] - 10https://gerrit.wikimedia.org/r/571687 (https://phabricator.wikimedia.org/T244773) [10:08:50] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp2023.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reima... [10:09:47] (03CR) 10Elukey: [C: 03+2] admin: add kerberos flag for user foks [puppet] - 10https://gerrit.wikimedia.org/r/571687 (https://phabricator.wikimedia.org/T244773) (owner: 10Elukey) [10:12:19] !log testing trafficserver 8.0.6-rc0 in cp40[26,32] [10:12:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:38] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp2026.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202002121014_vgutie... 
[10:22:17] 10Operations, 10DBA: db1095 backup source crashed: broken BBU - https://phabricator.wikimedia.org/T244958 (10jcrespo) created /srv/sqldata.s2 on db1140 and ran: ` transfer.py --type=decompress --no-encrypt --no-checksum dbprov1002.eqiad.wmnet:/srv/backups/snapshots/latest/snapshot.s2.2020-02-12--01-22-05.tar.... [10:22:59] (03PS1) 10Vgutierrez: Revert "ATS: Disable KA on cp4031" [puppet] - 10https://gerrit.wikimedia.org/r/571688 (https://phabricator.wikimedia.org/T244464) [10:23:14] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [10:23:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:17] (03CR) 10jerkins-bot: [V: 04-1] Revert "ATS: Disable KA on cp4031" [puppet] - 10https://gerrit.wikimedia.org/r/571688 (https://phabricator.wikimedia.org/T244464) (owner: 10Vgutierrez) [10:25:26] (03PS2) 10Vgutierrez: Revert "ATS: Disable KA on cp4031" [puppet] - 10https://gerrit.wikimedia.org/r/571688 (https://phabricator.wikimedia.org/T244464) [10:25:28] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:25:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:10] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [10:27:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:25] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2023.codfw.wmnet'] ` Of which those **FAILED**: ` ['cp2023.codfw.wmnet'] ` [10:28:19] (03CR) 10Ema: [C: 03+1] Revert "ATS: Disable KA on cp4031" [puppet] - 10https://gerrit.wikimedia.org/r/571688 (https://phabricator.wikimedia.org/T244464) (owner: 10Vgutierrez) [10:28:27] (03CR) 10Vgutierrez: [C: 03+2] Revert "ATS: Disable KA on cp4031" [puppet] - 10https://gerrit.wikimedia.org/r/571688 (https://phabricator.wikimedia.org/T244464) (owner: 10Vgutierrez) [10:29:04] !log 
vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [10:29:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:28] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:29:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:56] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [10:31:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:48] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:31:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:21] !log Enable KA between ats-tls and varnish-fe on cp4031 - T244464 [10:32:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:26] T244464: Investigate side-effects of enabling KA between ats-tls and varnish-fe - https://phabricator.wikimedia.org/T244464 [10:34:02] !log bouncing ferm on ganeti1016, failed to start after boot [10:34:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:06] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3051.esams.wmnet'] ` Of which those **FAILED**: ` ['cp3051.esams.wmnet'] ` [10:34:11] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:34:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:10] (03PS1) 10Ema: vcl: merge wikimedia-common into wikimedia-frontend [puppet] - 10https://gerrit.wikimedia.org/r/571689 (https://phabricator.wikimedia.org/T241239) [10:35:45] (03PS1) 10Vgutierrez: ATS: Don't assume that http_settings is mandatory on ats-tls profile [puppet] - 10https://gerrit.wikimedia.org/r/571690 (https://phabricator.wikimedia.org/T244464) [10:35:47] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - 
https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2026.codfw.wmnet'] ` and were **ALL** successful. [10:36:53] !log pool cp2023 running buster - T242093 [10:36:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:57] T242093: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 [10:37:22] (03CR) 10Vgutierrez: "pcc looks happy: https://puppet-compiler.wmflabs.org/compiler1002/20749/" [puppet] - 10https://gerrit.wikimedia.org/r/571690 (https://phabricator.wikimedia.org/T244464) (owner: 10Vgutierrez) [10:38:28] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3052.esams.wmnet'] ` Of which those **FAILED**: ` ['cp3052.esams.wmnet'] ` [10:39:06] (03CR) 10jerkins-bot: [V: 04-1] ATS: Don't assume that http_settings is mandatory on ats-tls profile [puppet] - 10https://gerrit.wikimedia.org/r/571690 (https://phabricator.wikimedia.org/T244464) (owner: 10Vgutierrez) [10:40:28] (03PS2) 10Vgutierrez: ATS: Don't assume that http_settings is mandatory on ats-tls profile [puppet] - 10https://gerrit.wikimedia.org/r/571690 (https://phabricator.wikimedia.org/T244464) [10:41:49] (03PS2) 10Ema: vcl: merge wikimedia-common into wikimedia-frontend [puppet] - 10https://gerrit.wikimedia.org/r/571689 (https://phabricator.wikimedia.org/T241239) [10:42:49] !log pool cp2026 running buster - T242093 [10:42:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:53] T242093: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 [10:42:56] (03CR) 10jerkins-bot: [V: 04-1] vcl: merge wikimedia-common into wikimedia-frontend [puppet] - 10https://gerrit.wikimedia.org/r/571689 (https://phabricator.wikimedia.org/T241239) (owner: 10Ema) [10:44:04] (03CR) 10Ema: [C: 03+1] "Tested in labs" [puppet] - 
10https://gerrit.wikimedia.org/r/571690 (https://phabricator.wikimedia.org/T244464) (owner: 10Vgutierrez) [10:44:45] (03CR) 10Vgutierrez: [C: 03+2] ATS: Don't assume that http_settings is mandatory on ats-tls profile [puppet] - 10https://gerrit.wikimedia.org/r/571690 (https://phabricator.wikimedia.org/T244464) (owner: 10Vgutierrez) [10:45:49] !log depool cp20[19,25] and reimage as buster - T242093 [10:45:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:26] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp2019.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202002121046_vgutie... [10:47:12] (03PS3) 10Ema: vcl: merge wikimedia-common into wikimedia-frontend [puppet] - 10https://gerrit.wikimedia.org/r/571689 (https://phabricator.wikimedia.org/T241239) [10:48:16] (03CR) 10jerkins-bot: [V: 04-1] vcl: merge wikimedia-common into wikimedia-frontend [puppet] - 10https://gerrit.wikimedia.org/r/571689 (https://phabricator.wikimedia.org/T241239) (owner: 10Ema) [10:48:56] 10Operations, 10SRE-Access-Requests: Requesting access to stat1007 for jmorgan - https://phabricator.wikimedia.org/T244785 (10jbond) @Capt_Swing Thanks i had also asked for your sshkey but i see you already have a shell account so you are just requesting the additional membership of `analytics-privatedata-user... [10:49:09] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp2025.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202002121048_vgutie... 
[10:49:24] !log pool cp30[51,52] running buster - T242093 [10:49:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:28] T242093: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 [10:50:23] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10Vgutierrez) [10:50:32] !log depool cp3050 and reimage as buster - T242093 [10:50:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:25] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp3050.esams.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202002121051_vgutie... [10:52:10] (03PS4) 10Ema: vcl: merge wikimedia-common into wikimedia-frontend [puppet] - 10https://gerrit.wikimedia.org/r/571689 (https://phabricator.wikimedia.org/T241239) [10:55:17] (03PS1) 10Jcrespo: backups: Migrate db1095 backup source instances (s2, s3) to db1140 [puppet] - 10https://gerrit.wikimedia.org/r/571693 (https://phabricator.wikimedia.org/T244958) [11:00:52] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [11:00:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:10] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:03:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:35] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [11:03:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:52] PROBLEM - Disk space on notebook1004 is CRITICAL: DISK CRITICAL - free space: /srv 5227 MB (3% inode=77%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=notebook1004&var-datasource=eqiad+prometheus/ops [11:05:37] It seems like css loading has been sketchy for me at the moment... [11:05:51] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:05:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:15] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2025.codfw.wmnet'] ` Of which those **FAILED**: ` ['cp2025.codfw.wmnet'] ` [11:08:30] (03PS1) 10Jcrespo: backups: Disable db1140 notifications [puppet] - 10https://gerrit.wikimedia.org/r/571696 (https://phabricator.wikimedia.org/T244958) [11:08:47] (03PS1) 10Ema: vcl: simplify backend listing [puppet] - 10https://gerrit.wikimedia.org/r/571697 (https://phabricator.wikimedia.org/T241239) [11:09:09] (03CR) 10Marostegui: [C: 03+1] backups: Migrate db1095 backup source instances (s2, s3) to db1140 [puppet] - 10https://gerrit.wikimedia.org/r/571693 (https://phabricator.wikimedia.org/T244958) (owner: 10Jcrespo) [11:10:25] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2019.codfw.wmnet'] ` and were **ALL** successful. 
[11:11:11] !log reimage logstash2026 to test new standard RAID0 partman recipe [11:11:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:15] 10Operations, 10SRE-Access-Requests: Requesting access to ops group for dwisehaupt - https://phabricator.wikimedia.org/T244901 (10jbond) [11:11:50] (03CR) 10jenkins-bot: [V: 04-1] vcl: simplify backend listing [puppet] - 10https://gerrit.wikimedia.org/r/571697 (https://phabricator.wikimedia.org/T241239) (owner: 10Ema) [11:12:17] (03CR) 10Ema: "pcc here: https://puppet-compiler.wmflabs.org/compiler1001/20750/" [puppet] - 10https://gerrit.wikimedia.org/r/571689 (https://phabricator.wikimedia.org/T241239) (owner: 10Ema) [11:12:59] I kind of suspect that https://phabricator.wikimedia.org/T244985 might be a regression from most recent deploy [11:13:58] 10Operations, 10SRE-Access-Requests: Requesting access to ops group for dwisehaupt - https://phabricator.wikimedia.org/T244901 (10jbond) @mepps are you able to approve this request @Dwisehaupt are you able to provide an ssh public key noting that "This ssh key pair should only be used for WMF cluster access,... [11:14:07] 10Operations, 10SRE-Access-Requests: Requesting access to ops group for dwisehaupt - https://phabricator.wikimedia.org/T244901 (10jbond) p:05Triage→03Medium [11:14:16] !log pool cp2019 running buster - T242093 [11:14:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:20] T242093: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 [11:14:57] 10Operations, 10DBA, 10Patch-For-Review: db1095 backup source crashed: broken BBU - https://phabricator.wikimedia.org/T244958 (10jcrespo) Now running: ` transfer.py --type=decompress --no-encrypt --no-checksum dbprov1002.eqiad.wmnet:/srv/backups/snapshots/latest/snapshot.s3.2020-02-12--05-46-09.tar.gz db1140... 
[11:15:27] !log depool cp2016 and reimage as buster - T242093 [11:15:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:00] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp2016.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202002121115_vgutie... [11:16:27] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [11:16:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:21] (03CR) 10Jbond: [C: 03+2] puppet_compiler: add rich_data support (031 comment) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/557050 (https://phabricator.wikimedia.org/T236481) (owner: 10Jbond) [11:17:54] (03Merged) 10jenkins-bot: puppet_compiler: add rich_data support [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/557050 (https://phabricator.wikimedia.org/T236481) (owner: 10Jbond) [11:17:58] !log pool cp2025 running buster - T242093 [11:18:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:21] !log depool cp2024 and reimage as buster - T242093 [11:18:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:00] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp2024.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202002121118_vgutie... 
[11:19:38] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10Vgutierrez) [11:19:45] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:19:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:04] volans: we have an icinga alert for the SMART device on cloudvirt1009 T244986 but I can't find what's wrong [11:23:05] T244986: cloudvirt1009: Device not healthy -SMART- - https://phabricator.wikimedia.org/T244986 [11:23:07] 10Operations, 10ops-eqiad, 10cloud-services-team (Hardware): cloudvirt1009: Device not healthy -SMART- - https://phabricator.wikimedia.org/T244986 (10aborrero) [11:23:22] could you please help me understand what's going on? [11:23:53] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3050.esams.wmnet'] ` Of which those **FAILED**: ` ['cp3050.esams.wmnet'] ` [11:24:18] (03PS5) 10Ema: vcl: merge wikimedia-common into wikimedia-frontend [puppet] - 10https://gerrit.wikimedia.org/r/571689 (https://phabricator.wikimedia.org/T241239) [11:24:22] So I'm getting pages on mediawiki.org loading like 1 out of every 5-ish times with no css [11:24:33] But I can't exactly figure out what's wrong with them [11:24:48] Like the network tab in firefox isn't reporting any errors or things not loaded [11:25:33] Although it is reporting that https://www.mediawiki.org/w/load.php?lang=en&modules=startup&only=scripts&raw=1&skin=timeless has a size of 0 bytes [11:25:49] But if I look at the response tab, it clearly has the JS payload there [11:25:52] * bawolff confused [11:27:21] 10Operations, 10Analytics, 10serviceops, 10vm-requests, 10User-Elukey: Create a replacement for kraz.wikimedia.org - https://phabricator.wikimedia.org/T244719 (10elukey) Ok current status: * irc2001.wikimedia.org is running * puppet is set to 
role::system::spare, waiting for a new role/cluster combinati... [11:27:25] bawolff: what does the X-Cache response header say? [11:27:55] cp4027 miss, cp4032 hit/2 [11:28:09] And its a 304 not modified response [11:28:23] but other 304's report the size of the cached payload not literally 0 [11:28:48] vgutierrez: you're testing ats 8.0.6-rc0 on cp4032, right? ^ [11:28:57] (03PS1) 10Jbond: update version 0.6.0 [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/571700 [11:29:01] ema: yes [11:29:21] (03CR) 10Jbond: [V: 03+2 C: 03+2] update version 0.6.0 [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/571700 (owner: 10Jbond) [11:30:03] Not to mention it looks like all the css is missing, and that url is a js file [11:30:18] uh [11:30:27] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [11:30:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:29] bawolff: so how sane request looks? [11:31:37] *how a sane request looks? :) [11:32:28] In terms of size, they report reasonable sizes (like the main css reports 20.28kb) [11:32:45] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:32:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:51] and has x-cache of cp4029 miss, cp4032 hit/2 [11:33:22] one of the css requests is marked as (raced) [11:33:26] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [11:33:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:38] bawolff: https://www.mediawiki.org/w/load.php?lang=en&modules=startup&only=scripts&raw=1&skin=timeless that's the CSS url? 
[11:34:50] no, that's the javascript url [11:35:22] The two css urls in the network tab are https://www.mediawiki.org/w/load.php?lang=en&modules=ext.gadget.site-styles&only=styles&skin=timeless and [11:35:25] https://www.mediawiki.org/w/load.php?lang=en&modules=ext.echo.styles.badge%7Cext.uls.pt%7Cext.visualEditor.desktopArticleTarget.noscript%7Cext.wikimediaBadges%7Cmediawiki.legacy.commonPrint%2Cshared%7Cmediawiki.skinning.content.externallinks%7Cmediawiki.ui.button%7Coojs-ui.styles.icons-alerts%7Cskins.timeless&only=styles&skin=timeless [11:35:33] ack [11:35:42] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:35:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:46] so in those you're getting cl: 0 if I understood correctly [11:36:16] I'm only getting the 0 byte report for the javascript url. Which is weird. The css clearly is not loaded [11:36:32] I'm honestly very confused what is happening [11:36:56] hmmm [11:36:58] so [11:37:01] https://www.mediawiki.org/w/load.php?lang=en&modules=ext.gadget.site-styles&only=styles&skin=timeless [11:37:10] vgutierrez: can we try depooling the two hosts with rc0 and see if bawolff can reproduce? [11:37:11] None of them literally have a content-length header. Its just what firefox reports the response size as [11:37:11] that URL for me on esams is returning JS as well [11:37:23] ema: sure [11:37:26] And if i go directly to the url they return stuff [11:37:34] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2016.codfw.wmnet'] ` and were **ALL** successful. 
[11:37:35] !log depooling cp[4026,4032] [11:37:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:11] done [11:39:05] !log pool cp3050 running buster - T242093 [11:39:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:09] T242093: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 [11:39:20] bawolff: can you still reproduce the issue? [11:40:23] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2024.codfw.wmnet'] ` and were **ALL** successful. [11:40:36] Seems to have gone away [11:40:54] It wasn't happening on every request in the first place, but i just tried loading a lot of pages, and it seems to have stopped [11:41:09] 10Operations, 10ops-eqiad, 10cloud-services-team (Hardware): cloudvirt1009: Device not healthy -SMART- - https://phabricator.wikimedia.org/T244986 (10aborrero) 05Open→03Resolved a:03aborrero Apparently the issue resolved itself: {F31609733} [11:41:42] hmm [11:41:46] I'm missing something about this issue [11:41:48] vgutierrez@cp3052:~$ curl -v -s "https://www.mediawiki.org/w/load.php?lang=en&modules=startup&only=scripts&raw=1&skin=timeless" -o /dev/null [11:41:57] < content-length: 61684 [11:42:03] < content-type: text/javascript; charset=utf-8 [11:42:09] but apparently that should be CSS? [11:42:22] x-cache looks like this: < x-cache: cp3062 miss, cp3052 hit/1 [11:42:23] No, that is correctly javascript [11:42:36] so which URL provides CSS? 
:_) [11:42:47] (03PS1) 10Jbond: puppet_compiler: bump the version [puppet] - 10https://gerrit.wikimedia.org/r/571703 [11:42:55] On the page, JS and CSS were broken, but that url in firefox network tab for JS was showing up weirdly [11:43:16] https://www.mediawiki.org/w/load.php?lang=en&modules=ext.gadget.site-styles&only=styles&skin=timeless is the css one [11:43:27] thx [11:43:43] But as a url, it was working fine for me as far as i can tell, and firefox network tab showed it loading fine. There was just no css on the page [11:44:42] As far as i can tell, all the network transfer stuff worked fine (Or if something went wrong, its non-obvious). But the page was still broken [11:44:52] yup [11:44:59] vgutierrez@cp4032:~$ curl -v -s "https://www.mediawiki.org/w/load.php?lang=en&modules=ext.gadget.site-styles&only=styles&skin=timeless" -o /dev/null 2>&1|egrep "(content-length|content-type|x-cache)" [11:44:59] < content-type: text/css; charset=utf-8 [11:44:59] < x-content-type-options: nosniff [11:44:59] < x-cache: cp4029 miss, cp4032 hit/2 [11:44:59] < x-cache-status: hit-front [11:45:00] < content-length: 20765 [11:45:12] same result as cp3052 [11:45:23] different x-cache of course [11:45:23] :) [11:47:20] (03PS1) 10Ladsgroup: Triple the factor of WDQS lag to maxlag for Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571705 (https://phabricator.wikimedia.org/T244722) [11:48:03] I wonder if its some sort of bug in firefox. 
According to dev console not a single css rule was applied, but network tab says that css file was loaded [11:48:45] Well seemed to stop happening in any case [11:49:28] bawolff: ok, let me repool the suspicious hosts and see if you're able to reproduce the issue again [11:49:36] !log repooling cp40[26,32] [11:49:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:03] done [11:50:22] try to reproduce it again and we'd love a full capture of your requests to debug the issue [11:51:17] I have a HAR archive from the original time, if that helps [11:51:38] yep [11:51:48] try to reproduce again though :) [11:53:46] So yeah, doesn't seem to be happening anymore [11:53:53] !log mangle sessionstore on mw1331 so that it is unreachable. Testing for T243106 [11:53:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:57] T243106: Phased rollout of sessionstore to production fleet - https://phabricator.wikimedia.org/T243106 [11:54:36] bawolff: weird, could you upload/send us (ema and me) the HAR? [11:56:56] vgutierrez: https://phabricator.wikimedia.org/F31609899 [11:57:51] !log pool cp20[19,24] running buster - T242093 [11:57:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:55] T242093: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 [11:57:59] (03CR) 10Jbond: [C: 03+2] puppet_compiler: bump the version [puppet] - 10https://gerrit.wikimedia.org/r/571703 (owner: 10Jbond) [11:57:59] bawolff: thx <3 [11:59:05] And just playing around with dev console, it looks like some of the js loaded, as there is a mw object there [12:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200212T1200). 
[12:00:05] kart_ and Amir1: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:16] * kart_ is here.. [12:00:42] although, the dev console also shows the css in the style tab (but not the inspector tab), so idk [12:01:06] I'll go with my patch.. [12:01:16] !log depool cp20[16,22] and reimage as buster - T242093 [12:01:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:30] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10Vgutierrez) [12:01:42] (03PS2) 10KartikMistry: Enable CX out of beta in bs and mk WPs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571412 (https://phabricator.wikimedia.org/T244139) [12:02:41] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp2022.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202002121202_vgutie... [12:04:00] (03PS6) 10Ema: vcl: merge wikimedia-common into wikimedia-frontend [puppet] - 10https://gerrit.wikimedia.org/r/571689 (https://phabricator.wikimedia.org/T241239) [12:04:46] (03CR) 10KartikMistry: [C: 03+2] Enable CX out of beta in bs and mk WPs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571412 (https://phabricator.wikimedia.org/T244139) (owner: 10KartikMistry) [12:05:57] 10Operations, 10Research, 10Traffic, 10Patch-For-Review: Set up git-driven static microsite for wikiworkshop.org - https://phabricator.wikimedia.org/T242374 (10bmansurov) @BBlack the site has been updated. Please turn on the DNS. 
[12:06:01] (03Merged) 10jenkins-bot: Enable CX out of beta in bs and mk WPs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571412 (https://phabricator.wikimedia.org/T244139) (owner: 10KartikMistry) [12:06:38] !log pool cp2016 running buster - T242093 [12:06:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:43] T242093: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 [12:07:35] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10Vgutierrez) [12:08:18] !log depool cp2013 and reimage as buster - T242093 [12:08:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:53] bawolff, ema: I cannot see anything wrong on the HAR... ema could you check it? [12:09:05] BTW since FF 71 you can import HAR files :) [12:09:55] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp2013.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202002121209_vgutie... [12:10:32] o/ [12:11:43] (03PS12) 10Giuseppe Lavagetto: Configure forensic logging of Apache requests; enable on beta [puppet] - 10https://gerrit.wikimedia.org/r/511751 (owner: 10Ori.livneh) [12:11:59] Amir1: finishing deploy soon.. 
[12:12:07] All good [12:12:52] !log kartik@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit|571412|Enable ContentTranslation out of beta in bs and mk WPs (T244139, T244140)]] (duration: 01m 15s) [12:12:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:57] T244139: Enable Content Translation in Bosnian Wikipedia as a default tool - https://phabricator.wikimedia.org/T244139 [12:12:57] T244140: Enable Content Translation in Macedonian Wikipedia as a default tool - https://phabricator.wikimedia.org/T244140 [12:13:56] Amir1: done. Go ahead with your patch.. [12:14:03] kart_: thanks! [12:14:29] (03PS2) 10Ladsgroup: Triple the factor of WDQS lag to maxlag for Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571705 (https://phabricator.wikimedia.org/T244722) [12:15:21] (03CR) 10Ladsgroup: [C: 03+2] Triple the factor of WDQS lag to maxlag for Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571705 (https://phabricator.wikimedia.org/T244722) (owner: 10Ladsgroup) [12:16:17] (03Merged) 10jenkins-bot: Triple the factor of WDQS lag to maxlag for Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571705 (https://phabricator.wikimedia.org/T244722) (owner: 10Ladsgroup) [12:17:08] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [12:17:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:39] (03PS7) 10Ema: vcl: merge wikimedia-common into wikimedia-frontend [puppet] - 10https://gerrit.wikimedia.org/r/571689 (https://phabricator.wikimedia.org/T241239) [12:19:22] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:571705|Triple the factor of WDQS lag to maxlag for Wikidata (T244722)]] (duration: 01m 04s) [12:19:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:26] T244722: increase factor for query service that is taken into account for maxlag - 
https://phabricator.wikimedia.org/T244722 [12:19:27] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:19:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:46] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2022.codfw.wmnet'] ` Of which those **FAILED**: ` ['cp2022.codfw.wmnet'] ` [12:21:36] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:571705|Triple the factor of WDQS lag to maxlag for Wikidata (T244722)]], take II, the cache issue (duration: 01m 03s) [12:21:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:13] (03PS8) 10Ema: vcl: merge wikimedia-common into wikimedia-frontend [puppet] - 10https://gerrit.wikimedia.org/r/571689 (https://phabricator.wikimedia.org/T241239) [12:24:20] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [12:24:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:10] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Configure forensic logging of Apache requests; enable on beta [puppet] - 10https://gerrit.wikimedia.org/r/511751 (owner: 10Ori.livneh) [12:26:38] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:26:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:24] (03PS1) 10MarcoAurelio: Localise $wgMetaNamespace for mywiki and mywiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571712 [12:32:26] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2013.codfw.wmnet'] ` and were **ALL** successful. 
[12:34:34] (03PS2) 10MarcoAurelio: Localise $wgMetaNamespace for mywiki and mywiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571712 (https://phabricator.wikimedia.org/T244980) [12:34:55] !log pool cp20[13,22] running buster - T242093 [12:34:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:59] T242093: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 [12:35:42] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10Vgutierrez) [12:35:57] !log depool cp20[12,20] and reimage as buster - T242093 [12:35:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:29] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp2012.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202002121236_vgutie... [12:39:08] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp2020.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202002121239_vgutie... 
[12:41:35] (03PS1) 10Jbond: puppet_compile: remove submodule support [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/571713 [12:42:52] 10Operations, 10netops: eqsin (1) MX204 router - https://phabricator.wikimedia.org/T245000 (10ayounsi) p:05Triage→03Medium [12:45:58] (03PS1) 10Giuseppe Lavagetto: mediawiki: disable forensic logs on the canaries [puppet] - 10https://gerrit.wikimedia.org/r/571715 [12:46:36] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: disable forensic logs on the canaries [puppet] - 10https://gerrit.wikimedia.org/r/571715 (owner: 10Giuseppe Lavagetto) [12:50:55] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [12:50:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:02] !log cr1-eqsin RE failover - T244944 [12:53:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:05] T244944: cr1-eqsin routing engine crashlooping after JunOS upgrade - https://phabricator.wikimedia.org/T244944 [12:53:10] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:53:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:34] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [12:53:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:52] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:55:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:43] (03CR) 10Lucas Werkmeister (WMDE): "What’s the status of this? Is it blocked on anything?" 
[puppet] - 10https://gerrit.wikimedia.org/r/526757 (https://phabricator.wikimedia.org/T222321) (owner: 10Smalyshev) [13:05:15] !log pool cp20[12,20] running buster - T242093 [13:05:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:21] T242093: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 [13:05:55] !log depool cp20[10,18] and reimage as buster - T242093 [13:05:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:37] (03PS1) 10Muehlenhoff: Remove DNS entries for bohrium [dns] - 10https://gerrit.wikimedia.org/r/571718 [13:08:35] !log uploaded libapache2-mod-auth-cas 1.2-1~deb8u1 for jessie-wikimedia to apt.wikimedia.org [13:08:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:06] (03CR) 10Muehlenhoff: [C: 03+2] Remove DNS entries for bohrium [dns] - 10https://gerrit.wikimedia.org/r/571718 (owner: 10Muehlenhoff) [13:10:46] !log upgrading debdeploy fleet-wide to 0.0.99.13 [13:10:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:12] (03PS1) 10Vgutierrez: Revert "Revert "ATS: Disable KA on cp4031"" [puppet] - 10https://gerrit.wikimedia.org/r/571719 [13:11:44] (03PS2) 10Vgutierrez: Revert "Revert "ATS: Disable KA on cp4031"" [puppet] - 10https://gerrit.wikimedia.org/r/571719 (https://phabricator.wikimedia.org/T244464) [13:13:06] (03CR) 10Vgutierrez: [C: 03+2] Revert "Revert "ATS: Disable KA on cp4031"" [puppet] - 10https://gerrit.wikimedia.org/r/571719 (https://phabricator.wikimedia.org/T244464) (owner: 10Vgutierrez) [13:15:52] !log disabling KA between ats-tls and varnish-fe on cp4031 - T244464 [13:15:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:56] T244464: Investigate side-effects of enabling KA between ats-tls and varnish-fe - https://phabricator.wikimedia.org/T244464 [13:16:31] fyi, re1:cr1-eqsin is upgraded and re0 is in the process right now 
[13:16:51] <3 [13:17:06] (03PS9) 10Ema: vcl: merge wikimedia-common into wikimedia-frontend [puppet] - 10https://gerrit.wikimedia.org/r/571689 (https://phabricator.wikimedia.org/T241239) [13:18:08] !log setting up db1140 under maintenance (upgrade, reboot, disable alerts) [13:18:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:52] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [13:20:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:31] !log Restart wikibugs as phab comments aren't showing up on irc - T241109 [13:21:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:35] T241109: wikibugs needs restart almost everyday - https://phabricator.wikimedia.org/T241109 [13:22:28] (03PS2) 10Jcrespo: backups: Migrate db1095 backup source instances (s2, s3) to db1140 [puppet] - 10https://gerrit.wikimedia.org/r/571693 (https://phabricator.wikimedia.org/T244958) [13:22:33] !log cr1-eqsin RE failover (final) - T244944 [13:22:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:37] T244944: cr1-eqsin routing engine crashlooping after JunOS upgrade - https://phabricator.wikimedia.org/T244944 [13:22:51] Not ready for mastership switch, try after 38 secs. [13:23:08] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:23:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:46] !log mangle sessionstore on mw1331, mw1348 so that it times out instead of returning TCP RSTs. 
Testing for T243106 [13:23:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:50] T243106: Phased rollout of sessionstore to production fleet - https://phabricator.wikimedia.org/T243106 [13:24:38] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [13:24:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:36] (03CR) 10Jcrespo: [C: 03+2] backups: Migrate db1095 backup source instances (s2, s3) to db1140 [puppet] - 10https://gerrit.wikimedia.org/r/571693 (https://phabricator.wikimedia.org/T244958) (owner: 10Jcrespo) [13:26:48] alright BGP: 73199 routes and going (slowly) up [13:26:51] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:26:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:09] (03PS2) 10Jcrespo: backups: Disable db1140 notifications [puppet] - 10https://gerrit.wikimedia.org/r/571696 (https://phabricator.wikimedia.org/T244958) [13:27:59] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2010.codfw.wmnet'] ` and were **ALL** successful. [13:28:05] (03PS10) 10Ema: vcl: merge wikimedia-common into wikimedia-frontend [puppet] - 10https://gerrit.wikimedia.org/r/571689 (https://phabricator.wikimedia.org/T241239) [13:28:21] (03CR) 10Jcrespo: [C: 03+2] backups: Disable db1140 notifications [puppet] - 10https://gerrit.wikimedia.org/r/571696 (https://phabricator.wikimedia.org/T244958) (owner: 10Jcrespo) [13:32:40] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2018.codfw.wmnet'] ` and were **ALL** successful. 
[13:33:21] alright v4 is done, v6 is going up, so looks like it's fixed [13:34:01] Oh well, i got the css failure again :S [13:34:06] (03CR) 10Ema: "cache_text tests:" [puppet] - 10https://gerrit.wikimedia.org/r/571689 (https://phabricator.wikimedia.org/T241239) (owner: 10Ema) [13:36:32] !log re-enable transit/peering on cr1-eqsin - T244944 [13:36:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:36] T244944: cr1-eqsin routing engine crashlooping after JunOS upgrade - https://phabricator.wikimedia.org/T244944 [13:37:44] Also, i had a response with x-cache: cp4028 miss, cp4032 pass, but an Age: 16 [13:37:54] Shouldn't non-cached responses not have an age header? [13:39:28] !log revert sessionstore on mw1331, mw1348 so that it times out instead of returning TCP RSTs. Testing for T243106 [13:39:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:32] T243106: Phased rollout of sessionstore to production fleet - https://phabricator.wikimedia.org/T243106 [13:39:47] (03PS2) 10Ema: vcl: simplify backend listing [puppet] - 10https://gerrit.wikimedia.org/r/571697 (https://phabricator.wikimedia.org/T241239) [13:40:07] 10Operations, 10DBA, 10Wikimedia-Etherpad: Upgrade and restart m1 master (db1135) - https://phabricator.wikimedia.org/T244238 (10Marostegui) @jcrespo @akosiaris any tentative date? [13:40:57] 10Operations, 10netops: cr1-eqsin routing engine crashlooping after JunOS upgrade - https://phabricator.wikimedia.org/T244944 (10ayounsi) 05Open→03Resolved Looks solved. [13:41:22] 10Operations, 10DBA, 10Phabricator, 10Release-Engineering-Team (Development services): Upgrade and restart m3 (phabricator) master (db1128) - https://phabricator.wikimedia.org/T244566 (10Marostegui) 12th (the original date I suggested) has passed, any tentative date @mmodell you'd like to consider, there i... 
[13:41:43] (03PS3) 10Ema: vcl: simplify backend listing [puppet] - 10https://gerrit.wikimedia.org/r/571697 (https://phabricator.wikimedia.org/T241239) [13:42:53] 10Operations, 10DBA, 10Wikimedia-Etherpad: Upgrade and restart m1 master (db1135) - https://phabricator.wikimedia.org/T244238 (10jcrespo) Sorry, I thought I had answered, but I apparently I did not hit submit. Any time during the UTC day, outside of the first 1 week of a month is ok for bacula. Preferably,... [13:43:02] I guess the age header just reflects how long the backend took to respond, and for some reason that request the backend took an absurdly long time [13:43:41] 10Operations: Integrate Stretch 9.12 point update - https://phabricator.wikimedia.org/T244695 (10MoritzMuehlenhoff) [13:44:41] 10Operations, 10DBA, 10Wikimedia-Etherpad: Upgrade and restart m1 master (db1135) - https://phabricator.wikimedia.org/T244238 (10Marostegui) Let's aim for Thursday 20th at 09:00AM UTC? [13:45:10] 10Operations, 10DBA, 10Wikimedia-Etherpad: Upgrade and restart m1 master (db1135) - https://phabricator.wikimedia.org/T244238 (10akosiaris) >>! In T244238#5876636, @Marostegui wrote: > @jcrespo @akosiaris any tentative date? Anytime is good for etherpad! [13:46:13] 10Operations, 10DBA, 10Wikimedia-Etherpad: Upgrade and restart m1 master (db1135) - https://phabricator.wikimedia.org/T244238 (10jcrespo) >>! In T244238#5876670, @Marostegui wrote: > Let's aim for Thursday 20th at 09:00AM UTC? Cool to me, send some invites this way! :-D [13:46:21] I also got the no css loaded on enwikipedia, so its not just mediawiki.org. It seems to be happening rather rarely now, like one out of every 30 page loads [13:47:24] 10Operations, 10DBA, 10Wikimedia-Etherpad: Upgrade and restart m1 master (db1135) - https://phabricator.wikimedia.org/T244238 (10Marostegui) >>! In T244238#5876672, @jcrespo wrote: >>>! In T244238#5876670, @Marostegui wrote: >> Let's aim for Thursday 20th at 09:00AM UTC? > > Cool to me, send some invites th... 
[13:54:40] 10Operations, 10DBA, 10Wikimedia-Etherpad: Upgrade and restart m1 master (db1135) - https://phabricator.wikimedia.org/T244238 (10Marostegui) Email: https://lists.wikimedia.org/pipermail/wikitech-l/2020-February/093063.html [13:55:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1107 after 10.4 testing - T242702', diff saved to https://phabricator.wikimedia.org/P10393 and previous config saved to /var/cache/conftool/dbconfig/20200212-135514-marostegui.json [13:55:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:19] T242702: Test MariaDB 10.4 in production - https://phabricator.wikimedia.org/T242702 [14:00:35] !log pool cp20[10,18] running buster - T242093 [14:00:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:39] T242093: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 [14:01:05] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10Vgutierrez) [14:03:55] 10Operations, 10Traffic, 10Patch-For-Review: Investigate side-effects of enabling KA between ats-tls and varnish-fe - https://phabricator.wikimedia.org/T244464 (10Vgutierrez) 05Open→03Stalled ATS is having issues handling properly the connect and the TTFB timeout when KA is enabled and parent proxies are... 
[14:03:58] 10Operations, 10Traffic: ulsfo varnish-fe vcache processes overflow on FDs - https://phabricator.wikimedia.org/T243634 (10Vgutierrez) [14:16:50] (03PS1) 10Ottomata: Fix alias in otto .bash_aliases [puppet] - 10https://gerrit.wikimedia.org/r/571723 [14:16:59] 10Operations, 10Core Platform Team, 10MediaWiki-Cache, 10serviceops: WanObjectCache::getWithSetCallback seems not to set objects when fetching data is slow - https://phabricator.wikimedia.org/T244877 (10jbond) p:05Triage→03Medium [14:17:50] 10Operations, 10DBA: db1095 backup source crashed: broken BBU - https://phabricator.wikimedia.org/T244958 (10jcrespo) eqiad backup service has been restored on a different host, now to handle hw issues. [14:20:07] (03CR) 10Ottomata: [C: 03+2] Fix alias in otto .bash_aliases [puppet] - 10https://gerrit.wikimedia.org/r/571723 (owner: 10Ottomata) [14:20:41] (03PS1) 10Jcrespo: Revert "backups: Disable db1140 notifications" [puppet] - 10https://gerrit.wikimedia.org/r/571725 [14:21:38] (03PS1) 10Ottomata: Fix stats.wikimedia.org/v2 redirect [puppet] - 10https://gerrit.wikimedia.org/r/571726 (https://phabricator.wikimedia.org/T237752) [14:23:00] (03CR) 10Ottomata: [C: 03+2] Fix stats.wikimedia.org/v2 redirect [puppet] - 10https://gerrit.wikimedia.org/r/571726 (https://phabricator.wikimedia.org/T237752) (owner: 10Ottomata) [14:23:15] (03PS1) 10Ema: varnish::instance: remove 'layer' [puppet] - 10https://gerrit.wikimedia.org/r/571727 (https://phabricator.wikimedia.org/T241239) [14:26:42] arturo: I'll reply on task [14:27:42] (03CR) 10CDanis: [C: 03+1] Revert "Depool eqsin for router upgrade" [dns] - 10https://gerrit.wikimedia.org/r/571607 (owner: 10Ayounsi) [14:28:44] (03PS1) 10Elukey: admin: add kerberos flag to user fsalutari [puppet] - 10https://gerrit.wikimedia.org/r/571729 (https://phabricator.wikimedia.org/T245024) [14:29:29] (03CR) 10Elukey: [C: 03+2] admin: add kerberos flag to user fsalutari [puppet] - 10https://gerrit.wikimedia.org/r/571729 
(https://phabricator.wikimedia.org/T245024) (owner: 10Elukey) [14:29:53] 10Operations, 10ops-eqiad, 10cloud-services-team (Hardware): cloudvirt1009: Device not healthy -SMART- - https://phabricator.wikimedia.org/T244986 (10Volans) 05Resolved→03Open Re-opening as it's currently alerting. It's all documented in the link that is associated with the alert itself: https://wikitec... [14:29:57] arturo: ^^^ [14:30:39] (03CR) 10Ayounsi: [C: 03+2] Revert "Depool eqsin for router upgrade" [dns] - 10https://gerrit.wikimedia.org/r/571607 (owner: 10Ayounsi) [14:30:43] (03PS2) 10Ayounsi: Revert "Depool eqsin for router upgrade" [dns] - 10https://gerrit.wikimedia.org/r/571607 [14:31:16] (03PS2) 10Jcrespo: Revert "backups: Disable db1140 notifications" [puppet] - 10https://gerrit.wikimedia.org/r/571725 [14:31:29] !log reimage logstash2026 to test new standard RAID0 partman recipe [14:31:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:50] !log repool eqsin [14:34:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:45] 10Operations, 10netops: Upgrade routers - https://phabricator.wikimedia.org/T243080 (10ayounsi) [14:36:02] PROBLEM - Host logstash2026 is DOWN: PING CRITICAL - Packet loss = 100% [14:37:12] ^ reimage, downtime race [14:37:54] RECOVERY - Host logstash2026 is UP: PING OK - Packet loss = 0%, RTA = 36.29 ms [14:42:44] 10Operations, 10ops-eqiad, 10DC-Ops, 10User-Zppix, 10cloud-services-team (Hardware): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (10JHedden) 05Open→03Resolved Closing this for now, I'll open it back up if it fails again. 
[14:44:40] (03PS1) 10Ayounsi: Ignore MX104 [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/571731 [14:44:47] (03CR) 10Jcrespo: [C: 03+2] Revert "backups: Disable db1140 notifications" [puppet] - 10https://gerrit.wikimedia.org/r/571725 (owner: 10Jcrespo) [14:45:06] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on icinga1001 is CRITICAL: 48.02 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [14:47:10] (03PS1) 10Muehlenhoff: Add IDP service definition for Tendril [puppet] - 10https://gerrit.wikimedia.org/r/571732 [14:58:50] ACKNOWLEDGEMENT - Device not healthy -SMART- on cloudvirt1009 is CRITICAL: cluster=wmcs device=cciss,17 instance=cloudvirt1009:9100 job=node site=eqiad Jhedden T244986 https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cloudvirt1009&var-datasource=eqiad+prometheus/ops [15:01:31] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/571732 (owner: 10Muehlenhoff) [15:02:08] !log depool cp20[07,17] and reimage as buster - T242093 [15:02:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:13] T242093: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 [15:03:07] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp2007.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202002121502_vgutie... 
[15:03:44] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 2 (vega, ...), Fresh: 84 jobs https://wikitech.wikimedia.org/wiki/Backups%23Monitoring [15:05:12] ^checking [15:05:21] (03PS1) 10Jbond: graphite: update certificate to enclude cas-graphite alt name [puppet] - 10https://gerrit.wikimedia.org/r/571734 (https://phabricator.wikimedia.org/T244861) [15:06:36] (03CR) 10Muehlenhoff: [C: 03+2] Add IDP service definition for Tendril [puppet] - 10https://gerrit.wikimedia.org/r/571732 (owner: 10Muehlenhoff) [15:06:57] (03PS2) 10Jbond: graphite: update certificate to enclude cas-graphite alt name [puppet] - 10https://gerrit.wikimedia.org/r/571734 (https://phabricator.wikimedia.org/T244861) [15:07:53] (03CR) 10jerkins-bot: [V: 04-1] graphite: update certificate to enclude cas-graphite alt name [puppet] - 10https://gerrit.wikimedia.org/r/571734 (https://phabricator.wikimedia.org/T244861) (owner: 10Jbond) [15:07:57] 10Operations, 10Core Platform Team, 10serviceops, 10Performance-Team (Radar), 10Wikimedia-production-error: Wiki diffs take over 15s to load - https://phabricator.wikimedia.org/T244058 (10Joe) >>! In T244058#5872430, @Anomie wrote: >>>! In T244058#5861825, @Joe wrote: >> Sure, they do, but IIRC that limi... 
[15:08:44] (03PS3) 10Jbond: graphite: update certificate to enclude cas-graphite alt name [puppet] - 10https://gerrit.wikimedia.org/r/571734 (https://phabricator.wikimedia.org/T244861) [15:10:36] (03PS5) 10Clarakosi: hiera: Add restbase202[123] to hiera [puppet] - 10https://gerrit.wikimedia.org/r/570094 (https://phabricator.wikimedia.org/T244178) [15:11:34] (03CR) 10Clarakosi: "> In the subject of comments we usually add the puppet module or software this commit is about" [puppet] - 10https://gerrit.wikimedia.org/r/570094 (https://phabricator.wikimedia.org/T244178) (owner: 10Clarakosi) [15:12:10] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp2017.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202002121512_vgutie... [15:15:00] (03PS1) 10Vgutierrez: ATS: Test KA on cp4031 whilst disable parent proxies are disabled [puppet] - 10https://gerrit.wikimedia.org/r/571736 (https://phabricator.wikimedia.org/T244464) [15:15:31] (03PS2) 10Vgutierrez: ATS: Test KA on cp4031 whilst parent proxies are disabled [puppet] - 10https://gerrit.wikimedia.org/r/571736 (https://phabricator.wikimedia.org/T244464) [15:16:44] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [15:17:34] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [15:17:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:49] (03CR) 10jerkins-bot: [V: 04-1] ATS: Test KA on cp4031 whilst parent proxies are disabled [puppet] - 10https://gerrit.wikimedia.org/r/571736 (https://phabricator.wikimedia.org/T244464) (owner: 10Vgutierrez) [15:19:53] (03CR) 10Volans: [C: 03+2] sre.hosts.decommission: improve Ganeti VM support [cookbooks] - 10https://gerrit.wikimedia.org/r/567169 (https://phabricator.wikimedia.org/T231068) (owner: 10Volans) [15:19:53] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:19:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:51] 10Operations, 10ops-eqiad: Degraded RAID on db1095 - https://phabricator.wikimedia.org/T245031 (10ops-monitoring-bot) [15:20:56] (03PS1) 10Itamar Givon: Add definitions for redirect badges [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571738 (https://phabricator.wikimedia.org/T235420) [15:21:12] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2007.codfw.wmnet'] ` Of which those **FAILED**: ` ['cp2007.codfw.wmnet'] ` [15:22:57] 10Operations, 10Core Platform Team, 10MediaWiki-General, 10serviceops: siteinfo api calls should be cached for N minutes on the caching layer - https://phabricator.wikimedia.org/T244204 (10Joe) >>! In T244204#5848838, @Anomie wrote: [cut] > > As for actually implementing this: HTTP caching in the Action A... 
[15:24:59] !log upgrade BIOS firmware on cloudvirt1024 to 2.4.8 T241884 [15:25:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:03] T241884: Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T241884 [15:26:35] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [15:26:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:49] (03PS3) 10Jbond: ATS: Test KA on cp4031 whilst parent proxies are disabled [puppet] - 10https://gerrit.wikimedia.org/r/571736 (https://phabricator.wikimedia.org/T244464) (owner: 10Vgutierrez) [15:28:54] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:28:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:57] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2017.codfw.wmnet'] ` Of which those **FAILED**: ` ['cp2017.codfw.wmnet'] ` [15:30:27] (03PS3) 10Volans: sre.hosts.decommission: improve Ganeti VM support [cookbooks] - 10https://gerrit.wikimedia.org/r/567169 (https://phabricator.wikimedia.org/T231068) [15:31:08] 10Operations, 10Discovery, 10Traffic, 10Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (10Lea_Lacroix_WMDE) We just [[ https://phabricator.wikimedia.org/T244722 | increased the factor to 180 ]]... 
[15:32:01] 10Operations, 10DBA: db1095 backup source crashed: broken BBU - https://phabricator.wikimedia.org/T244958 (10Marostegui) [15:32:03] 10Operations, 10ops-eqiad: Degraded RAID on db1095 - https://phabricator.wikimedia.org/T245031 (10Marostegui) [15:32:21] !log Disable event handler for db1095 RAID check on icinga - [15:32:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:31] !log Disable event handler for db1095 RAID check on icinga - T244958 [15:32:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:35] T244958: db1095 backup source crashed: broken BBU - https://phabricator.wikimedia.org/T244958 [15:33:11] (03PS4) 10Vgutierrez: ATS: Test KA on cp4031 whilst parent proxies are disabled [puppet] - 10https://gerrit.wikimedia.org/r/571736 (https://phabricator.wikimedia.org/T244464) [15:35:06] (03PS5) 10Vgutierrez: ATS: Test KA on cp4031 whilst parent proxies are disabled [puppet] - 10https://gerrit.wikimedia.org/r/571736 (https://phabricator.wikimedia.org/T244464) [15:37:41] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [15:37:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:52] (03PS6) 10Vgutierrez: ATS: Test KA on cp4031 whilst parent proxies are disabled [puppet] - 10https://gerrit.wikimedia.org/r/571736 (https://phabricator.wikimedia.org/T244464) [15:39:38] !log clearing foreign drive RAID configuration on cloudvirt1024 T241884 [15:39:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:42] T241884: Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T241884 [15:40:00] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:40:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:59] (03CR) 10Vgutierrez: "pcc looks happy: https://puppet-compiler.wmflabs.org/compiler1003/20773/" [puppet] - 10https://gerrit.wikimedia.org/r/571736 
(https://phabricator.wikimedia.org/T244464) (owner: 10Vgutierrez) [15:41:08] 10Operations, 10Traffic, 10serviceops: When fetching siteinfo from the MediaWiki API, set the `maxage` and `smaxage` parameters. - https://phabricator.wikimedia.org/T245033 (10Joe) [15:43:52] (03CR) 10Vgutierrez: [C: 03+1] vcl: merge wikimedia-common into wikimedia-frontend [puppet] - 10https://gerrit.wikimedia.org/r/571689 (https://phabricator.wikimedia.org/T241239) (owner: 10Ema) [15:46:58] !log authdns2001 - shutting down for hardware work - T242017 [15:47:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:26] 10Operations, 10Android-app-Bugs, 10Traffic, 10Wikipedia-Android-App-Backlog, and 3 others: When fetching siteinfo from the MediaWiki API, set the `maxage` and `smaxage` parameters. - https://phabricator.wikimedia.org/T245033 (10Joe) [15:47:28] (03CR) 10ArielGlenn: "Given that creativecommons redirects folks to https, maybe it's better to just explicitly specify that?" 
[puppet] - 10https://gerrit.wikimedia.org/r/571672 (https://phabricator.wikimedia.org/T244685) (owner: 10Fdans) [15:47:56] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T245035 (10ops-monitoring-bot) [15:48:33] !log pool cp20[07,17] running buster - T242093 [15:48:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:37] T242093: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 [15:48:42] (03CR) 10Ema: [C: 03+1] ATS: Test KA on cp4031 whilst parent proxies are disabled [puppet] - 10https://gerrit.wikimedia.org/r/571736 (https://phabricator.wikimedia.org/T244464) (owner: 10Vgutierrez) [15:49:02] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T245035 (10JHedden) [15:49:04] !log spicerack upgraded to 0.0.30-1 on both cumin hosts [15:49:06] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10cloud-services-team (Hardware): Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T241884 (10JHedden) [15:49:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:48] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10Vgutierrez) [15:50:13] !log depool cp20[06,14] and reimage as buster - T242093 [15:50:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:42] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp2006.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202002121550_vgutie... [15:50:47] 10Operations, 10Android-app-Bugs, 10Traffic, 10Wikipedia-Android-App-Backlog, and 3 others: When fetching siteinfo from the MediaWiki API, set the `maxage` and `smaxage` parameters. 
- https://phabricator.wikimedia.org/T245033 (10Joe) [15:51:33] (03CR) 10Jbond: [C: 03+2] graphite: update certificate to enclude cas-graphite alt name [puppet] - 10https://gerrit.wikimedia.org/r/571734 (https://phabricator.wikimedia.org/T244861) (owner: 10Jbond) [15:51:41] (03CR) 10Vgutierrez: [C: 03+2] ATS: Test KA on cp4031 whilst parent proxies are disabled [puppet] - 10https://gerrit.wikimedia.org/r/571736 (https://phabricator.wikimedia.org/T244464) (owner: 10Vgutierrez) [15:53:43] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp2014.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202002121553_vgutie... [15:54:54] PROBLEM - Host authdns2001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:56:16] !log Enable KA and disable parent proxies on cp4031 - T244464 [15:56:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:20] T244464: Investigate side-effects of enabling KA between ats-tls and varnish-fe - https://phabricator.wikimedia.org/T244464 [15:56:34] PROBLEM - Zookeeper Alive Client Connections too high on an-conf1001 is CRITICAL: 1091 ge 1024 https://wikitech.wikimedia.org/wiki/Zookeeper https://grafana.wikimedia.org/dashboard/db/zookeeper?refresh=5m&orgId=1&panelId=6&fullscreen [15:56:41] (03PS3) 10Fdans: dumps:Add license footer to all analytics dumps pages [puppet] - 10https://gerrit.wikimedia.org/r/571672 (https://phabricator.wikimedia.org/T244685) [15:59:04] 10Operations, 10Android-app-Bugs, 10Traffic, 10Wikipedia-Android-App-Backlog, and 3 others: When fetching siteinfo from the MediaWiki API, set the `maxage` and `smaxage` parameters. - https://phabricator.wikimedia.org/T245033 (10Dbrant) @Joe Thanks for that! We'll update our code asap. Since we use the `si... 
[15:59:32] RECOVERY - Zookeeper Alive Client Connections too high on an-conf1001 is OK: (C)1024 ge (W)512 ge 0 https://wikitech.wikimedia.org/wiki/Zookeeper https://grafana.wikimedia.org/dashboard/db/zookeeper?refresh=5m&orgId=1&panelId=6&fullscreen [16:00:32] RECOVERY - Host authdns2001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 45.15 ms [16:02:38] (03PS1) 10Jbond: profile::graphite::production correct idp client web template [puppet] - 10https://gerrit.wikimedia.org/r/571747 (https://phabricator.wikimedia.org/T244861) [16:03:12] (03PS2) 10Jbond: profile::graphite::production correct idp client web template [puppet] - 10https://gerrit.wikimedia.org/r/571747 (https://phabricator.wikimedia.org/T244861) [16:03:19] 10Operations, 10conftool: Enforce in dbctl that core sections and es clusters always have at least two replicas - https://phabricator.wikimedia.org/T245036 (10Marostegui) [16:03:47] (03PS1) 10Giuseppe Lavagetto: mw1261: enable forensic log on one host [puppet] - 10https://gerrit.wikimedia.org/r/571748 [16:05:09] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [16:05:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:27] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: es1019: reseat IPMI - https://phabricator.wikimedia.org/T243963 (10Cmjohnson) @Marostegui We can upgrade the f/w. That can be anytime, please pick a convenient date for you. [16:05:53] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mw1261: enable forensic log on one host [puppet] - 10https://gerrit.wikimedia.org/r/571748 (owner: 10Giuseppe Lavagetto) [16:06:00] 10Operations, 10ops-eqiad, 10ops-esams, 10DC-Ops, 10Traffic: Upgrade BIOS and IDRAC firmware on R440 cp systems - https://phabricator.wikimedia.org/T243167 (10RobH) Please note the outage caused the SAL of my adding back cp108[67] to service, (as the rest weren't really returned, but depooled.) 
[16:06:15] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: es1019: reseat IPMI - https://phabricator.wikimedia.org/T243963 (10Marostegui) >>! In T243963#5877162, @Cmjohnson wrote: > @Marostegui We can upgrade the f/w. That can be anytime, please pick a convenient date for you. Can we do it tomorrow at the most conveni... [16:06:19] 10Operations, 10ops-eqiad, 10ops-esams, 10DC-Ops, 10Traffic: Upgrade BIOS and IDRAC firmware on R440 cp systems - https://phabricator.wikimedia.org/T243167 (10RobH) [16:06:35] (03PS3) 10Jbond: profile::graphite::production correct idp client web template [puppet] - 10https://gerrit.wikimedia.org/r/571747 (https://phabricator.wikimedia.org/T244861) [16:07:25] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:07:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:33] (03PS1) 10Muehlenhoff: Switch cloudnet2002-dev and cloudweb2001-dev to standard Partman recipes [puppet] - 10https://gerrit.wikimedia.org/r/571749 (https://phabricator.wikimedia.org/T156955) [16:08:05] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [16:08:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:13] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: es1019: reseat IPMI - https://phabricator.wikimedia.org/T243963 (10Marostegui) @Cmjohnson I can have the host depooled and off tomorrow in the UTC morning so you can do it whenever you can tomorrow, and once done, just power it back on. Would that work? 
[16:08:34] (03CR) 10jerkins-bot: [V: 04-1] Switch cloudnet2002-dev and cloudweb2001-dev to standard Partman recipes [puppet] - 10https://gerrit.wikimedia.org/r/571749 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [16:10:01] (03PS1) 10Filippo Giunchedi: prometheus: bump snmp scrape timeout [puppet] - 10https://gerrit.wikimedia.org/r/571750 [16:10:25] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:10:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:33] (03PS2) 10Muehlenhoff: Switch cloudnet2002-dev and cloudweb2001-dev to standard Partman recipes [puppet] - 10https://gerrit.wikimedia.org/r/571749 (https://phabricator.wikimedia.org/T156955) [16:11:37] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: bump snmp scrape timeout [puppet] - 10https://gerrit.wikimedia.org/r/571750 (owner: 10Filippo Giunchedi) [16:12:15] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2006.codfw.wmnet'] ` and were **ALL** successful. [16:13:18] (03CR) 10Jbond: [C: 03+2] profile::graphite::production correct idp client web template [puppet] - 10https://gerrit.wikimedia.org/r/571747 (https://phabricator.wikimedia.org/T244861) (owner: 10Jbond) [16:13:25] (03CR) 10Giuseppe Lavagetto: "The change is technically correct; I'd like to get confirmation that we really don't need gd anymore from a few developers too." [puppet] - 10https://gerrit.wikimedia.org/r/526255 (https://phabricator.wikimedia.org/T227734) (owner: 10Jforrester) [16:14:53] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2014.codfw.wmnet'] ` and were **ALL** successful. 
[16:15:56] 10Operations, 10ops-eqiad, 10DC-Ops: (2020-01-15) rack/setup/install mc-gp100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T244886 (10jijiki) @RobH, those hosts are delivered and in production per T241795 [16:16:29] ohh [16:17:16] 10Operations, 10SRE-Access-Requests: Requesting access to ops group for dwisehaupt - https://phabricator.wikimedia.org/T244901 (10Dwisehaupt) @jbond Yes, the public key I included in the request was created specifically for this access and is not shared/used for anything else. Here it is again: ssh-ed25519 AA... [16:17:55] 10Operations, 10serviceops: (Need By: Jan 10) rack/setup/install mc-gp100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T241795 (10RobH) [16:17:58] 10Operations, 10serviceops: (Need By: Jan 10) rack/setup/install mc-gp100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T241795 (10RobH) [16:18:13] 10Operations, 10ops-eqiad, 10DC-Ops: (2020-01-15) rack/setup/install mc-gp100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T244886 (10RobH) 05Open→03Invalid >>! In T244886#5877202, @jijiki wrote: > @RobH, those hosts are delivered and in production per T241795 Indeed, it seems T241795 was set as... [16:19:07] 10Operations, 10SRE-Access-Requests: Requesting access to ops group for dwisehaupt - https://phabricator.wikimedia.org/T244901 (10jbond) >>! In T244901#5877205, @Dwisehaupt wrote: > @jbond Yes, the public key I included in the request was created specifically for this access and is not shared/used for anything... 
[16:19:25] 10Operations, 10SRE-Access-Requests: Requesting access to ops group for dwisehaupt - https://phabricator.wikimedia.org/T244901 (10jbond) [16:20:44] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): labvirt1009 HP Raid alert - https://phabricator.wikimedia.org/T198479 (10aborrero) Mention: {T244986} [16:21:05] (03PS2) 10Giuseppe Lavagetto: profile::mediawiki::php: allow varying the slowlog limit [puppet] - 10https://gerrit.wikimedia.org/r/570256 [16:21:29] (03CR) 10Alexandros Kosiaris: [C: 04-1] Migrate changeprop & cpjobqueue to kubernetes (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/554576 (owner: 10Holger Knust) [16:24:28] 10Operations, 10SRE-Access-Requests: Requesting access to ops group for dwisehaupt - https://phabricator.wikimedia.org/T244901 (10Dwisehaupt) Not a problem. Thanks for your help with this. [16:25:17] (03CR) 10Ema: [C: 03+1] cache: add new mapping for cas-graphite [puppet] - 10https://gerrit.wikimedia.org/r/571491 (https://phabricator.wikimedia.org/T244861) (owner: 10Jbond) [16:25:42] (03PS1) 10Jgreen: adjust frack nsca monitoring: remove americium, add frban1001 [puppet] - 10https://gerrit.wikimedia.org/r/571752 (https://phabricator.wikimedia.org/T234068) [16:26:07] (03CR) 10Giuseppe Lavagetto: [C: 03+2] profile::mediawiki::php: allow varying the slowlog limit [puppet] - 10https://gerrit.wikimedia.org/r/570256 (owner: 10Giuseppe Lavagetto) [16:26:26] 10Operations, 10fundraising-tech-ops, 10Patch-For-Review: (ASAP) rack/setup/install frban1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T234068 (10Jgreen) [16:27:01] (03PS1) 10Jbond: cas-graphite: add new domain to dns [dns] - 10https://gerrit.wikimedia.org/r/571753 [16:27:22] 10Operations, 10fundraising-tech-ops, 10Patch-For-Review: (ASAP) rack/setup/install frban1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T234068 (10Jgreen) [16:27:50] 10Operations, 10fundraising-tech-ops, 10Patch-For-Review: (ASAP) rack/setup/install 
frban1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T234068 (10Jgreen) 05Open→03Resolved Switched as of approximately 16:00 UTC today. [16:28:35] (03CR) 10Jbond: [C: 03+2] cache: add new mapping for cas-graphite [puppet] - 10https://gerrit.wikimedia.org/r/571491 (https://phabricator.wikimedia.org/T244861) (owner: 10Jbond) [16:28:56] (03PS3) 10Jbond: cache: add new mapping for cas-graphite [puppet] - 10https://gerrit.wikimedia.org/r/571491 (https://phabricator.wikimedia.org/T244861) [16:28:58] (03CR) 10Jgreen: [C: 03+2] adjust frack nsca monitoring: remove americium, add frban1001 [puppet] - 10https://gerrit.wikimedia.org/r/571752 (https://phabricator.wikimedia.org/T234068) (owner: 10Jgreen) [16:30:44] (03PS1) 10BBlack: authdns2001: update mac for 10G card [puppet] - 10https://gerrit.wikimedia.org/r/571758 (https://phabricator.wikimedia.org/T242017) [16:31:48] 10Operations, 10DC-Ops, 10decommission, 10fundraising-tech-ops: decommission americium.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T245038 (10Jgreen) [16:32:10] (03CR) 10BBlack: [C: 03+2] authdns2001: update mac for 10G card [puppet] - 10https://gerrit.wikimedia.org/r/571758 (https://phabricator.wikimedia.org/T242017) (owner: 10BBlack) [16:32:42] 10Operations, 10Core Platform Team, 10MediaWiki-API, 10Traffic: Evaluate the feasibility of cache invalidation for the action API - https://phabricator.wikimedia.org/T122867 (10Demian) [16:32:50] 10Operations, 10DC-Ops, 10decommission, 10fundraising-tech-ops: decommission americium.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T245038 (10Jgreen) Waiting a few days to make sure kafkatee is behaving properly on frban1001. 
[16:33:26] 10Operations, 10DC-Ops, 10decommission, 10fundraising-tech-ops: decommission americium.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T245038 (10Jgreen) [16:36:08] 10Operations, 10Elasticsearch, 10Traffic, 10Discovery-Search (Current work), and 2 others: Sustained periods (2-4h) of bad latency on production-search eqiad - https://phabricator.wikimedia.org/T241421 (10TJones) 05Open→03Resolved a:03TJones [16:42:35] Is phab borked for anyone else? [16:42:38] im not getting images [16:42:58] worked for me, what task? [16:43:30] yeah, all of it is borked for me [16:43:33] no images loading [16:43:37] time for cache clear and reboot. [16:43:49] got an example ticket with image? [16:43:59] 10Operations, 10DBA, 10Phabricator, 10Release-Engineering-Team (Development services): Upgrade and restart m3 (phabricator) master (db1128) - https://phabricator.wikimedia.org/T244566 (10mmodell) Hey @Marostegui, How about tomorrow? I can be around tomorrow, if you'd like. If you'd like to do it at your le... [16:44:07] i mean the sidebar images [16:44:10] the css stuff [16:44:12] ah [16:44:14] its all just text ;D [16:45:19] 10Operations, 10DBA, 10Phabricator, 10Release-Engineering-Team (Development services): Upgrade and restart m3 (phabricator) master (db1128) - https://phabricator.wikimedia.org/T244566 (10Marostegui) >>! In T244566#5877381, @mmodell wrote: > Hey @Marostegui, How about tomorrow? I can be around tomorrow, if... [16:48:03] 10Operations, 10ops-eqiad, 10serviceops: (Need By Dec 20) rack/setup/install mw13[49-84].eqiad.wmnet - https://phabricator.wikimedia.org/T236437 (10mark) @wiki_willy With Chris having been ill the past few days, what's a realistic new ETA for this? 
[16:49:17] !log installing openjpeg2 security updates [16:49:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:35] !log bblack@cumin1001 START - Cookbook sre.hosts.downtime [16:51:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:10] (03PS3) 10CRusnov: puppetdb: Improve structure and separate VMs and devices [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/565176 [16:52:14] !log pool cp20[06,14] running buster - T242093 [16:52:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:21] T242093: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 [16:52:40] 10Operations, 10ops-eqiad, 10DBA: db1095 backup source crashed: broken BBU - https://phabricator.wikimedia.org/T244958 (10jcrespo) a:05jcrespo→03wiki_willy Battery of db1095, our of warranty, is toasted. It would be nice not throw away the whole server for just the RAID battery. Could we order one? For... 
[16:52:41] (03CR) 10Filippo Giunchedi: [C: 03+1] Switch cloudnet2002-dev and cloudweb2001-dev to standard Partman recipes [puppet] - 10https://gerrit.wikimedia.org/r/571749 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [16:52:48] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10Vgutierrez) [16:53:49] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:53:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:10] 10Operations: Integrate Buster 10.3 point update - https://phabricator.wikimedia.org/T244693 (10MoritzMuehlenhoff) [16:54:37] (03PS4) 10CRusnov: puppetdb: Improve structure and separate VMs and devices [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/565176 [17:00:01] !log depool cp40[26,32] [17:00:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:50] !log rolling back cp4026 and cp4032 to trafficserver 8.0.5-1wm15 [17:01:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:34] (03PS1) 10Vgutierrez: Revert "ATS: Test KA on cp4031 whilst parent proxies are disabled" [puppet] - 10https://gerrit.wikimedia.org/r/571763 [17:07:28] (03PS2) 10Vgutierrez: Revert "ATS: Test KA on cp4031 whilst parent proxies are disabled" [puppet] - 10https://gerrit.wikimedia.org/r/571763 (https://phabricator.wikimedia.org/T244464) [17:08:01] PROBLEM - traffic_server backend process restarted on cp3052 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=esams+prometheus/ops&var-instance=cp3052&var-layer=backend [17:08:59] 10Operations, 10ops-eqiad, 10serviceops: (Need By Dec 20) rack/setup/install mw13[49-84].eqiad.wmnet - https://phabricator.wikimedia.org/T236437 (10wiki_willy) @mark - just chatted with Chris and he's working on them now (he's back today 
after being sick), so ETA is end of day. Thanks, Willy [17:09:03] (03CR) 10Vgutierrez: [C: 03+2] Revert "ATS: Test KA on cp4031 whilst parent proxies are disabled" [puppet] - 10https://gerrit.wikimedia.org/r/571763 (https://phabricator.wikimedia.org/T244464) (owner: 10Vgutierrez) [17:09:39] !log disabling KA between ats-tls and varnish-fe on cp4031 - T244464 [17:09:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:51] T244464: Investigate side-effects of enabling KA between ats-tls and varnish-fe - https://phabricator.wikimedia.org/T244464 [17:10:21] 10Operations, 10DBA, 10Growth-Team, 10StructuredDiscussions, 10WorkType-Maintenance: Setup separate logical External Store for Flow in production - https://phabricator.wikimedia.org/T107610 (10Anomie) This blocks {T106363} which blocks {T106386}. If we want to do T106386 then this needs to be done. I do... [17:11:23] !log restarting jenkins for updates [17:11:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:37] 10Operations, 10ops-eqiad, 10DBA: db1095 backup source crashed: broken BBU - https://phabricator.wikimedia.org/T244958 (10RobH) a:05wiki_willy→03Jclark-ctr Please note that we just ordered replacement raid batteries for HP Gen9 raid controllers via T243547. @jclark-ctr: Please use one of the batteries... [17:20:07] 10Operations, 10Wikibugs: wikibugs needs restart almost everyday - https://phabricator.wikimedia.org/T241109 (10valhallasw) Hmm, odd. ` 2020-02-12 12:43:01,182 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 58 2020-02-12 12:43:02,311 - urllib3.conn... 
[17:22:54] !log ns1.wikimedia.org - re-route back to original authdns2001 destination [17:22:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:19] !log upgrade RAID firmware on cloudvirt1024 to 25.5.6.0009 T241884 [17:27:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:23] T241884: Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T241884 [17:35:56] I'm going to deploy a wmf.19 back-port. [17:36:44] jouncebot: next [17:36:44] In 1 hour(s) and 23 minute(s): Morning SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200212T1900) [17:36:48] Cool. [17:37:45] (03CR) 10ArielGlenn: [C: 03+1] "Looks fine to me." [puppet] - 10https://gerrit.wikimedia.org/r/571672 (https://phabricator.wikimedia.org/T244685) (owner: 10Fdans) [17:38:24] 10Operations, 10Security Concept Review, 10Security Supplier Assessments, 10Security-Team, 10User-jbond: Determine any impacts to SRE from OIT's planned move to JumpCloud for LDAP - https://phabricator.wikimedia.org/T244792 (10HMarcus) Is there a reason why the SRE mx servers need to be the authority for... [17:39:32] 10Operations, 10Wikibugs: wikibugs needs restart almost everyday - https://phabricator.wikimedia.org/T241109 (10Marostegui) Thank you! [17:43:37] 10Operations, 10ops-eqsin: update PDUs for eqsin (asset tag and other info) - https://phabricator.wikimedia.org/T211368 (10RobH) 05Open→03Resolved Ok, this task was to track the old PDUs, which have been replaced. The new PDUs are in accounting sheet, but lack asset tags, as tracked by T244900. Those new... 
[17:48:01] (03CR) 10Joal: [C: 03+1] "Thanks Francisco :)" [puppet] - 10https://gerrit.wikimedia.org/r/571672 (https://phabricator.wikimedia.org/T244685) (owner: 10Fdans) [17:48:29] 10Operations, 10Security-Team, 10User-jbond: Determine any impacts to SRE from OIT's planned move to JumpCloud for LDAP - https://phabricator.wikimedia.org/T244792 (10chasemp) [17:52:15] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 86 jobs https://wikitech.wikimedia.org/wiki/Backups%23Monitoring [17:53:04] I don't know why incrementals had stopped [17:53:14] but it seems to have something to do with purges [17:54:02] I am generating a new full just in case [17:54:25] if it is affecting other hosts, alerting will tell [17:54:33] but things look ok for now [17:55:33] 10Operations, 10ops-codfw, 10fundraising-tech-ops: codfw:fundraising single-cpu misc servers - https://phabricator.wikimedia.org/T244950 (10Jgreen) [17:57:12] 10Operations, 10ops-codfw, 10fundraising-tech-ops: codfw:fundraising single-cpu misc servers - https://phabricator.wikimedia.org/T244950 (10Jgreen) [17:57:21] !log jforrester@deploy1001 Synchronized php-1.35.0-wmf.19/includes/pager/IndexPager.php: T244941 IndexPager: Cast properties passed to implode to arrays (duration: 01m 03s) [17:57:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:26] T244941: IndexPager: Cast properties passed to implode to arrays - https://phabricator.wikimedia.org/T244941 [17:57:36] 10Operations, 10ops-codfw, 10fundraising-tech-ops: codfw:fundraising single-cpu misc servers - https://phabricator.wikimedia.org/T244950 (10Jgreen) a:05Jgreen→03Papaul [17:57:54] 10Operations, 10Goal: Followup to backup1001 bacula switchover (misc pending tasks) - https://phabricator.wikimedia.org/T238048 (10jcrespo) Apparently, databases pool got enlarged, but production one is still on 1 month to purge. Needs checking to increase it too to 3 months, there is space available for that. 
[18:03:01] 10Operations, 10Security-Team, 10User-jbond: Determine any impacts to SRE from OIT's planned move to JumpCloud for LDAP - https://phabricator.wikimedia.org/T244792 (10Dzahn) >>! In T244792#5877631, @HMarcus wrote: > the mail team is already offloading aliases from the mx servers to Google per T122144 That... [18:04:39] 10Operations, 10Goal: Followup to backup1001 bacula switchover (misc pending tasks) - https://phabricator.wikimedia.org/T238048 (10jcrespo) For some reason, the pool was updated, but not every volumne. I run update pool from resource, and then "all volumnes from pool", and it got applied. Will need to monitor... [18:08:16] 10Operations, 10observability, 10Availability, 10Goal, 10Patch-For-Review: Setup bacula backup monitoring - https://phabricator.wikimedia.org/T234900 (10jcrespo) I am wondering if 0 byte incremental backups are purged as an optimization, or we had another error that caused those to disapper. If that were... [18:28:07] 10Operations, 10ops-codfw, 10fundraising-tech-ops: codfw:fundraising single-cpu misc servers - https://phabricator.wikimedia.org/T244950 (10Papaul) p:05Triage→03Medium [18:31:39] !log irc2001 - manually run the "${v6_token_cmd} && ${v6_flush_dyn_cmd}" commands from interface::add_ip6_mapped to debug 'Interface::Add_ip6_mapped[main]/Augeas[ens5_v6_token]: Could not evaluate: Saving failed' but it does not reproduce the puppet error ... 
(T244719) [18:31:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:43] T244719: Create a replacement for kraz.wikimedia.org - https://phabricator.wikimedia.org/T244719 [18:31:45] (03PS1) 10BryanDavis: k8s shell: fix logic bug in guard for new Pod creation [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/571778 (https://phabricator.wikimedia.org/T244954) [18:31:50] 10Operations, 10DBA, 10Phabricator, 10Release-Engineering-Team (Development services): Upgrade and restart m3 (phabricator) master (db1128) - https://phabricator.wikimedia.org/T244566 (10mmodell) @Marostegui Thursday 6:00 AM works for me. [18:32:02] ./away .. will be back later. gotta move [18:32:08] (03PS5) 10CRusnov: puppetdb: Improve structure and separate VMs and devices [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/565176 [18:32:34] (03PS6) 10CRusnov: puppetdb: Improve structure and separate VMs and devices [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/565176 [18:33:02] 10Operations, 10DBA, 10Phabricator, 10Release-Engineering-Team (Development services): Upgrade and restart m3 (phabricator) master (db1128) - https://phabricator.wikimedia.org/T244566 (10Marostegui) >>! In T244566#5877826, @mmodell wrote: > @Marostegui Thursday 6:00 AM works for me. Excellent! See you to... [18:33:05] (03CR) 10Bstorm: [C: 03+1] k8s shell: fix logic bug in guard for new Pod creation [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/571778 (https://phabricator.wikimedia.org/T244954) (owner: 10BryanDavis) [18:34:40] (03CR) 10BryanDavis: [C: 03+2] k8s shell: fix logic bug in guard for new Pod creation [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/571778 (https://phabricator.wikimedia.org/T244954) (owner: 10BryanDavis) [18:36:05] (03CR) 10Volans: [C: 03+1] "Discussed and amended on IRC. LGTM for what can be checked here, this is not easily testable in the test instance of Netbox." 
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/565176 (owner: 10CRusnov) [18:36:07] (03Merged) 10jenkins-bot: k8s shell: fix logic bug in guard for new Pod creation [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/571778 (https://phabricator.wikimedia.org/T244954) (owner: 10BryanDavis) [18:36:32] (03CR) 10CRusnov: [C: 03+2] puppetdb: Improve structure and separate VMs and devices [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/565176 (owner: 10CRusnov) [18:40:04] 10Operations, 10DBA, 10Growth-Team, 10StructuredDiscussions, 10WorkType-Maintenance: Setup separate logical External Store for Flow in production - https://phabricator.wikimedia.org/T107610 (10kchapman) @jcrespo do you still want us to do the compression in T106386? Are the storage constraints still rele... [18:42:25] 10Operations, 10Analytics, 10serviceops, 10vm-requests, 10User-Elukey: Create a replacement for kraz.wikimedia.org - https://phabricator.wikimedia.org/T244719 (10Dzahn) ` Debug: Augeas[ens5_v6_token](provider=augeas): sending command 'set' with params ["/files/etc/network/interfaces/iface[. = 'ens5']/pre... [18:45:22] 10Operations, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Rename multimedia-team to structured-data-team - https://phabricator.wikimedia.org/T235550 (10MBinder_WMF) Hi, folks! I haven't heard much about this recently, and wanted to see what remained to be done, and if I could help unblock anything. :) [18:46:33] (03PS1) 10Volans: ganeti: use canonical cluster names [software/spicerack] - 10https://gerrit.wikimedia.org/r/571780 (https://phabricator.wikimedia.org/T231068) [18:46:34] 10Operations, 10Analytics, 10serviceops, 10vm-requests, 10User-Elukey: Create a replacement for kraz.wikimedia.org - https://phabricator.wikimedia.org/T244719 (10Dzahn) The primary network interface is missing from /etc/network/interfaces. There is only loopback in there. Why that is is another question.... 
[18:47:26] 10Operations, 10DBA, 10Growth-Team, 10StructuredDiscussions, 10WorkType-Maintenance: Setup separate logical External Store for Flow in production - https://phabricator.wikimedia.org/T107610 (10Marostegui) >>! In T107610#5877888, @kchapman wrote: > @jcrespo do you still want us to do the compression in T1... [18:48:23] (03PS1) 10BryanDavis: d/changelog: prepare 0.63 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/571781 [18:50:11] (03CR) 10BryanDavis: [V: 03+1 C: 03+2] "Created with `gbp dch --git-author --id-length=7 --since=68afa79`" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/571781 (owner: 10BryanDavis) [18:53:17] (03Merged) 10jenkins-bot: d/changelog: prepare 0.63 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/571781 (owner: 10BryanDavis) [19:00:04] RoanKattouw, Niharika, and Urbanecm: (Dis)respected human, time to deploy Morning SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200212T1900). Please do the needful. [19:00:04] No GERRIT patches in the queue for this window AFAICS. [19:04:06] (03CR) 10Dzahn: [C: 03+2] introduce new role to install nginx and APT repo without DHCP/TFTP [puppet] - 10https://gerrit.wikimedia.org/r/570971 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [19:04:16] (03PS2) 10Dzahn: introduce new role to install nginx and APT repo without DHCP/TFTP [puppet] - 10https://gerrit.wikimedia.org/r/570971 (https://phabricator.wikimedia.org/T224576) [19:08:29] 10Operations, 10Security-Team, 10User-jbond: Determine any impacts to SRE from OIT's planned move to JumpCloud for LDAP - https://phabricator.wikimedia.org/T244792 (10HMarcus) Thanks for clarifying that Daniel, and I was actually going to follow up on that ticket with you this week. That's very helpful, and... 
[19:08:33] (03PS2) 10Jforrester: Replace deprecated IP class with IPUtils [mediawiki-config] - 10https://gerrit.wikimedia.org/r/567193 (https://phabricator.wikimedia.org/T242556) (owner: 10Ammarpad) [19:08:44] (03CR) 10Jforrester: [C: 03+2] Replace deprecated IP class with IPUtils [mediawiki-config] - 10https://gerrit.wikimedia.org/r/567193 (https://phabricator.wikimedia.org/T242556) (owner: 10Ammarpad) [19:09:52] (03Merged) 10jenkins-bot: Replace deprecated IP class with IPUtils [mediawiki-config] - 10https://gerrit.wikimedia.org/r/567193 (https://phabricator.wikimedia.org/T242556) (owner: 10Ammarpad) [19:11:41] (03PS1) 10Cmjohnson: Adding mgmt dns for mw13[49-84].eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/571782 (https://phabricator.wikimedia.org/T236437) [19:12:02] (03CR) 10jerkins-bot: [V: 04-1] Adding mgmt dns for mw13[49-84].eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/571782 (https://phabricator.wikimedia.org/T236437) (owner: 10Cmjohnson) [19:12:27] (03PS3) 10Jforrester: [fywiktionary] Set a local wgSitename [mediawiki-config] - 10https://gerrit.wikimedia.org/r/562517 (https://phabricator.wikimedia.org/T241883) (owner: 10Gerrit Patch Uploader) [19:12:29] !log jforrester@deploy1001 Synchronized wmf-config/throttle-analyze.php: Replace deprecated IP class with IPUtils (no-op sync) (duration: 01m 03s) [19:12:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:03] (03CR) 10Jforrester: [C: 03+2] [fywiktionary] Set a local wgSitename [mediawiki-config] - 10https://gerrit.wikimedia.org/r/562517 (https://phabricator.wikimedia.org/T241883) (owner: 10Gerrit Patch Uploader) [19:15:05] (03PS2) 10Cmjohnson: Adding mgmt dns for mw13[49-84].eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/571782 (https://phabricator.wikimedia.org/T236437) [19:15:26] (03CR) 10jerkins-bot: [V: 04-1] Adding mgmt dns for mw13[49-84].eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/571782 (https://phabricator.wikimedia.org/T236437) 
(owner: 10Cmjohnson) [19:15:59] (03Merged) 10jenkins-bot: [fywiktionary] Set a local wgSitename [mediawiki-config] - 10https://gerrit.wikimedia.org/r/562517 (https://phabricator.wikimedia.org/T241883) (owner: 10Gerrit Patch Uploader) [19:18:37] (03PS3) 10Cmjohnson: Adding mgmt dns for mw13[49-84].eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/571782 (https://phabricator.wikimedia.org/T236437) [19:19:39] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T241883 [fywiktionary] Set a local wgSitename (duration: 01m 03s) [19:19:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:43] T241883: Introducing $wgSitename and $wgGrammarForms on fy.wiktionary.org - https://phabricator.wikimedia.org/T241883 [19:23:44] (03CR) 10Cmjohnson: [C: 03+2] Adding mgmt dns for mw13[49-84].eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/571782 (https://phabricator.wikimedia.org/T236437) (owner: 10Cmjohnson) [19:24:23] (03PS1) 10CRusnov: netbox: Configure report checks for PuppetDB report rearrangement [puppet] - 10https://gerrit.wikimedia.org/r/571783 [19:25:01] (03PS3) 10Jforrester: Change the timezone of newiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571118 (https://phabricator.wikimedia.org/T244205) (owner: 10Tulsi Bhagat) [19:26:44] (03CR) 10Jforrester: [C: 03+2] Change the timezone of newiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571118 (https://phabricator.wikimedia.org/T244205) (owner: 10Tulsi Bhagat) [19:27:23] (03PS3) 10Jforrester: Localise $wgMetaNamespace for mywiki and mywiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571712 (https://phabricator.wikimedia.org/T244980) (owner: 10MarcoAurelio) [19:27:41] (03CR) 10CRusnov: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/571783 (owner: 10CRusnov) [19:27:46] (03Merged) 10jenkins-bot: Change the timezone of newiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571118 (https://phabricator.wikimedia.org/T244205) (owner: 10Tulsi Bhagat) [19:29:57] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/571783 (owner: 10CRusnov) [19:30:14] (03CR) 10CRusnov: [C: 03+2] netbox: Configure report checks for PuppetDB report rearrangement [puppet] - 10https://gerrit.wikimedia.org/r/571783 (owner: 10CRusnov) [19:30:17] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T244205 [newiki] Set local timezone to Kathmandu (duration: 01m 03s) [19:30:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:22] T244205: Change the timezone of Nepali Wikipedia - https://phabricator.wikimedia.org/T244205 [19:31:38] (03CR) 10Jforrester: [C: 03+2] Localise $wgMetaNamespace for mywiki and mywiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571712 (https://phabricator.wikimedia.org/T244980) (owner: 10MarcoAurelio) [19:32:34] (03CR) 10jerkins-bot: [V: 04-1] Localise $wgMetaNamespace for mywiki and mywiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571712 (https://phabricator.wikimedia.org/T244980) (owner: 10MarcoAurelio) [19:32:39] 10Operations, 10ops-codfw: codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (10Papaul) [19:32:57] Ooh. [19:33:25] (03CR) 10Jforrester: [C: 03+2] "That's novel." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/571712 (https://phabricator.wikimedia.org/T244980) (owner: 10MarcoAurelio) [19:34:32] (03Merged) 10jenkins-bot: Localise $wgMetaNamespace for mywiki and mywiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571712 (https://phabricator.wikimedia.org/T244980) (owner: 10MarcoAurelio) [19:37:05] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T244980 Localise $wgMetaNamespace for mywiki and mywiktionary (duration: 01m 03s) [19:37:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:09] T244980: Localize project namespaces for Burmese (mywiki, mywikt) - https://phabricator.wikimedia.org/T244980 [19:37:32] 10Operations, 10MediaWiki-API, 10Traffic: Evaluate the feasibility of cache invalidation for the action API - https://phabricator.wikimedia.org/T122867 (10WDoranWMF) [19:38:00] 10Operations, 10ORES, 10Scoring-platform-team (Current): ORES uwsgi consumes a large amount of memory and CPU when shutting down (as part of a restart) - https://phabricator.wikimedia.org/T242705 (10Halfak) [19:38:05] !log Ran mwscript maintenance/namespaceDupes.php --wiki=mywiki --fix and mwscript maintenance/namespaceDupes.php --wiki=mywiktionary --fix on mwmaint1002 [19:38:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:57] (03PS4) 10Jforrester: Enable new user message for auto-created accounts on zh_classical wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/567306 (https://phabricator.wikimedia.org/T243509) (owner: 10Ammarpad) [19:39:26] (03CR) 10Jforrester: [C: 03+2] Enable new user message for auto-created accounts on zh_classical wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/567306 (https://phabricator.wikimedia.org/T243509) (owner: 10Ammarpad) [19:39:56] 10Operations, 10Analytics, 10serviceops, 10vm-requests, 10User-Elukey: Create a replacement for kraz.wikimedia.org - 
https://phabricator.wikimedia.org/T244719 (10MoritzMuehlenhoff) Given that Luca also had an error during initial setup related to name resolution, this sounds like some error related to th... [19:40:24] (03Merged) 10jenkins-bot: Enable new user message for auto-created accounts on zh_classical wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/567306 (https://phabricator.wikimedia.org/T243509) (owner: 10Ammarpad) [19:41:08] (03PS4) 10Jforrester: Add assignment of 'mover' group to bureaucrats on itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566925 (https://phabricator.wikimedia.org/T243503) (owner: 10Ammarpad) [19:41:37] (03PS5) 10Jforrester: [itwiki] Move assignment of 'mover' group from sysops to bureaucrats [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566925 (https://phabricator.wikimedia.org/T243503) (owner: 10Ammarpad) [19:42:05] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T243509 [zh_classicalwiki] Enable new user message for auto-created accounts (duration: 01m 03s) [19:42:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:09] T243509: Site requests for zh-classical wikipedia - https://phabricator.wikimedia.org/T243509 [19:43:21] 10Operations, 10ORES, 10Scoring-platform-team (Current): ORES uwsgi consumes a large amount of memory and CPU when shutting down (as part of a restart) - https://phabricator.wikimedia.org/T242705 (10Halfak) OK so I've done some tests. It's clear that we can see this CPU/memory spike when shutting down uwsgi... 
[19:44:58] (03CR) 10Jforrester: [C: 03+2] [itwiki] Move assignment of 'mover' group from sysops to bureaucrats [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566925 (https://phabricator.wikimedia.org/T243503) (owner: 10Ammarpad) [19:46:07] (03Merged) 10jenkins-bot: [itwiki] Move assignment of 'mover' group from sysops to bureaucrats [mediawiki-config] - 10https://gerrit.wikimedia.org/r/566925 (https://phabricator.wikimedia.org/T243503) (owner: 10Ammarpad) [19:47:20] (03PS2) 10Jforrester: Don't use hex escapes for non-ASCII characters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451004 (owner: 10PleaseStand) [19:47:46] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T243503 [itwiki] Move assignment of 'mover' group from sysops to bureaucrats (duration: 01m 02s) [19:47:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:50] T243503: Change of mover assignment on it.wikipedia - https://phabricator.wikimedia.org/T243503 [19:48:37] (03CR) 10Jforrester: [C: 03+2] "We already have 'è' chars for this here, e.g. for frpwiki and ocwiki." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/451004 (owner: 10PleaseStand) [19:49:42] (03Abandoned) 10Jforrester: Split make wikidataclient dblist compute from www and test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470417 (owner: 10Addshore) [19:49:44] (03Abandoned) 10Jforrester: Consolidate database lists list in one place [mediawiki-config] - 10https://gerrit.wikimedia.org/r/331825 (owner: 10Dereckson) [19:51:20] (03Merged) 10jenkins-bot: Don't use hex escapes for non-ASCII characters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451004 (owner: 10PleaseStand) [19:52:32] 10Operations, 10ORES, 10Scoring-platform-team (Current): ORES uwsgi consumes a large amount of memory and CPU when shutting down (as part of a restart) - https://phabricator.wikimedia.org/T242705 (10Halfak) From https://uwsgi-docs.readthedocs.io/en/latest/ThingsToKnow.html > To shutdown uWSGI use SIGINT or S... [19:52:36] (03PS1) 10Cmjohnson: Adding production dns mw13[49-84].eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/571785 (https://phabricator.wikimedia.org/T236437) [19:53:27] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Don't use hex escapes in the name of cawiki (duration: 01m 04s) [19:53:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:10] (03CR) 10Cmjohnson: [C: 03+2] Adding production dns mw13[49-84].eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/571785 (https://phabricator.wikimedia.org/T236437) (owner: 10Cmjohnson) [19:56:47] 10Operations, 10ORES, 10Scoring-platform-team (Current): ORES uwsgi consumes a large amount of memory and CPU when shutting down (as part of a restart) - https://phabricator.wikimedia.org/T242705 (10Halfak) I tried removing the `--die-on-term` option and I get the same behavior. 
[19:58:32] 10Operations, 10serviceops, 10Core Platform Team Workboards (Clinic Duty Team), 10Performance-Team (Radar), 10Wikimedia-production-error: Wiki diffs take over 15s to load - https://phabricator.wikimedia.org/T244058 (10WDoranWMF) [19:59:11] (03PS5) 10Jforrester: Add tests for wg(Canonical)Server [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467445 (https://phabricator.wikimedia.org/T207073) (owner: 10Urbanecm) [19:59:13] (03PS1) 10Jforrester: Set wgServer to protocol-relative for Wikitech and Test Wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571786 [19:59:28] 10Operations, 10serviceops, 10Core Platform Team Workboards (Clinic Duty Team), 10Performance-Team (Radar), 10Wikimedia-production-error: Wiki diffs take over 15s to load - https://phabricator.wikimedia.org/T244058 (10WDoranWMF) a:03Krinkle [19:59:57] 10Operations, 10MediaWiki-General, 10serviceops, 10Core Platform Team Workboards (Clinic Duty Team): siteinfo api calls should be cached for N minutes on the caching layer - https://phabricator.wikimedia.org/T244204 (10WDoranWMF) a:03Pchelolo [20:00:05] marxarelli and James_F: (Dis)respected human, time to deploy Mediawiki train - American Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200212T2000). Please do the needful. [20:00:23] Train'll be in half an hour or so, folks. [20:01:06] (03CR) 10Jforrester: [C: 03+2] "Re-done on the new world. LGTM." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467445 (https://phabricator.wikimedia.org/T207073) (owner: 10Urbanecm) [20:01:37] (03CR) 10Jforrester: "This is probably fine, but leaving to Andrew in case I'm missing something." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/571786 (owner: 10Jforrester) [20:02:18] (03Merged) 10jenkins-bot: Add tests for wg(Canonical)Server [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467445 (https://phabricator.wikimedia.org/T207073) (owner: 10Urbanecm) [20:04:33] (03Abandoned) 10Jforrester: Make it explicit that extension1 contains Echo databases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410079 (owner: 10Niharika29) [20:04:55] (03CR) 10Andrew Bogott: [C: 03+1] "I don't know any reason why this would be bad, but please ping me when you merge it." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571786 (owner: 10Jforrester) [20:07:04] (03PS3) 10Jforrester: Enable testing LanguageConverter in sandboxes on deploymentwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438079 (https://phabricator.wikimedia.org/T143628) (owner: 10C. Scott Ananian) [20:07:54] (03CR) 10Jforrester: "Rebased. Is this still wanted?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438079 (https://phabricator.wikimedia.org/T143628) (owner: 10C. Scott Ananian) [20:08:28] (03CR) 10Jforrester: "Can we abandon this? It's been half a year." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/533023 (https://phabricator.wikimedia.org/T231446) (owner: 10Mathew.onipe) [20:08:35] (03PS1) 10Aaron Schulz: Make deployment-prep $wgWANObjectCaches better match production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571792 [20:08:38] (03PS1) 10Aaron Schulz: Set "coalesceKeys" for deployment-prep [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571793 [20:09:20] (03PS3) 10Jforrester: wikidata: post edit constraint jobs on 50% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484633 (https://phabricator.wikimedia.org/T204031) (owner: 10Addshore) [20:09:22] (03PS4) 10Jforrester: wikidata: post edit constraint jobs on 100% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484635 (https://phabricator.wikimedia.org/T204031) (owner: 10Addshore) [20:10:21] (03Abandoned) 10Jforrester: Beta Cluster: Enable CSP in report-only mode on all BC wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/451089 (owner: 10Jforrester) [20:10:37] (03CR) 10jerkins-bot: [V: 04-1] Set "coalesceKeys" for deployment-prep [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571793 (owner: 10Aaron Schulz) [20:11:11] (03PS5) 10Jforrester: Create a FeaturedFeed for the News on mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439436 (https://phabricator.wikimedia.org/T165773) (owner: 10Aklapper) [20:11:37] (03CR) 10Jforrester: "Is this still wanted?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439436 (https://phabricator.wikimedia.org/T165773) (owner: 10Aklapper) [20:12:10] (03PS2) 10Jforrester: Set $wgUploadNavigationUrl for few wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364121 (https://phabricator.wikimedia.org/T170083) (owner: 10Framawiki) [20:12:25] (03CR) 10Jforrester: "Is this still wanted?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/364121 (https://phabricator.wikimedia.org/T170083) (owner: 10Framawiki) [20:13:20] (03CR) 10Jforrester: "Is this still wanted?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416973 (owner: 10KartikMistry) [20:13:28] (03CR) 10Jforrester: "Is this still wanted?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345179 (https://phabricator.wikimedia.org/T160887) (owner: 10Daniel Kinzler) [20:13:47] (03CR) 10Jforrester: "Is this still wanted?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354608 (https://phabricator.wikimedia.org/T163107) (owner: 10Nemo bis) [20:13:54] 10Operations, 10Security-Team, 10User-jbond: Determine any impacts to SRE from OIT's planned move to JumpCloud for LDAP - https://phabricator.wikimedia.org/T244792 (10HMarcus) @MoritzMuehlenhoff I received an answer from Jumpcloud's support engineers: "It is important to note that we support bind, search, a... [20:14:18] (03Abandoned) 10Jforrester: Enable ValidationStatistics log for FlaggedRevs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354615 (https://phabricator.wikimedia.org/T163107) (owner: 10Nemo bis) [20:15:53] (03Abandoned) 10Jforrester: Use Title_blacklist as a local page on meta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330328 (https://phabricator.wikimedia.org/T154112) (owner: 10Gergő Tisza) [20:17:19] (03Abandoned) 10Jforrester: Enable mapframe on uawikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352622 (https://phabricator.wikimedia.org/T164574) (owner: 10DatGuy) [20:19:14] 10Operations, 10ops-eqsin, 10ops-ulsfo: snag asset tags from ulsfo, ship some to eqsin - https://phabricator.wikimedia.org/T245056 (10RobH) p:05Triage→03High [20:21:29] (03PS1) 10RhinosF1: Insert the description of the change. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/571795 [20:21:31] (03CR) 10Jforrester: "Five years later, this is still in an odd set-up:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228618 (https://phabricator.wikimedia.org/T90612) (owner: 10Legoktm) [20:21:34] (03CR) 10Jforrester: "Is this going anywhere? It's been four years…" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258943 (https://phabricator.wikimedia.org/T17583) (owner: 10TTO) [20:21:37] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10cloud-services-team (Hardware): Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T241884 (10JHedden) Firmware upgrades I applied: Dell PERC H730/H730P/H830/FD33xS/FD33xD Mini/Adapter RAID Controllers firmware version 25.5.6.0009 https://ww... [20:27:40] (03PS7) 10Ottomata: eventgate - Use main_app.name as primary resource grouping, not wmf.releasename [deployment-charts] - 10https://gerrit.wikimedia.org/r/564052 (https://phabricator.wikimedia.org/T242861) [20:28:55] (03PS2) 10Krinkle: Disable wgLegacyJavaScriptGlobals on svwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558621 (https://phabricator.wikimedia.org/T72470) [20:29:45] jouncebot: next [20:29:45] In 0 hour(s) and 30 minute(s): Services – Graphoid / Parsoid / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200212T2100) [20:29:49] jouncebot: now [20:29:49] For the next 0 hour(s) and 30 minute(s): Mediawiki train - American Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200212T2000) [20:30:05] (03Abandoned) 10RhinosF1: Insert the description of the change. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571795 (owner: 10RhinosF1) [20:30:07] freenode maintenance starting soon, so here's a heads-up [20:30:24] James_F: you commanding the train today? [20:30:58] Krinkle: No, I'm second to marxarelli. [20:31:10] Krinkle: If it's urgent, go ahead. 
[20:31:14] "nickserv, chanserv and related services" [20:31:34] I didn't see a date, just "soon" [20:31:37] (03CR) 10Krinkle: "fwiw, once this lands - best have a unit test to enforce it as set expicitly for all wikis. We don't want this to rely on run-time inherit" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228618 (https://phabricator.wikimedia.org/T90612) (owner: 10Legoktm) [20:31:42] Krinkle: Also, mild reminder re. the stack at https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/558774 [20:31:55] Yeah, was gonna do that and https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/558621/ [20:32:00] today if time permits [20:32:03] Go for it. [20:32:05] ok [20:32:11] (03CR) 10Krinkle: [C: 03+2] Disable wgLegacyJavaScriptGlobals on svwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558621 (https://phabricator.wikimedia.org/T72470) (owner: 10Krinkle) [20:34:20] (03CR) 10Krinkle: "... which was done in https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/467445/5 awesome" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228618 (https://phabricator.wikimedia.org/T90612) (owner: 10Legoktm) [20:34:49] Krinkle: You're welcome. ;-) [20:35:07] But that doesn't actually assert that wgServer is set, just that if it is set it has particular values. [20:35:26] nickserv/chanserv maintenance in 10 mins fyi [20:35:27] ah you're right [20:35:29] 10Operations, 10ORES, 10Scoring-platform-team (Current): ORES uwsgi consumes a large amount of memory and CPU when shutting down (as part of a restart) - https://phabricator.wikimedia.org/T242705 (10Halfak) Here's an `strace` of one of the child processes that goes berserk: ` strace: Process 25405 attached... [20:35:32] Because 'default' values exist for most of those. [20:35:39] should be short, but if we want an op in the channel they ought to do it now [20:35:40] Krinkle: I'm working on it. 
[20:35:45] (03Merged) 10jenkins-bot: Disable wgLegacyJavaScriptGlobals on svwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558621 (https://phabricator.wikimedia.org/T72470) (owner: 10Krinkle) [20:35:56] apergos: I'm not a chanop here, sorry. [20:36:28] just passing along the info [20:36:35] somehow I am? :) [20:36:35] Thanks, cdanis. [20:36:51] Clearly you're a lot more trustworthy than us peons in RelEng. ;-) [20:37:07] (03PS8) 10Ottomata: eventgate - Use main_app.name as primary resource grouping, not wmf.releasename [deployment-charts] - 10https://gerrit.wikimedia.org/r/564052 (https://phabricator.wikimedia.org/T242861) [20:38:55] !log krinkle@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T72470 - Disable wgLegacyJavaScriptGlobals on svwiki (duration: 01m 08s) [20:39:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:09] T72470: Remove legacy javascript globals - https://phabricator.wikimedia.org/T72470 [20:40:15] (03PS2) 10Krinkle: varnish: Remove duplicate 'Content-Type: text/html' statement [puppet] - 10https://gerrit.wikimedia.org/r/558752 [20:41:26] (03PS3) 10Krinkle: CommonSettings.php: Remove 'SERVER_SOFTWARE' override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558763 (https://phabricator.wikimedia.org/T232563) [20:41:29] (03CR) 10Krinkle: [C: 03+2] CommonSettings.php: Remove 'SERVER_SOFTWARE' override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558763 (https://phabricator.wikimedia.org/T232563) (owner: 10Krinkle) [20:42:01] (03PS2) 10Krinkle: etcd: Set globals explicitly in CommonSettings instead of etcd.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558774 [20:42:44] * Krinkle krinkle@deploy1001$ touch /var/lock/scap-global-lock [20:43:47] (03PS4) 10Krinkle: CommonSettings.php: Remove 'SERVER_SOFTWARE' override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558763 (https://phabricator.wikimedia.org/T232563) [20:43:57] (03CR) 10Krinkle: [C: 03+2] 
CommonSettings.php: Remove 'SERVER_SOFTWARE' override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558763 (https://phabricator.wikimedia.org/T232563) (owner: 10Krinkle) [20:44:05] James_F: o/ [20:44:39] Hey marxarelli. Krinkle is doing some clean-up deploys as the train wasn't rolling. [20:44:44] (03PS1) 10Cmjohnson: Adding macs mw servers. Missing mw1351,79-81 for bad passwd [puppet] - 10https://gerrit.wikimedia.org/r/571797 (https://phabricator.wikimedia.org/T236437) [20:44:50] no prob [20:44:52] Should be good to go in a few mins. [20:44:58] (03Merged) 10jenkins-bot: CommonSettings.php: Remove 'SERVER_SOFTWARE' override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558763 (https://phabricator.wikimedia.org/T232563) (owner: 10Krinkle) [20:45:04] i will make a coffee [20:45:23] 10Operations, 10ops-eqiad, 10serviceops, 10Patch-For-Review: (Need By Dec 20) rack/setup/install mw13[49-84].eqiad.wmnet - https://phabricator.wikimedia.org/T236437 (10Cmjohnson) [20:46:51] 10Operations, 10ops-eqiad, 10serviceops, 10Patch-For-Review: (Need By Dec 20) rack/setup/install mw13[49-84].eqiad.wmnet - https://phabricator.wikimedia.org/T236437 (10Cmjohnson) @Jclark-ctr when you're at data center later today could you look into these please. mw1351 and mw1381 have the wrong password... [20:47:16] * Krinkle notices deprecation warnings in prod from SkinTemplate and RL/LessModule about wgLogoHD [20:47:32] from officewiki and a bunch of misc small wikis [20:49:23] James_F: marxarelli: I'll do the rest of the etcd stack later. 
Good to go after the current sync-file is done [20:49:35] (03PS9) 10Ottomata: eventgate - Use main_app.name as primary resource grouping, not wmf.releasename [deployment-charts] - 10https://gerrit.wikimedia.org/r/564052 (https://phabricator.wikimedia.org/T242861) [20:49:40] !log krinkle@deploy1001 Synchronized wmf-config/CommonSettings.php: T232563 - Remove SERVER_SOFTWARE override (duration: 01m 03s) [20:49:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:45] T232563: Drop IE6 and IE7 basic compatibility and security support - https://phabricator.wikimedia.org/T232563 [20:49:48] (03PS2) 10Krinkle: etcd: Set $wmfEtcdLastModifiedIndex from CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558775 [20:50:21] Cool. [20:50:36] Krinkle: ack-thanks [20:51:32] (03PS3) 10Krinkle: etcd: Add $etcdHost parameter to wmfSetupEtcd() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558776 [20:52:29] (03PS10) 10Ottomata: eventgate - Use main_app.name as primary resource grouping, not wmf.releasename [deployment-charts] - 10https://gerrit.wikimedia.org/r/564052 (https://phabricator.wikimedia.org/T242861) [20:52:37] (03PS2) 10Krinkle: etcd: Set wmfSetupEtcd($etcdHost) from CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558777 [20:53:05] (03PS2) 10Cmjohnson: Adding macs mw servers. Missing mw1351,79-81 for bad passwd [puppet] - 10https://gerrit.wikimedia.org/r/571797 (https://phabricator.wikimedia.org/T236437) [20:53:54] (03PS3) 10Cmjohnson: Adding macs mw servers. 
Missing mw1351,79-81 for bad passwd [puppet] - 10https://gerrit.wikimedia.org/r/571797 (https://phabricator.wikimedia.org/T236437) [20:54:11] (03PS11) 10Ottomata: eventgate - Use main_app.name as primary resource grouping, not wmf.releasename [deployment-charts] - 10https://gerrit.wikimedia.org/r/564052 (https://phabricator.wikimedia.org/T242861) [20:54:54] (03PS3) 10Krinkle: etcd: Set globals explicitly in CommonSettings instead of etcd.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558774 [20:55:00] (03PS3) 10Krinkle: etcd: Set $wmfEtcdLastModifiedIndex from CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558775 [20:55:07] (03CR) 10Ottomata: [C: 03+2] eventgate - Use main_app.name as primary resource grouping, not wmf.releasename [deployment-charts] - 10https://gerrit.wikimedia.org/r/564052 (https://phabricator.wikimedia.org/T242861) (owner: 10Ottomata) [20:55:22] (03CR) 10Cmjohnson: [C: 03+2] Adding macs mw servers. Missing mw1351,79-81 for bad passwd [puppet] - 10https://gerrit.wikimedia.org/r/571797 (https://phabricator.wikimedia.org/T236437) (owner: 10Cmjohnson) [20:55:59] (03PS1) 10Dduvall: group1 wikis to 1.35.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571798 [20:56:01] (03CR) 10Dduvall: [C: 03+2] group1 wikis to 1.35.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571798 (owner: 10Dduvall) [20:56:23] (03PS4) 10Krinkle: etcd: Add $etcdHost parameter to wmfSetupEtcd() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558776 [20:56:40] (03PS3) 10Krinkle: etcd: Set wmfSetupEtcd($etcdHost) from CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558777 [20:56:50] (03PS4) 10Krinkle: etcd: Pass wmfSetupEtcd($etcdHost) from CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558777 [20:57:23] (03Merged) 10jenkins-bot: group1 wikis to 1.35.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571798 (owner: 10Dduvall) [20:57:50] James_F: 
"choo-choo" [20:58:16] Cool. [20:58:50] (03PS4) 10Zoranzoki21: Throttle rule for National Gallery of Canada Library and Archives edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570680 (https://phabricator.wikimedia.org/T244488) (owner: 10Superzerocool) [20:59:17] !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.35.0-wmf.19 [20:59:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:20] (03PS5) 10Zoranzoki21: Throttle rule for National Gallery of Canada Library and Archives edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570680 (https://phabricator.wikimedia.org/T244488) (owner: 10Superzerocool) [20:59:42] (03PS1) 10Krinkle: etcd: Use require_once for etcd.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571799 [21:00:04] cscott, arlolra, subbu, halfak, and accraze: #bothumor I � Unicode. All rise for Services – Graphoid / Parsoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200212T2100). [21:00:21] !log dduvall@deploy1001 Synchronized php: group1 wikis to 1.35.0-wmf.19 (duration: 01m 03s) [21:00:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:36] marxarelli: Seems OK so far. [21:01:19] James_F: same here [21:04:47] Over to service deploys, then. [21:05:04] (03PS1) 10Papaul: DHCP: Add MAC address entries for mw2291 to mw2309 [puppet] - 10https://gerrit.wikimedia.org/r/571800 (https://phabricator.wikimedia.org/T241852) [21:09:23] (03CR) 10Zoranzoki21: [C: 03+1] "Scheduled for Morning SWAT." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/570680 (https://phabricator.wikimedia.org/T244488) (owner: 10Superzerocool) [21:10:25] !log completed group1 to 1.35.0-wmf.19 [21:10:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:36] 10Operations, 10DBA, 10Growth-Team, 10StructuredDiscussions, 10WorkType-Maintenance: Setup separate logical External Store for Flow in production - https://phabricator.wikimedia.org/T107610 (10MMiller_WMF) 05Open→03Declined Given that it doesn't sound like this is needed. I'm declining the task. Pl... [21:11:37] (03CR) 10Thcipriani: "recheck" [homer/mock-private] - 10https://gerrit.wikimedia.org/r/571374 (owner: 10Legoktm) [21:12:53] (03PS1) 10Ottomata: eventgate - Name ConfigMaps using full wmf.releasename [deployment-charts] - 10https://gerrit.wikimedia.org/r/571801 (https://phabricator.wikimedia.org/T242861) [21:13:57] (03CR) 10Ottomata: [C: 03+2] eventgate - Name ConfigMaps using full wmf.releasename [deployment-charts] - 10https://gerrit.wikimedia.org/r/571801 (https://phabricator.wikimedia.org/T242861) (owner: 10Ottomata) [21:17:32] (03PS1) 10Jforrester: Test that every wiki is in exactly one of our wiki families [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571802 [21:17:34] (03PS1) 10Jforrester: Test that every wiki has wgServer, wgCanonicalServer, and wgLanguageCode set or inherited [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571803 [21:17:42] (03PS1) 10Jforrester: Update composer sub-dependencies [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571804 [21:18:40] Krinkle: ^^ A little messy, but they work. Note that they don't assert that metawiki has wgLanguageCode explicitly set; I guess we could assert that it's not set to ''? [21:18:48] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics' for release 'production' . 
[21:18:48] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics' for release 'canary' . [21:18:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:07] freenode maintenance over [21:19:34] Hmm. A handful of "Cache key contains characters that are not allowed: `P180_11`" errors from Wikidata. [21:19:52] 10Operations, 10conftool, 10Wikimedia-Incident: Create an automated alert for 'too many nodes depooled from a service' - https://phabricator.wikimedia.org/T245058 (10CDanis) [21:20:01] what sort of cache key is that hrm [21:20:36] 10Operations, 10conftool, 10Wikimedia-Incident: depool / confctl commands should print warnings or errors if too many nodes from that service are already depooled - https://phabricator.wikimedia.org/T245059 (10CDanis) [21:21:20] 10Operations, 10Pybal, 10Wikimedia-Incident: Pybal should reject a confctl configuration that indicates only one cp-text is pooled - https://phabricator.wikimedia.org/T245060 (10CDanis) [21:25:26] (03CR) 10Jforrester: "recheck" [homer/mock-private] - 10https://gerrit.wikimedia.org/r/540622 (owner: 10Ayounsi) [21:25:50] (03PS2) 10Herron: logstash: output logs ingested by deprecated inputs to kafka-logging [puppet] - 10https://gerrit.wikimedia.org/r/571548 (https://phabricator.wikimedia.org/T234854) [21:27:37] P180 is depicts, if that's helpful (so likely related to SDC or mediawikivision-something-something [21:27:40] ) [21:28:30] (03CR) 10Herron: logstash: output logs ingested by deprecated inputs to kafka-logging (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/571548 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [21:29:37] (03PS3) 10Herron: logstash: output logs ingested by deprecated inputs to kafka-logging [puppet] - 10https://gerrit.wikimedia.org/r/571548 (https://phabricator.wikimedia.org/T234854) [21:30:24] *machinevision 
[21:30:39] Or that, yeah. [21:34:08] 10Operations, 10Discovery, 10Traffic, 10Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (10Bugreporter) I think increase the factor will not make thing better, it only increase the oscillating p... [21:34:09] (03PS2) 10Herron: logstash::collector7 ingest deprecated logs from kafka [puppet] - 10https://gerrit.wikimedia.org/r/571554 (https://phabricator.wikimedia.org/T234854) [21:37:25] (03CR) 10Thcipriani: "recheck" [homer/mock-private] - 10https://gerrit.wikimedia.org/r/571374 (owner: 10Legoktm) [21:37:54] (03PS1) 10Herron: logstash: remove defalut value from kafka input type field [puppet] - 10https://gerrit.wikimedia.org/r/571813 (https://phabricator.wikimedia.org/T234854) [21:39:55] (03PS1) 10Ottomata: Update eventgate/README.md [deployment-charts] - 10https://gerrit.wikimedia.org/r/571814 (https://phabricator.wikimedia.org/T242861) [21:39:57] (03PS1) 10Ottomata: eventgate - fix names of volume mounts [deployment-charts] - 10https://gerrit.wikimedia.org/r/571815 (https://phabricator.wikimedia.org/T242861) [21:40:27] (03CR) 10Ottomata: [C: 03+2] Update eventgate/README.md [deployment-charts] - 10https://gerrit.wikimedia.org/r/571814 (https://phabricator.wikimedia.org/T242861) (owner: 10Ottomata) [21:40:33] 10Operations, 10Discovery, 10Traffic, 10Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (10Bugreporter) If the rate of edit Query Updater can handle is a constant, changing the factor will not a... 
[21:40:58] (03CR) 10Ottomata: [C: 03+2] eventgate - fix names of volume mounts [deployment-charts] - 10https://gerrit.wikimedia.org/r/571815 (https://phabricator.wikimedia.org/T242861) (owner: 10Ottomata) [21:41:23] (03CR) 10Herron: logstash::collector7 ingest deprecated logs from kafka (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/571554 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [21:43:11] 10Operations, 10ops-ulsfo: audit cable labels @ ulsfo - https://phabricator.wikimedia.org/T238856 (10RobH) 05Open→03Resolved 4 missing ids, all but 1 had labels already applied, and 1 had new labels applied. The duplicate id was 1240, not 1241 (dupe). [21:47:44] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics' for release 'canary' . [21:47:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:01] (03CR) 10Krinkle: Test that every wiki is in exactly one of our wiki families (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571802 (owner: 10Jforrester) [21:49:57] (03CR) 10Jforrester: Test that every wiki is in exactly one of our wiki families (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571802 (owner: 10Jforrester) [21:51:00] !log otto@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' . [21:51:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:26] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics' for release 'canary' . 
[21:51:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:52:07] PROBLEM - dhclient process on cumin1001 is CRITICAL: connect to address 10.64.32.25 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [21:52:11] PROBLEM - Disk space on cumin1001 is CRITICAL: connect to address 10.64.32.25 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cumin1001&var-datasource=eqiad+prometheus/ops [21:52:15] PROBLEM - Check the NTP synchronisation status of timesyncd on cumin1001 is CRITICAL: connect to address 10.64.32.25 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/NTP [21:52:17] PROBLEM - DPKG on cumin1001 is CRITICAL: connect to address 10.64.32.25 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [21:52:57] PROBLEM - Check the last execution of git_pull_httpbb on cumin1001 is CRITICAL: connect to address 10.64.32.25 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [21:53:20] who was running jq on cumin1001? 
[21:53:24] !log restart nagios-nrpe-service on cumin1001 after it had oomed [21:53:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:53:33] 10Operations, 10Wikimedia-Mailing-lists: Request for creation: les sans pagEs Mailing List - https://phabricator.wikimedia.org/T245066 (10Anthere) [21:54:03] RECOVERY - dhclient process on cumin1001 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [21:54:03] 10Operations, 10Wikimedia-Mailing-lists: Request for creation: les sans pagEs Mailing List - https://phabricator.wikimedia.org/T245066 (10Anthere) [21:54:07] RECOVERY - Disk space on cumin1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cumin1001&var-datasource=eqiad+prometheus/ops [21:54:11] idk [21:54:13] RECOVERY - DPKG on cumin1001 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [21:54:36] just oomed [21:56:33] indeed [21:56:46] 10Operations, 10Wikimedia-Mailing-lists: Request for creation: les sans pagEs Mailing List - https://phabricator.wikimedia.org/T245066 (10Reedy) [21:57:12] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics' for release 'canary' . [21:57:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:19] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics' for release 'production' . [21:58:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:06] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics' for release 'production' . 
[21:59:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:23] (03CR) 10Krinkle: Test that every wiki is in exactly one of our wiki families (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571802 (owner: 10Jforrester) [22:01:46] 10Operations, 10Discovery, 10Traffic, 10Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (10ArthurPSmith) @Bugreporter > I think increase the factor will not make thing better, it only increase... [22:02:10] (03CR) 10Thcipriani: "recheck" [homer/mock-private] - 10https://gerrit.wikimedia.org/r/571374 (owner: 10Legoktm) [22:02:22] (03CR) 10jerkins-bot: [V: 04-1] [DNM] testing jenkins [homer/mock-private] - 10https://gerrit.wikimedia.org/r/571374 (owner: 10Legoktm) [22:03:51] RECOVERY - Check the last execution of git_pull_httpbb on cumin1001 is OK: OK: Status of the systemd unit git_pull_httpbb https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [22:05:25] (03CR) 10Krinkle: Test that every wiki has wgServer, wgCanonicalServer, and wgLanguageCode set or inherited (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571803 (owner: 10Jforrester) [22:06:04] (03CR) 10Krinkle: [C: 03+1] "Good start either way, non-essential CR." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/571803 (owner: 10Jforrester) [22:07:42] (03PS1) 10Andrew Bogott: Openstack nova: update cloud-init userdata [puppet] - 10https://gerrit.wikimedia.org/r/571817 (https://phabricator.wikimedia.org/T181375) [22:08:31] (03Abandoned) 10Volans: [DNM] testing jenkins [homer/mock-private] - 10https://gerrit.wikimedia.org/r/571374 (owner: 10Legoktm) [22:08:47] (03CR) 10Volans: "recheck" [homer/mock-private] - 10https://gerrit.wikimedia.org/r/571367 (https://phabricator.wikimedia.org/T244690) (owner: 10Volans) [22:09:18] (03CR) 10Volans: [C: 03+2] config: add tox env to lint YAML files [homer/mock-private] - 10https://gerrit.wikimedia.org/r/571367 (https://phabricator.wikimedia.org/T244690) (owner: 10Volans) [22:09:37] (03Merged) 10jenkins-bot: config: add tox env to lint YAML files [homer/mock-private] - 10https://gerrit.wikimedia.org/r/571367 (https://phabricator.wikimedia.org/T244690) (owner: 10Volans) [22:14:53] (03CR) 10Jforrester: Test that every wiki has wgServer, wgCanonicalServer, and wgLanguageCode set or inherited (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571803 (owner: 10Jforrester) [22:17:50] (03CR) 10Andrew Bogott: [C: 03+2] Openstack nova: update cloud-init userdata [puppet] - 10https://gerrit.wikimedia.org/r/571817 (https://phabricator.wikimedia.org/T181375) (owner: 10Andrew Bogott) [22:19:25] Request from 174.238.146.117 via cp1081.eqiad.wmnet, ATS/8.0.5 [22:19:25] Error: 504, Connection Timed Out at 2020-02-12 22:15:23 GMT [22:19:25] --- I keep getting these errors when attempting to load https://en.wikipedia.org/wiki/Special:RecentChangesLinked/Category:Living_people --- this has been an ongoing issue for at least a month now, just thought I'd mention it here if no one else has brought the issue up [22:19:35] 10Operations: Script to point SRE local machine traffic to another LB - https://phabricator.wikimedia.org/T244761 (10RLazarus) Maybe separate from a script, is there any 
way we can do this via DNS? Something like `grafana.cp-ulsfo.wikimedia.org` to specify a site rather than accepting geoip routing. @CDanis poi... [22:19:47] (the issue is intermittent) [22:23:09] RECOVERY - Check the NTP synchronisation status of timesyncd on cumin1001 is OK: OK: synced at Wed 2020-02-12 22:23:07 UTC. https://wikitech.wikimedia.org/wiki/NTP [22:33:25] 10Operations, 10ops-codfw, 10ops-eqiad, 10ops-eqsin, and 2 others: Audit & update spares part tracking for all sites - https://phabricator.wikimedia.org/T243450 (10RobH) [22:33:36] 10Operations, 10ops-eqsin, 10ops-ulsfo: snag asset tags from ulsfo, ship some to eqsin - https://phabricator.wikimedia.org/T245056 (10RobH) WMF7236 - WMF7247 snagged to mail to eqsin [22:36:53] (03CR) 10EBernhardson: [C: 03+1] "Looks good, verified named models were uploaded. Should be good to ship in https://wikitech.wikimedia.org/wiki/SWAT_deploys" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/559614 (https://phabricator.wikimedia.org/T219534) (owner: 10Mstyles) [22:39:55] 10Operations, 10Toolforge: mirrors.wikimedia.org libgtk-3-common all 3.22.11-1 hash mismatch - https://phabricator.wikimedia.org/T245071 (10bd808) [22:39:55] RECOVERY - rpki grafana alert on icinga1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. 
https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [22:50:36] (03PS2) 10Jforrester: Test that every wiki has wgServer, wgCanonicalServer, and wgLanguageCode set or inherited [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571803 [22:50:38] (03PS2) 10Jforrester: Update composer sub-dependencies [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571804 [22:51:02] (03Abandoned) 10Jforrester: Test that every wiki is in exactly one of our wiki families [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571802 (owner: 10Jforrester) [22:51:58] (03PS2) 10Papaul: DHCP: Add MAC address entries for mw2291 to mw2309 [puppet] - 10https://gerrit.wikimedia.org/r/571800 (https://phabricator.wikimedia.org/T241852) [23:02:54] (03CR) 10Papaul: [C: 03+2] DHCP: Add MAC address entries for mw2291 to mw2309 [puppet] - 10https://gerrit.wikimedia.org/r/571800 (https://phabricator.wikimedia.org/T241852) (owner: 10Papaul) [23:03:07] (03PS3) 10Papaul: DHCP: Add MAC address entries for mw2291 to mw2309 [puppet] - 10https://gerrit.wikimedia.org/r/571800 (https://phabricator.wikimedia.org/T241852) [23:03:16] (03CR) 10Papaul: [V: 03+2 C: 03+2] DHCP: Add MAC address entries for mw2291 to mw2309 [puppet] - 10https://gerrit.wikimedia.org/r/571800 (https://phabricator.wikimedia.org/T241852) (owner: 10Papaul) [23:03:36] (03PS1) 10Bstorm: cloudstore: remove dependency on bind mounts [puppet] - 10https://gerrit.wikimedia.org/r/571821 (https://phabricator.wikimedia.org/T224582) [23:06:12] (03CR) 10Bstorm: [C: 04-1] "This is not even close to a "safe" patch. It is deployed on VMs right now for both stretch and jessie (and works great so far). However, " [puppet] - 10https://gerrit.wikimedia.org/r/571821 (https://phabricator.wikimedia.org/T224582) (owner: 10Bstorm) [23:08:24] (03CR) 10Bstorm: [C: 04-1] "This is targeted only at the primary cluster for now (and the VMs). 
This should be something we can generalize to the secondary cluster a" [puppet] - 10https://gerrit.wikimedia.org/r/571821 (https://phabricator.wikimedia.org/T224582) (owner: 10Bstorm) [23:11:28] !log deactivate BGP to office's router1 while it's on maintenance [23:11:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:14:19] (03PS1) 10BryanDavis: nfs-mounts: expose the scratch mount in the language project [puppet] - 10https://gerrit.wikimedia.org/r/571827 (https://phabricator.wikimedia.org/T242187) [23:19:03] (03CR) 10Bstorm: [C: 04-1] "PCC for nfs servers on the primary cluster: https://puppet-compiler.wmflabs.org/compiler1002/20776/" [puppet] - 10https://gerrit.wikimedia.org/r/571821 (https://phabricator.wikimedia.org/T224582) (owner: 10Bstorm) [23:23:49] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2291.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [23:23:54] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2291.codfw.wmnet'] ` Of which those **FAILED**: ` ['mw2291.codfw.wmnet'] ` [23:26:04] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2291.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... 
[23:29:47] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2292.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [23:29:51] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2292.codfw.wmnet'] ` Of which those **FAILED**: ` ['mw2292.codfw.wmnet'] ` [23:30:17] (03PS1) 10Bstorm: labstore: introduce a firewall for the old primary NFS cluster [puppet] - 10https://gerrit.wikimedia.org/r/571832 [23:31:31] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2292.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [23:39:20] (03PS2) 10Bstorm: nfs-mounts: expose the scratch mount in the language project [puppet] - 10https://gerrit.wikimedia.org/r/571827 (https://phabricator.wikimedia.org/T242187) (owner: 10BryanDavis) [23:40:31] 10Operations, 10ops-eqiad, 10serviceops: (Need By Dec 20) rack/setup/install mw13[49-84].eqiad.wmnet - https://phabricator.wikimedia.org/T236437 (10wiki_willy) Latest update: there were some settings that need to be redone, so @Cmjohnson will get these fixed tomorrow, with Thursday as the new ETA. Apologies... 
[23:41:07] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [23:41:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:41:24] (CR) Bstorm: [C: -1] cloudstore: remove dependency on bind mounts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/571821 (https://phabricator.wikimedia.org/T224582) (owner: Bstorm) [23:41:30] (PS1) VolkerE: Remove unnecessary id from wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/571836 [23:43:13] (CR) Jforrester: "Should this have a title set (to "WikipediA®"?)?" [mediawiki-config] - https://gerrit.wikimedia.org/r/571836 (owner: VolkerE) [23:43:21] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [23:43:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:43:50] (CR) Bstorm: [C: -1] cloudstore: remove dependency on bind mounts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/571821 (https://phabricator.wikimedia.org/T224582) (owner: Bstorm) [23:44:09] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [23:44:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:45:41] (CR) Bstorm: [C: +2] nfs-mounts: expose the scratch mount in the language project [puppet] - https://gerrit.wikimedia.org/r/571827 (https://phabricator.wikimedia.org/T242187) (owner: BryanDavis) [23:46:24] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [23:46:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:47:39] (PS2) Bstorm: cloudstore: remove dependency on bind mounts [puppet] - https://gerrit.wikimedia.org/r/571821 (https://phabricator.wikimedia.org/T224582) [23:47:54] (PS2) Bstorm: labstore: introduce a firewall for the old primary NFS cluster [puppet] - https://gerrit.wikimedia.org/r/571832 [23:50:07] Operations, ops-codfw, serviceops: rack/setup/install new codfw mw systems - 
https://phabricator.wikimedia.org/T241852 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2291.codfw.wmnet'] ` and were **ALL** successful. [23:51:07] Operations, ops-codfw, serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2292.codfw.wmnet'] ` and were **ALL** successful. [23:51:19] (CR) Jforrester: "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/526255 (https://phabricator.wikimedia.org/T227734) (owner: Jforrester) [23:51:22] Operations, ops-codfw, serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2293.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [23:51:31] (PS1) VolkerE: Fix latin Wikipedia (VICIPÆDIA) wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/571838 (https://phabricator.wikimedia.org/T240728) [23:51:49] (CR) Jforrester: "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/526255 (https://phabricator.wikimedia.org/T227734) (owner: Jforrester) [23:53:08] (CR) Bstorm: [C: -1] "For reviewers: the primary NFS server in the cloudstore project where this is live (or was if it doesn't get blown away by something) is c" [puppet] - https://gerrit.wikimedia.org/r/571821 (https://phabricator.wikimedia.org/T224582) (owner: Bstorm) [23:53:31] Operations, ops-codfw, serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2294.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020...
[23:54:00] Operations, netops: Upgrade routers - https://phabricator.wikimedia.org/T243080 (ayounsi) cr1-eqsin is back to normal, next step is to plan esams. [23:55:26] (PS2) VolkerE: Fix latin Wikipedia (VICIPÆDIA) wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/571838 (https://phabricator.wikimedia.org/T240728) [23:58:26] Operations, MediaWiki-Revision-backend: Compress data at external storage - https://phabricator.wikimedia.org/T106386 (Tgr)